Binning with multi-dimensional bins

Warning

This post is a proof-of-concept for discussion. Expect APIs to change to enable this use case.

Here we explore a binning problem where the bins are multidimensional (see this xhistogram issue).

One such multi-dimensional binning application is the ranked probability score (RPS) we use in xskillscore.rps, where we want to know how many forecasts fell into which bins. The bins are often defined as terciles of the forecast distribution, and the tercile edges (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66], dim='time')) depend on lon and lat.
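As a sketch of that use case (with a made-up forecast array; the names here are illustrative, not from xskillscore), the tercile edges computed per grid point are themselves multi-dimensional:

```python
import numpy as np
import xarray as xr

# Hypothetical forecast with lon/lat/time dims (illustrative data only)
forecast = xr.DataArray(
    np.random.default_rng(0).normal(size=(2, 2, 30)),
    dims=("lon", "lat", "time"),
    name="forecast",
)

# Tercile edges per (lon, lat) point: the bin edges vary with lon and lat
edges = forecast.quantile(q=[0.33, 0.66], dim="time")
print(edges.dims)  # ('quantile', 'lon', 'lat')
```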

import math

import numpy as np
import pandas as pd
import xarray as xr

import flox
import flox.xarray

Create test data

Data to be reduced

array = xr.DataArray(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]),
    dims=("space", "time"),
    name="array",
)
array
<xarray.DataArray 'array' (space: 4, time: 3)> Size: 96B
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
Dimensions without coordinates: space, time

Array to group by

by = xr.DataArray(
    np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7], [6, 7, 9]]),
    dims=("space", "time"),
    name="by",
)
by
<xarray.DataArray 'by' (space: 4, time: 3)> Size: 96B
array([[1, 2, 3],
       [3, 4, 5],
       [5, 6, 7],
       [6, 7, 9]])
Dimensions without coordinates: space, time

Multidimensional bins:

bins = by + 0.5
bins = xr.DataArray(
    np.concatenate([bins, bins[:, [-1]] + 1], axis=-1)[:, :-1].T,
    dims=("time", "nbins"),
    name="bins",
)
bins
<xarray.DataArray 'bins' (time: 3, nbins: 4)> Size: 96B
array([[1.5, 3.5, 5.5, 6.5],
       [2.5, 4.5, 6.5, 7.5],
       [3.5, 5.5, 7.5, 9.5]])
Dimensions without coordinates: time, nbins
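Each row of bins is a set of breaks for one timestep. For example, the row for time=0 defines three right-closed intervals:

```python
import pandas as pd

# Breaks taken from the bins row for time=0 above
breaks = [1.5, 3.5, 5.5, 6.5]
idx = pd.IntervalIndex.from_breaks(breaks)
# Three right-closed intervals: (1.5, 3.5], (3.5, 5.5], (5.5, 6.5]
print(idx)
```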

Concept

The key idea is that GroupBy is two steps:

  1. Factorize (a.k.a. “digitize”): convert the by data to a set of integer codes representing the bins.

  2. Apply the reduction.

We treat multi-dimensional binning as a slightly complicated factorization problem. Assume that the bins are a function of time. So we:

  1. generate a set of appropriate integer codes by:

    1. Loop over “time” and factorize the data appropriately.

    2. Add an offset to these codes so that “bin 0” for time=0 is different from “bin 0” for time=1.

  2. apply the groupby reduction to the “offset codes”

  3. reshape the output to the right shape

We will work at the xarray level, so it's easy to keep track of the different dimensions.
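In plain 1-D terms, the two steps look like this (a minimal sketch using pandas and numpy directly, not flox's actual implementation):

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
by = np.array([1, 2, 3, 4, 5])
bins = pd.IntervalIndex.from_breaks([1.5, 3.5, 5.5])

# Step 1: factorize -- each element of `by` gets an integer bin code;
# values outside all bins get the sentinel -1
codes = pd.cut(by, bins).codes  # [-1, 0, 0, 1, 1]

# Step 2: reduce per code, ignoring the -1 sentinel
mask = codes >= 0
sums = np.bincount(codes[mask], weights=values[mask], minlength=len(bins))
print(sums)  # [50. 90.]
```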

Factorizing

The core factorize_ function (which wraps pd.cut) only handles 1D bins, so we use xr.apply_ufunc to vectorize it for us.

factorize_loop_dim = "time"
def factorize_nd_bins_core(by, bins):
    group_idx, *_, props = flox.core.factorize_(
        (by,),
        axes=(-1,),
        expected_groups=(pd.IntervalIndex.from_breaks(bins),),
    )
    # Use -1 as the NaN sentinel value
    group_idx[props.nanmask] = -1
    return group_idx


codes = xr.apply_ufunc(
    factorize_nd_bins_core,
    by,
    bins,
    # TODO: avoid hardcoded dim names
    input_core_dims=[["space"], ["nbins"]],
    output_core_dims=[["space"]],
    vectorize=True,
)
codes
<xarray.DataArray (time: 3, space: 4)> Size: 96B
array([[-1,  0,  1,  2],
       [-1,  0,  1,  2],
       [-1,  0,  1,  2]])
Dimensions without coordinates: time, space

Offset the codes

These are integer codes appropriate for a single timestep.

We now add an offset that changes in time, to make sure “bin 0” for time=0 is different from “bin 0” for time=1 (taken from this StackOverflow thread).

N = math.prod([codes.sizes[d] for d in codes.dims if d != factorize_loop_dim])
offset = xr.DataArray(np.arange(codes.sizes[factorize_loop_dim]), dims=factorize_loop_dim)
# TODO: think about N-1 here
offset_codes = (codes + offset * (N - 1)).rename(by.name)
offset_codes.data[codes == -1] = -1
offset_codes
<xarray.DataArray 'by' (time: 3, space: 4)> Size: 96B
array([[-1,  0,  1,  2],
       [-1,  3,  4,  5],
       [-1,  6,  7,  8]])
Dimensions without coordinates: time, space
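This offsetting is essentially ravelling the (time, bin) pair into a single flat code, like np.ravel_multi_index does (a standalone sketch with made-up codes, where the stride is the number of valid bins per timestep):

```python
import numpy as np

nbins = 3  # valid bins per timestep, i.e. N - 1 above
codes = np.array([[0, 1, 2], [0, 1, 2]])  # (time, space), no -1s here
time_idx = np.arange(codes.shape[0])[:, None]

flat = codes + time_idx * nbins
print(flat)
# [[0 1 2]
#  [3 4 5]]

# Same result via ravel_multi_index over the (time, bin) grid
expected = np.ravel_multi_index(
    (np.broadcast_to(time_idx, codes.shape), codes), (codes.shape[0], nbins)
)
assert (flat == expected).all()
```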

Reduce

Now that we have appropriate codes, let’s apply the reduction.

interim = flox.xarray.xarray_reduce(
    array,
    offset_codes,
    func="sum",
    # We use RangeIndex to indicate that `-1` code can be safely ignored
    # (it indicates values outside the bins)
    # TODO: Avoid hardcoding 9 = sizes["time"] x (sizes["nbins"] - 1)
    expected_groups=pd.RangeIndex(9),
)
interim
<xarray.DataArray 'array' (by: 9)> Size: 72B
array([ 4,  7, 10,  5,  8, 11,  6,  9, 12])
Coordinates:
  * by       (by) int64 72B 0 1 2 3 4 5 6 7 8

Make final result

Now reshape that 1D result appropriately.

final = (
    interim.coarsen(by=3)
    # bin_number dimension is last, this makes sense since it is the core dimension
    # and we vectorize over the loop dims.
    # So the first (Nbins-1) elements are for the first index of the loop dim
    .construct({"by": (factorize_loop_dim, "bin_number")})
    .transpose(..., factorize_loop_dim)
    .drop_vars("by")
)
final
<xarray.DataArray 'array' (bin_number: 3, time: 3)> Size: 72B
array([[ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
Dimensions without coordinates: bin_number, time
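The coarsen/construct reshape above is equivalent to a plain numpy reshape of the 1-D result (shown here with the interim values from the output above):

```python
import numpy as np

interim = np.array([4, 7, 10, 5, 8, 11, 6, 9, 12])
# bin_number is the fastest-varying (core) dimension and time the slowest,
# so reshape to (time, bin_number) and transpose
final = interim.reshape(3, 3).T
print(final)
# [[ 4  5  6]
#  [ 7  8  9]
#  [10 11 12]]
```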

I think this is the expected answer.

array.isel(space=slice(1, None)).rename({"space": "bin_number"}).identical(final)
True

TODO

This could be extended to:

  1. handle multiple factorize_loop_dim dimensions

  2. avoid hardcoded dimension names in the apply_ufunc call for factorizing

  3. avoid hardcoded number of output elements in the xarray_reduce call

  4. somehow propagate the bin edges to the final output