Binning with multi-dimensional bins¶
Warning
This post is a proof-of-concept for discussion. Expect APIs to change to enable this use case.
Here we explore a binning problem where the bins are multidimensional (xhistogram issue)
One of such multi-dim bin applications is the ranked probability score rps we use in
xskillscore.rps
, where we want to know how many forecasts fell into which bins. Bins are often defined as terciles of the forecast distribution and the bins for these terciles (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')
) depend onlon
andlat
.
Create test data¶
Data to be reduced
array = xr.DataArray(
np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]),
dims=("space", "time"),
name="array",
)
array
<xarray.DataArray 'array' (space: 4, time: 3)> Size: 96B array([[ 1, 2, 3], [ 4, 5, 6], [ 7, 8, 9], [10, 11, 12]]) Dimensions without coordinates: space, time
Array to group by
by = xr.DataArray(
np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7], [6, 7, 9]]),
dims=("space", "time"),
name="by",
)
by
<xarray.DataArray 'by' (space: 4, time: 3)> Size: 96B array([[1, 2, 3], [3, 4, 5], [5, 6, 7], [6, 7, 9]]) Dimensions without coordinates: space, time
Multidimensional bins:
bins = by + 0.5
bins = xr.DataArray(
np.concatenate([bins, bins[:, [-1]] + 1], axis=-1)[:, :-1].T,
dims=("time", "nbins"),
name="bins",
)
bins
<xarray.DataArray 'bins' (time: 3, nbins: 4)> Size: 96B array([[1.5, 3.5, 5.5, 6.5], [2.5, 4.5, 6.5, 7.5], [3.5, 5.5, 7.5, 9.5]]) Dimensions without coordinates: time, nbins
Concept¶
The key idea is that GroupBy is two steps:
Factorize (a.k.a “digitize”) : convert the
by
data to a set of integer codes representing the bins.Apply the reduction.
We treat multi-dimensional binning as a slightly complicated factorization
problem. Assume that bins are a function of time
. So we
generate a set of appropriate integer codes by:
Loop over “time” and factorize the data appropriately.
Add an offset to these codes so that “bin 0” for
time=0
is different from “bin 0” fortime=1
apply the groupby reduction to the “offset codes”
reshape the output to the right shape
We will work at the xarray level, so its easy to keep track of the different dimensions.
Factorizing¶
The core factorize_
function (which wraps pd.cut
) only handles 1D bins, so
we use xr.apply_ufunc
to vectorize it for us.
factorize_loop_dim = "time"
def factorize_nd_bins_core(by, bins):
group_idx, *_, props = flox.core.factorize_(
(by,),
axes=(-1,),
expected_groups=(pd.IntervalIndex.from_breaks(bins),),
)
# Use -1 as the NaN sentinel value
group_idx[props.nanmask] = -1
return group_idx
codes = xr.apply_ufunc(
factorize_nd_bins_core,
by,
bins,
# TODO: avoid hardcoded dim names
input_core_dims=[["space"], ["nbins"]],
output_core_dims=[["space"]],
vectorize=True,
)
codes
<xarray.DataArray (time: 3, space: 4)> Size: 96B array([[-1, 0, 1, 2], [-1, 0, 1, 2], [-1, 0, 1, 2]]) Dimensions without coordinates: time, space
Offset the codes¶
These are integer codes appropriate for a single timestep.
We now add an offset that changes in time, to make sure “bin 0” for time=0
is
different from “bin 0” for time=1
(taken from
this StackOverflow thread).
N = math.prod([codes.sizes[d] for d in codes.dims if d != factorize_loop_dim])
offset = xr.DataArray(np.arange(codes.sizes[factorize_loop_dim]), dims=factorize_loop_dim)
# TODO: think about N-1 here
offset_codes = (codes + offset * (N - 1)).rename(by.name)
offset_codes.data[codes == -1] = -1
offset_codes
<xarray.DataArray 'by' (time: 3, space: 4)> Size: 96B array([[-1, 0, 1, 2], [-1, 3, 4, 5], [-1, 6, 7, 8]]) Dimensions without coordinates: time, space
Reduce¶
Now that we have appropriate codes, let’s apply the reduction
interim = flox.xarray.xarray_reduce(
array,
offset_codes,
func="sum",
# We use RangeIndex to indicate that `-1` code can be safely ignored
# (it indicates values outside the bins)
# TODO: Avoid hardcoding 9 = sizes["time"] x (sizes["nbins"] - 1)
expected_groups=pd.RangeIndex(9),
)
interim
<xarray.DataArray 'array' (by: 9)> Size: 72B array([ 4, 7, 10, 5, 8, 11, 6, 9, 12]) Coordinates: * by (by) int64 72B 0 1 2 3 4 5 6 7 8
Make final result¶
Now reshape that 1D result appropriately.
final = (
interim.coarsen(by=3)
# bin_number dimension is last, this makes sense since it is the core dimension
# and we vectorize over the loop dims.
# So the first (Nbins-1) elements are for the first index of the loop dim
.construct({"by": (factorize_loop_dim, "bin_number")})
.transpose(..., factorize_loop_dim)
.drop_vars("by")
)
final
<xarray.DataArray 'array' (bin_number: 3, time: 3)> Size: 72B array([[ 4, 5, 6], [ 7, 8, 9], [10, 11, 12]]) Dimensions without coordinates: bin_number, time
I think this is the expected answer.
array.isel(space=slice(1, None)).rename({"space": "bin_number"}).identical(final)
True
TODO¶
This could be extended to:
handle multiple
factorize_loop_dim
avoid hard coded dimension names in the
apply_ufunc
call for factorizingavoid hard coded number of output elements in the
xarray_reduce
call.Somehow propagate the bin edges to the final output.