Binning with multi-dimensional bins#

Warning

This post is a proof-of-concept for discussion. Expect APIs to change to enable this use case.

Here we explore a binning problem where the bins themselves are multi-dimensional (xhistogram issue).

One application of multi-dimensional bins is the ranked probability score (RPS) computed by xskillscore.rps, where we want to know how many forecasts fell into each bin. The bins are often defined by terciles of the forecast distribution, and the tercile edges (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')) depend on lon and lat.
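To make the idea of location-dependent bins concrete, here is a minimal NumPy-only sketch. It is not the flox approach developed in this post. The array name `forecast`, its shape, and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forecast with dimensions (time, lat, lon).
forecast = rng.normal(size=(30, 2, 3))

# Tercile edges computed along time, so every (lat, lon) point
# gets its own pair of edges: shape (2, lat, lon).
edges = np.quantile(forecast, [0.33, 0.66], axis=0)

# Bin index of each sample = number of local edges it exceeds.
# Broadcasting: (1, time, lat, lon) > (2, 1, lat, lon) -> (2, time, lat, lon)
bin_index = (forecast[np.newaxis] > edges[:, np.newaxis]).sum(axis=0)

# Count how many forecasts fell into each bin at each grid point:
# shape (bin_number, lat, lon).
counts = np.stack([(bin_index == b).sum(axis=0) for b in range(3)])
```

Because the edges vary with lat and lon, a single call to a 1D histogram routine with fixed bin edges cannot produce `counts`; that is the problem the rest of the post works around.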

import math

import numpy as np
import pandas as pd
import xarray as xr

import flox
import flox.xarray

Make final result#

Now reshape that 1D result appropriately.

final = (
    interim.coarsen(by=3)
    # bin_number dimension is last, this makes sense since it is the core dimension
    # and we vectorize over the loop dims.
    # So the first (Nbins-1) elements are for the first index of the loop dim
    .construct({"by": (factorize_loop_dim, "bin_number")})
    .transpose(..., factorize_loop_dim)
    .drop_vars("by")
)
final

<xarray.DataArray 'array' (bin_number: 3, time: 3)> Size: 72B
array([[ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
Dimensions without coordinates: bin_number, time

I think this is the expected answer.

array.isel(space=slice(1, None)).rename({"space": "bin_number"}).identical(final)

True
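The reshape above can be mimicked in plain NumPy to see what coarsen and construct are doing. This is only a sketch: the flat values below are reconstructed from the output shown above under the assumption that the bin number varies fastest, and the names `flat`, `n_loop` and `n_bins` are hypothetical.

```python
import numpy as np

n_loop, n_bins = 3, 3  # assumed sizes: 3 time steps, 3 bins

# Assumed flat 1D result: the first n_bins elements belong to the
# first index of the loop dimension, the next n_bins to the second, ...
flat = np.array([4, 7, 10, 5, 8, 11, 6, 9, 12])

# Split the flat axis into (loop dim, bin_number), then move the
# loop dim last, mirroring .construct(...) followed by .transpose(...).
final_np = flat.reshape(n_loop, n_bins).T
```

The result has shape (bin_number, time), the same layout as `final`.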

TODO#

This could be extended to:

handle multiple factorize_loop_dim
avoid hard-coded dimension names in the apply_ufunc call for factorizing
avoid the hard-coded number of output elements in the xarray_reduce call
propagate the bin edges to the final output
