10 minutes to flox
GroupBy single variable
import numpy as np
import xarray as xr
from flox.xarray import xarray_reduce
labels = xr.DataArray(
    [1, 2, 3, 1, 2, 3, 0, 0, 0],
    dims="x",
    name="label",
)
labels
<xarray.DataArray 'label' (x: 9)> Size: 72B
array([1, 2, 3, 1, 2, 3, 0, 0, 0])
Dimensions without coordinates: x
With numpy
da = xr.DataArray(
    np.ones((9,)), dims="x", name="array"
)
Apply the reduction using flox.xarray.xarray_reduce(), specifying the reduction operation with the func kwarg:
xarray_reduce(da, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3
With dask
Let’s first chunk da and labels:
da_chunked = da.chunk(x=2)
labels_chunked = labels.chunk(x=3)
Grouping a dask array by a numpy array is unchanged:
xarray_reduce(da_chunked, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3
When grouping by a dask array, we need to specify the expected group labels on the output so the result DataArray can be constructed. Without the expected_groups kwarg, an error is raised:
xarray_reduce(da_chunked, labels_chunked, func="sum")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 xarray_reduce(da_chunked, labels_chunked, func="sum")
...
ValueError: Please provide expected_groups if not grouping by a numpy array.
Now we specify expected_groups:
dask_result = xarray_reduce(
    da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2, 3],
)
dask_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3
Note that any group labels not present in expected_groups will be ignored.
You can also provide expected_groups for the pure numpy GroupBy:
numpy_result = xarray_reduce(
    da, labels, func="sum", expected_groups=[0, 1, 2, 3],
)
numpy_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3
The two are identical:
numpy_result.identical(dask_result)
True
Binning by a single variable
For binning, specify the bin edges in expected_groups using pandas.IntervalIndex:
import pandas as pd
xarray_reduce(
    da,
    labels,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
array([2., 2., 2.])
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
Similarly for dask inputs:
xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
For more control over the binning (which edge is closed), pass the appropriate kwarg to pandas.IntervalIndex:
xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left"),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B [0.5, 1.5) [1.5, 2.5) [2.5, 6.0)
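The effect of closed can be checked with pandas alone: IntervalIndex.get_indexer reports which interval (if any) contains each value, so edge values land differently under closed="right" (the default) and closed="left".

```python
import pandas as pd

right = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6])                 # closed="right" (default)
left = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left")

# get_indexer returns the bin position for each value, or -1 if it
# falls in no bin: 0.5 is excluded from (0.5, 1.5] but included in [0.5, 1.5).
right_bins = right.get_indexer([0.5, 1.5, 2.5])
left_bins = left.get_indexer([0.5, 1.5, 2.5])
```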
Grouping by multiple variables
First construct a DataArray with a label variable along each dimension:
da = xr.DataArray(
    np.ones((4, 12)),
    dims=("x", "y"),
    coords={
        "labels2": ("x", [1, 2, 2, 1]),
        "labels1": ("y", ["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"]),
    },
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) <U1 48B 'a' 'a' 'c' 'c' 'c' 'b' 'b' 'c' 'c' 'b' 'b' 'f'
Dimensions without coordinates: x, y
To group by multiple variables, simply pass them as *args:
xarray_reduce(da, "labels1", "labels2", func="sum")
<xarray.DataArray (labels1: 4, labels2: 2)> Size: 64B
array([[ 4.,  4.],
       [ 8.,  8.],
       [10., 10.],
       [ 2.,  2.]])
Coordinates:
  * labels1  (labels1) object 32B 'a' 'b' 'c' 'f'
  * labels2  (labels2) int64 16B 1 2
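As a sanity check, the same result can be reproduced with a plain pandas groupby over the flattened label arrays (a sketch, not part of the flox API):

```python
import numpy as np
import pandas as pd

# The label values from the DataArray above.
labels1 = np.array(["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"])
labels2 = np.array([1, 2, 2, 1])

# Broadcast both labels over the full (4, 12) grid of ones, then flatten.
df = pd.DataFrame({
    "labels1": np.tile(labels1, 4),
    "labels2": np.repeat(labels2, 12),
    "value": 1.0,
})
table = df.groupby(["labels1", "labels2"])["value"].sum().unstack("labels2")
```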
Histogramming (Binning by multiple variables)
An unweighted histogram is simply a groupby over multiple variables with func="count".
arr = np.ones((4, 12))
labels1 = np.linspace(0, 10, 12)
labels2 = np.array([1, 2, 2, 1])
da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) float64 96B 0.0 0.9091 1.818 2.727 ... 7.273 8.182 9.091 10.0
Dimensions without coordinates: x, y
Specify bins in expected_groups:
xarray_reduce(
    da,
    "labels1",
    "labels2",
    func="count",
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # labels1
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),        # labels2
    ),
)
<xarray.DataArray (labels1_bins: 3, labels2_bins: 2)> Size: 48B
array([[10,  0],
       [ 6,  0],
       [ 4,  0]])
Coordinates:
  * labels1_bins  (labels1_bins) object 24B (-0.5, 4.5] (4.5, 6.5] (6.5, 8.9]
  * labels2_bins  (labels2_bins) object 16B (0.5, 1.5] (1.5, 1.9]
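This can be cross-checked against numpy.histogram2d on the flattened label arrays. Note that histogram2d bins are left-closed (except the last), while the bins above are right-closed; no data point falls exactly on an edge here, so the counts agree:

```python
import numpy as np

labels1 = np.linspace(0, 10, 12)
labels2 = np.array([1, 2, 2, 1])

# Broadcast both labels over the (4, 12) grid and flatten, then
# histogram with the same bin edges as expected_groups above.
counts, _, _ = np.histogram2d(
    np.tile(labels1, 4),
    np.repeat(labels2, 12),
    bins=[[-0.5, 4.5, 6.5, 8.9], [0.5, 1.5, 1.9]],
)
```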
Resampling
Use the xarray interface, i.e. da.resample(time="M").mean().
Optionally pass method="blockwise": da.resample(time="M").mean(method="blockwise").
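A minimal, self-contained resampling sketch using plain xarray (the "MS" month-start frequency is used here; with flox installed, method="blockwise" can additionally be passed to the reduction as noted above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Daily data spanning two months.
time = pd.date_range("2021-01-01", "2021-02-28", freq="D")
da = xr.DataArray(np.ones(time.size), dims="time", coords={"time": time}, name="array")

# Monthly means via the standard xarray resample interface.
monthly = da.resample(time="MS").mean()
```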