10 minutes to flox

GroupBy single variable

import numpy as np
import xarray as xr

from flox.xarray import xarray_reduce

labels = xr.DataArray(
    [1, 2, 3, 1, 2, 3, 0, 0, 0],
    dims="x",
    name="label",
)
labels
<xarray.DataArray 'label' (x: 9)> Size: 72B
array([1, 2, 3, 1, 2, 3, 0, 0, 0])
Dimensions without coordinates: x

With numpy

da = xr.DataArray(
    np.ones((9,)), dims="x", name="array"
)

Apply the reduction using flox.xarray.xarray_reduce() specifying the reduction operation in func

xarray_reduce(da, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3

With dask

Let’s first chunk da and labels

da_chunked = da.chunk(x=2)
labels_chunked = labels.chunk(x=3)

Grouping a dask array by a numpy array is unchanged

xarray_reduce(da_chunked, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3

When grouping by a dask array, we need to specify the “expected group labels” on the output so we can construct the result DataArray. Without the expected_groups kwarg, an error is raised

xarray_reduce(da_chunked, labels_chunked, func="sum")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 xarray_reduce(da_chunked, labels_chunked, func="sum")

File ~/checkouts/readthedocs.org/user_builds/flox/checkouts/latest/flox/xarray.py:349, in xarray_reduce(obj, func, expected_groups, isbin, sort, dim, fill_value, dtype, method, engine, keep_attrs, skipna, min_count, reindex, *by, **finalize_kwargs)
    344     if isbin_:
    345         raise ValueError(
    346             f"Please provided bin edges for group variable {idx} "
    347             f"named {group_name} in expected_groups."
    348         )
--> 349     expect1 = _get_expected_groups(b_.data, sort=sort)
    350 else:
    351     expect1 = expect

File ~/checkouts/readthedocs.org/user_builds/flox/checkouts/latest/flox/core.py:167, in _get_expected_groups(by, sort)
    165 def _get_expected_groups(by: T_By, sort: bool) -> T_ExpectIndex:
    166     if is_duck_dask_array(by):
--> 167         raise ValueError("Please provide expected_groups if not grouping by a numpy array.")
    168     flatby = by.reshape(-1)
    169     expected = pd.unique(flatby[notnull(flatby)])

ValueError: Please provide expected_groups if not grouping by a numpy array.

Now we specify expected_groups:

dask_result = xarray_reduce(
    da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2, 3],
)
dask_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3

Note that any group labels not present in expected_groups will be ignored. You can also provide expected_groups for the pure numpy GroupBy.

numpy_result = xarray_reduce(
    da, labels, func="sum", expected_groups=[0, 1, 2, 3],
)
numpy_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3

The two are identical:

numpy_result.identical(dask_result)
True

Binning by a single variable

For binning, specify the bin edges in expected_groups using pandas.IntervalIndex:

import pandas as pd

xarray_reduce(
    da,
    labels,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
array([2., 2., 2.])
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]

Similarly for dask inputs

xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]

For more control over the binning (which edge is closed), pass the appropriate kwarg to pandas.IntervalIndex:

xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left"),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B [0.5, 1.5) [1.5, 2.5) [2.5, 6.0)

Grouping by multiple variables

arr = np.ones((4, 12))
labels1 = np.array(["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"])
labels2 = np.array([1, 2, 2, 1])

da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) <U1 48B 'a' 'a' 'c' 'c' 'c' 'b' 'b' 'c' 'c' 'b' 'b' 'f'
Dimensions without coordinates: x, y

To group by multiple variables simply pass them as *args:

xarray_reduce(da, "labels1", "labels2", func="sum")
<xarray.DataArray (labels1: 4, labels2: 2)> Size: 64B
array([[ 4.,  4.],
       [ 8.,  8.],
       [10., 10.],
       [ 2.,  2.]])
Coordinates:
  * labels1  (labels1) object 32B 'a' 'b' 'c' 'f'
  * labels2  (labels2) int64 16B 1 2

Histogramming (Binning by multiple variables)

An unweighted histogram is simply a groupby multiple variables with count.

arr = np.ones((4, 12))
labels1 = np.array(np.linspace(0, 10, 12))
labels2 = np.array([1, 2, 2, 1])

da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) float64 96B 0.0 0.9091 1.818 2.727 ... 7.273 8.182 9.091 10.0
Dimensions without coordinates: x, y

Specify bins in expected_groups

xarray_reduce(
    da,
    "labels1",
    "labels2",
    func="count",
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # labels1
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),  # labels2
    ),
)
<xarray.DataArray (labels1_bins: 3, labels2_bins: 2)> Size: 48B
array([[10,  0],
       [ 6,  0],
       [ 4,  0]])
Coordinates:
  * labels1_bins  (labels1_bins) object 24B (-0.5, 4.5] (4.5, 6.5] (6.5, 8.9]
  * labels2_bins  (labels2_bins) object 16B (0.5, 1.5] (1.5, 1.9]

Resampling

Use the xarray interface i.e. da.resample(time="M").mean().

Optionally pass method="blockwise": da.resample(time="M").mean(method="blockwise")