10 minutes to flox
GroupBy single variable
import numpy as np
import xarray as xr
from flox.xarray import xarray_reduce
labels = xr.DataArray(
    [1, 2, 3, 1, 2, 3, 0, 0, 0],
    dims="x",
    name="label",
)
labels
<xarray.DataArray 'label' (x: 9)> Size: 72B
array([1, 2, 3, 1, 2, 3, 0, 0, 0])
Dimensions without coordinates: x
With numpy
da = xr.DataArray(
    np.ones((9,)), dims="x", name="array"
)
Apply the reduction using flox.xarray.xarray_reduce(), specifying the reduction operation with the func kwarg:
xarray_reduce(da, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3
With dask
Let’s first chunk da and labels:
da_chunked = da.chunk(x=2)
labels_chunked = labels.chunk(x=3)
Grouping a dask array by a numpy array is unchanged:
xarray_reduce(da_chunked, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3
When grouping by a dask array, we need to specify the expected group labels on the output so the result DataArray can be constructed. Without the expected_groups kwarg, an error is raised:
xarray_reduce(da_chunked, labels_chunked, func="sum")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 xarray_reduce(da_chunked, labels_chunked, func="sum")
...
ValueError: Please provide expected_groups if not grouping by a numpy array.
Now we specify expected_groups:
dask_result = xarray_reduce(
    da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2, 3],
)
dask_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3
Note that any group labels not present in expected_groups will be ignored.
You can also provide expected_groups for the pure numpy GroupBy:
numpy_result = xarray_reduce(
    da, labels, func="sum", expected_groups=[0, 1, 2, 3],
)
numpy_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3
The two are identical:
numpy_result.identical(dask_result)
True
Binning by a single variable
For binning, specify the bin edges in expected_groups using pandas.IntervalIndex:
import pandas as pd
xarray_reduce(
    da,
    labels,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
array([2., 2., 2.])
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
Similarly for dask inputs:
xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
For more control over the binning (which edge is closed), pass the appropriate kwarg to pandas.IntervalIndex:
xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left"),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B [0.5, 1.5) [1.5, 2.5) [2.5, 6.0)
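The effect of closed can be checked with pandas alone: IntervalIndex.get_indexer reports which interval (if any) contains each value, so edge values land differently under closed="right" (the default) and closed="left".

```python
import pandas as pd

right = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6])                 # closed="right" (default)
left = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left")

# get_indexer returns the bin position for each value, or -1 if it
# falls in no bin: 0.5 is excluded from (0.5, 1.5] but included in [0.5, 1.5).
right_bins = right.get_indexer([0.5, 1.5, 2.5])
left_bins = left.get_indexer([0.5, 1.5, 2.5])
```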
Grouping by multiple variables
First construct a DataArray with a label variable along each dimension:
da = xr.DataArray(
    np.ones((4, 12)),
    dims=("x", "y"),
    coords={
        "labels2": ("x", [1, 2, 2, 1]),
        "labels1": ("y", ["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"]),
    },
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) <U1 48B 'a' 'a' 'c' 'c' 'c' 'b' 'b' 'c' 'c' 'b' 'b' 'f'
Dimensions without coordinates: x, y
To group by multiple variables, simply pass them as *args:
xarray_reduce(da, "labels1", "labels2", func="sum")
<xarray.DataArray (labels1: 4, labels2: 2)> Size: 64B
array([[ 4.,  4.],
       [ 8.,  8.],
       [10., 10.],
       [ 2.,  2.]])
Coordinates:
  * labels1  (labels1) object 32B 'a' 'b' 'c' 'f'
  * labels2  (labels2) int64 16B 1 2
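As a sanity check, the same result can be reproduced with a plain pandas groupby over the flattened label arrays (a sketch, not part of the flox API):

```python
import numpy as np
import pandas as pd

# The label values from the DataArray above.
labels1 = np.array(["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"])
labels2 = np.array([1, 2, 2, 1])

# Broadcast both labels over the full (4, 12) grid of ones, then flatten.
df = pd.DataFrame({
    "labels1": np.tile(labels1, 4),
    "labels2": np.repeat(labels2, 12),
    "value": 1.0,
})
table = df.groupby(["labels1", "labels2"])["value"].sum().unstack("labels2")
```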
Histogramming (Binning by multiple variables)
An unweighted histogram is simply a groupby over multiple variables with func="count".
arr = np.ones((4, 12))
labels1 = np.linspace(0, 10, 12)
labels2 = np.array([1, 2, 2, 1])
da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) float64 96B 0.0 0.9091 1.818 2.727 ... 7.273 8.182 9.091 10.0
Dimensions without coordinates: x, y
Specify bins in expected_groups:
xarray_reduce(
    da,
    "labels1",
    "labels2",
    func="count",
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # labels1
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),        # labels2
    ),
)
<xarray.DataArray (labels1_bins: 3, labels2_bins: 2)> Size: 48B
array([[10,  0],
       [ 6,  0],
       [ 4,  0]])
Coordinates:
  * labels1_bins  (labels1_bins) object 24B (-0.5, 4.5] (4.5, 6.5] (6.5, 8.9]
  * labels2_bins  (labels2_bins) object 16B (0.5, 1.5] (1.5, 1.9]
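This can be cross-checked against numpy.histogram2d on the flattened label arrays. Note that histogram2d bins are left-closed (except the last), while the bins above are right-closed; no data point falls exactly on an edge here, so the counts agree:

```python
import numpy as np

labels1 = np.linspace(0, 10, 12)
labels2 = np.array([1, 2, 2, 1])

# Broadcast both labels over the (4, 12) grid and flatten, then
# histogram with the same bin edges as expected_groups above.
counts, _, _ = np.histogram2d(
    np.tile(labels1, 4),
    np.repeat(labels2, 12),
    bins=[[-0.5, 4.5, 6.5, 8.9], [0.5, 1.5, 1.9]],
)
```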
Resampling
Use the xarray interface, i.e. da.resample(time="M").mean().
Optionally pass method="blockwise": da.resample(time="M").mean(method="blockwise").
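A minimal, self-contained resampling sketch using plain xarray (the "MS" month-start frequency is used here; with flox installed, method="blockwise" can additionally be passed to the reduction as noted above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Daily data spanning two months.
time = pd.date_range("2021-01-01", "2021-02-28", freq="D")
da = xr.DataArray(np.ones(time.size), dims="time", coords={"time": time}, name="array")

# Monthly means via the standard xarray resample interface.
monthly = da.resample(time="MS").mean()
```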