10 minutes to flox#
GroupBy single variable#
import numpy as np
import xarray as xr
from flox.xarray import xarray_reduce
labels = xr.DataArray(
[1, 2, 3, 1, 2, 3, 0, 0, 0],
dims="x",
name="label",
)
labels
<xarray.DataArray 'label' (x: 9)> array([1, 2, 3, 1, 2, 3, 0, 0, 0]) Dimensions without coordinates: x
With numpy#
da = xr.DataArray(
np.ones((9,)), dims="x", name="array"
)
Apply the reduction using flox.xarray.xarray_reduce(), specifying the reduction operation in func:
xarray_reduce(da, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> array([3., 2., 2., 2.]) Coordinates: * label (label) int64 0 1 2 3
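As an aside (not part of the original walkthrough), the same result can also be obtained through xarray's built-in GroupBy; xarray_reduce is simply the direct interface to flox's engine:
da.groupby(labels).sum()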
With dask#
Let’s first chunk da and labels:
da_chunked = da.chunk(x=2)
labels_chunked = labels.chunk(x=3)
Grouping a dask array by a numpy array is unchanged:
xarray_reduce(da_chunked, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray> Coordinates: * label (label) int64 0 1 2 3
When grouping by a dask array, we need to specify the “expected group labels” on the output so we can construct the result DataArray.
Without the expected_groups kwarg, an error is raised:
xarray_reduce(da_chunked, labels_chunked, func="sum")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[6], line 1
----> 1 xarray_reduce(da_chunked, labels_chunked, func="sum")
File ~/checkouts/readthedocs.org/user_builds/flox/checkouts/latest/flox/xarray.py:340, in xarray_reduce(obj, func, expected_groups, isbin, sort, dim, fill_value, dtype, method, engine, keep_attrs, skipna, min_count, reindex, *by, **finalize_kwargs)
335 if isbin_:
336 raise ValueError(
337 f"Please provided bin edges for group variable {idx} "
338 f"named {group_name} in expected_groups."
339 )
--> 340 expect1 = _get_expected_groups(b_.data, sort=sort)
341 else:
342 expect1 = expect
File ~/checkouts/readthedocs.org/user_builds/flox/checkouts/latest/flox/core.py:105, in _get_expected_groups(by, sort)
103 def _get_expected_groups(by: T_By, sort: bool) -> T_ExpectIndex:
104 if is_duck_dask_array(by):
--> 105 raise ValueError("Please provide expected_groups if not grouping by a numpy array.")
106 flatby = by.reshape(-1)
107 expected = pd.unique(flatby[~isnull(flatby)])
ValueError: Please provide expected_groups if not grouping by a numpy array.
Now we specify expected_groups:
dask_result = xarray_reduce(
da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2, 3],
)
dask_result
<xarray.DataArray 'array' (label: 4)> dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray> Coordinates: * label (label) int64 0 1 2 3
Note that any group labels not present in expected_groups will be ignored.
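For example (an illustrative aside, not part of the original walkthrough), leaving 3 out of expected_groups drops those observations from the result entirely:
xarray_reduce(da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2])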
You can also provide expected_groups for the pure numpy GroupBy:
numpy_result = xarray_reduce(
da, labels, func="sum", expected_groups=[0, 1, 2, 3],
)
numpy_result
<xarray.DataArray 'array' (label: 4)> array([3., 2., 2., 2.]) Coordinates: * label (label) int64 0 1 2 3
The two are identical:
numpy_result.identical(dask_result)
True
Binning by a single variable#
For binning, specify the bin edges in expected_groups using pandas.IntervalIndex:
import pandas as pd
xarray_reduce(
da,
labels,
func="sum",
expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> array([2., 2., 2.]) Coordinates: * label_bins (label_bins) object (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
Similarly for dask inputs:
xarray_reduce(
da_chunked,
labels_chunked,
func="sum",
expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray> Coordinates: * label_bins (label_bins) object (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
For more control over the binning (which edge is closed), pass the closed kwarg to pandas.IntervalIndex:
xarray_reduce(
da_chunked,
labels_chunked,
func="sum",
expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left"),
)
<xarray.DataArray 'array' (label_bins: 3)> dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray> Coordinates: * label_bins (label_bins) object [0.5, 1.5) [1.5, 2.5) [2.5, 6.0)
Grouping by multiple variables#
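First construct the example data: a (4, 12) array of ones with a string label along y and an integer label along x (reconstructed here to match the repr shown below):
arr = np.ones((4, 12))
labels1 = np.array(["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"])
labels2 = np.array([1, 2, 2, 1])
da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da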
<xarray.DataArray (x: 4, y: 12)> array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]) Coordinates: labels2 (x) int64 1 2 2 1 labels1 (y) <U1 'a' 'a' 'c' 'c' 'c' 'b' 'b' 'c' 'c' 'b' 'b' 'f' Dimensions without coordinates: x, y
To group by multiple variables, simply pass them as *args:
xarray_reduce(da, "labels1", "labels2", func="sum")
<xarray.DataArray (labels1: 4, labels2: 2)> array([[ 4., 4.], [ 8., 8.], [10., 10.], [ 2., 2.]]) Coordinates: * labels1 (labels1) object 'a' 'b' 'c' 'f' * labels2 (labels2) int64 1 2
Histogramming (Binning by multiple variables)#
An unweighted histogram is simply a groupby over multiple variables with func="count".
arr = np.ones((4, 12))
labels1 = np.array(np.linspace(0, 10, 12))
labels2 = np.array([1, 2, 2, 1])
da = xr.DataArray(
arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]) Coordinates: labels2 (x) int64 1 2 2 1 labels1 (y) float64 0.0 0.9091 1.818 2.727 3.636 ... 7.273 8.182 9.091 10.0 Dimensions without coordinates: x, y
Specify bins in expected_groups:
xarray_reduce(
da,
"labels1",
"labels2",
func="count",
expected_groups=(
pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]), # labels1
pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]), # labels2
),
)
<xarray.DataArray (labels1_bins: 3, labels2_bins: 2)> array([[10, 0], [ 6, 0], [ 4, 0]]) Coordinates: * labels1_bins (labels1_bins) object (-0.5, 4.5] (4.5, 6.5] (6.5, 8.9] * labels2_bins (labels2_bins) object (0.5, 1.5] (1.5, 1.9]
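As a sanity check (an aside, reusing the labels1 and labels2 arrays defined above), the same counts can be reproduced with numpy.histogram2d after broadcasting the two label arrays against each other. Note that numpy's bins are left-closed while the IntervalIndex above is right-closed, so the two only agree because no data point falls exactly on an interior bin edge here:
l1, l2 = np.broadcast_arrays(labels1[np.newaxis, :], labels2[:, np.newaxis])
counts, _, _ = np.histogram2d(
    l1.ravel(),
    l2.ravel(),
    bins=[[-0.5, 4.5, 6.5, 8.9], [0.5, 1.5, 1.9]],
)
counts  # same values as the flox result above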
Resampling#
Use the xarray interface, i.e. da.resample(time="M").mean().
Optionally pass method="blockwise": da.resample(time="M").mean(method="blockwise")
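A minimal, self-contained sketch of the resampling pattern (the time coordinate and variable names here are illustrative, not from the original page):
times = pd.date_range("2021-01-01", periods=120, freq="D")
ts = xr.DataArray(np.arange(120.0), dims="time", coords={"time": times}, name="ts")
ts.resample(time="M").mean()  # with flox installed, xarray dispatches this reduction to flox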