flox: fast & furious GroupBy reductions for dask.array
Overview
flox mainly provides strategies for fast GroupBy reductions with dask.array. It uses the MapReduce paradigm (or a “tree reduction”)
to run the GroupBy operation in a parallel-native way, avoiding a sort or shuffle operation entirely.
It was motivated by
See a presentation (video, slides) about this package, from the Pangeo Showcase.
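The MapReduce (tree reduction) idea from the overview can be illustrated with a small NumPy-only sketch. This is not flox's implementation: the helper names (`chunk_partial_sums`, `combine`, `tree_reduce`, `groupby_mean`) are hypothetical, the "chunks" are plain array slices rather than dask blocks, and group labels are assumed to be small non-negative integers. Each chunk is reduced independently to per-group sums and counts, and those intermediates are then merged pairwise, so the data is never sorted or shuffled.

```python
import numpy as np


def chunk_partial_sums(values, labels, num_groups):
    """Map step: per-group sums and counts for one chunk, without sorting."""
    sums = np.bincount(labels, weights=values, minlength=num_groups)
    counts = np.bincount(labels, minlength=num_groups)
    return sums, counts


def combine(left, right):
    """Reduce step: merge two (sums, counts) intermediates elementwise."""
    return left[0] + right[0], left[1] + right[1]


def tree_reduce(partials):
    """Merge intermediates pairwise until a single one remains."""
    while len(partials) > 1:
        merged = [
            combine(partials[i], partials[i + 1])
            for i in range(0, len(partials) - 1, 2)
        ]
        if len(partials) % 2:
            # An odd leftover is carried to the next round unchanged.
            merged.append(partials[-1])
        partials = merged
    return partials[0]


def groupby_mean(values, labels, num_groups, chunk_size):
    """Grouped mean computed chunk by chunk, then combined."""
    partials = [
        chunk_partial_sums(
            values[i : i + chunk_size], labels[i : i + chunk_size], num_groups
        )
        for i in range(0, len(values), chunk_size)
    ]
    sums, counts = tree_reduce(partials)
    # Groups with no members yield NaN instead of raising a warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = np.array([0, 1, 0, 2, 1, 0, 2, 1])
means = groupby_mean(values, labels, num_groups=3, chunk_size=3)
```

The mean is carried as a (sum, count) pair because a mean of per-chunk means would be wrong when chunks hold different numbers of members per group. Only the final step divides.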
Why flox?
flox.groupby_reduce() wraps the numpy-groupies package for performant Groupby reductions on nD arrays.
flox.groupby_reduce() provides parallel-friendly strategies for GroupBy reductions by wrapping numpy-groupies for dask arrays.
flox integrates with xarray to provide more performant Groupby and Resampling operations.
flox.xarray.xarray_reduce() extends Xarray’s GroupBy operations allowing lazy grouping by dask arrays, grouping by multiple arrays, as well as combining categorical grouping and histogram-style binning operations using multiple variables.
flox also provides utility functions for rechunking both dask arrays and Xarray objects along a single dimension using the group labels as a guide:
    To rechunk for blockwise operations: flox.rechunk_for_blockwise(), flox.xarray.rechunk_for_blockwise().
    To rechunk so that “cohorts”, or groups of labels, tend to occur in the same chunks: flox.rechunk_for_cohorts(), flox.xarray.rechunk_for_cohorts().
Installing
$ pip install flox
$ conda install -c conda-forge flox
Acknowledgements
This work was funded in part by
NASA-ACCESS 80NSSC18M0156 “Community tools for analysis of NASA Earth Observing System Data in the Cloud” (PI J. Hamman),
NASA-OSTFL 80NSSC22K0345 “Enhancing analysis of NASA data with the open-source Python Xarray Library” (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
It was motivated by many discussions in the Pangeo community.