{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# Large Raster Zonal Statistics\n",
"\n",
"\"Zonal statistics\" spans a large range of problems. \n",
"\n",
"This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has a global extent on a 30m grid with shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
"\n",
"The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"## Example dataset"
]
},
{
"cell_type": "markdown",
"id": "2",
"metadata": {},
"source": [
"Here is a representative version of the dataset (in terms of size and chunk sizes)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3",
"metadata": {},
"outputs": [],
"source": [
"import dask.array\n",
"import numpy as np\n",
"import xarray as xr\n",
"\n",
"from flox.xarray import xarray_reduce\n",
"\n",
"sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
"chunksizes = {\"y\": 10_000, \"x\": 10_000}\n",
"dims = (\"y\", \"x\")\n",
"shape = tuple(sizes[d] for d in dims)\n",
"chunks = tuple(chunksizes[d] for d in dims)\n",
"\n",
"ds = xr.Dataset(\n",
" {\n",
" \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
" \"tcl_year\": (\n",
" dims,\n",
" 1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
" ),\n",
" \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
" \"tcd_thresholds\": (\n",
" dims,\n",
" 3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
" ),\n",
" \"adm0\": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
" \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
" \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
" }\n",
")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "4",
"metadata": {},
"source": [
"## Zonal Statistics"
]
},
{
"cell_type": "markdown",
"id": "5",
"metadata": {},
"source": [
"Next define the grouper arrays and expected group labels"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6",
"metadata": {},
"outputs": [],
"source": [
"by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n",
"expected_groups = (\n",
" np.arange(23),\n",
" np.arange(1, 6),\n",
" np.arange(1, 8),\n",
" np.arange(248),\n",
" np.arange(86),\n",
" np.arange(854),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"metadata": {},
"outputs": [],
"source": [
"result = xarray_reduce(\n",
" ds.areas,\n",
" *by,\n",
" expected_groups=expected_groups,\n",
" func=\"sum\",\n",
")\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "8",
"metadata": {},
"source": [
"Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000 which is much smaller than 23 x 5 x 7 x 248 x 86 x 854 = 14_662_360_160.\n",
"\n",
"We end up with one humoungous 56GB chunk, that is mostly empty (sparsity ~ 48,000/14_662_360_160 ~ 0.2%).\n",
"\n",
"## We can do better using a sparse array\n",
"\n",
"Since the results are very sparse, we can instruct flox to construct dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
"\n",
"```python\n",
"ReindexStrategy(\n",
" # do not reindex to the full output grid at the blockwise aggregation stage\n",
" blockwise=False,\n",
" # when combining intermediate results after blockwise aggregation, reindex to the\n",
" # common grid using a sparse.COO array type\n",
" array_type=ReindexArrayType.SPARSE_COO\n",
")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9",
"metadata": {},
"outputs": [],
"source": [
"from flox import ReindexArrayType, ReindexStrategy\n",
"\n",
"result = xarray_reduce(\n",
" ds.areas,\n",
" *by,\n",
" expected_groups=expected_groups,\n",
" func=\"sum\",\n",
" reindex=ReindexStrategy(\n",
" blockwise=False,\n",
" array_type=ReindexArrayType.SPARSE_COO,\n",
" ),\n",
" fill_value=0,\n",
")\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "10",
"metadata": {},
"source": [
"The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.\n",
"\n",
"The computation runs smoothly with low memory."
]
},
{
"cell_type": "markdown",
"id": "11",
"metadata": {},
"source": [
"## Why\n",
"\n",
"To understand why you might do this, here is how flox runs reductions. In the images below, the `areas` array on the left has 5 2D chunks. Each color represents a group, each square represents a value of the array; clearly there are different groups in each chunk. \n",
"\n",
"\n",
"### reindex = True\n",
"\n",
"
\n",
"\n",
"First, the grouped-reduction is run on each chunk independently, and the results are constructed as _dense_ arrays on the full 23 x 5 x 7 x 248 x 86 x 854 output grid. This means that every chunk balloons to ~50GB. This method cannot work well.\n",
"\n",
"### reindex = False with sparse intermediates\n",
"\n",
"
\n",
"\n",
"First, the grouped-reduction is run on each chunk independently. Conceptually the result after this step is an array with differently sized chunks. \n",
"\n",
"Next results from neighbouring blocks are concatenated and a reduction is run again. These results are first aligned or reindexed to a common grid of group labels, termed \"reindexing\". At this stage, we instruct flox to construct a _sparse array_ during reindexing, otherwise we will eventually end up constructing _dense_ reindexed arrays of shape 23 x 5 x 7 x 248 x 86 x 854.\n",
"\n",
"\n",
"## Can we do better?\n",
"\n",
"Yes. \n",
"\n",
"1. Using the reindexing machinery to convert intermediates to sparse is a little bit hacky. A better option would be to aggregate directly to sparse arrays, potentially using a new `engine=\"sparse\"` ([issue](https://github.com/xarray-contrib/flox/issues/346)).\n",
"2. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000. A much more sensible solution would be to allow grouping by these _geometries_ directly. This would allow us to be smart about the reduction, by exploiting the ideas underlying the [`method=\"cohorts\"` strategy](../implementation.md#method-cohorts).\n",
"\n",
"Regardless, the ability to do such reindexing allows flox to scale to much larger grouper arrays than previously possible.\n",
"\n"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}