{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Large Raster Zonal Statistics\n", "\n", "\"Zonal statistics\" spans a large range of problems. \n", "\n", "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has a global extent on a 30m grid with shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n", "\n", "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. " ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Example dataset" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "Here is a representative version of the dataset (in terms of size and chunk sizes)." ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "import dask.array\n", "import numpy as np\n", "import xarray as xr\n", "\n", "from flox.xarray import xarray_reduce\n", "\n", "sizes = {\"y\": 560_000, \"x\": 1440_000}\n", "chunksizes = {\"y\": 10_000, \"x\": 10_000}\n", "dims = (\"y\", \"x\")\n", "shape = tuple(sizes[d] for d in dims)\n", "chunks = tuple(chunksizes[d] for d in dims)\n", "\n", "ds = xr.Dataset(\n", " {\n", " \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n", " \"tcl_year\": (\n", " dims,\n", " 1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n", " ),\n", " \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n", " \"tcd_thresholds\": (\n", " dims,\n", " 3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n", " ),\n", " \"adm0\": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n", " \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n", " \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n", " }\n", ")\n", "ds" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## Zonal Statistics" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "Next define the grouper arrays and expected group labels" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n", "expected_groups = (\n", " np.arange(23),\n", " np.arange(1, 6),\n", " np.arange(1, 8),\n", " np.arange(248),\n", " np.arange(86),\n", " np.arange(854),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "result = xarray_reduce(\n", " ds.areas,\n", " *by,\n", " expected_groups=expected_groups,\n", " func=\"sum\",\n", ")\n", "result" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000 which is much smaller than 23 x 5 x 7 x 248 x 86 x 854 = 14_662_360_160.\n", "\n", "We end up with one humoungous 56GB chunk, that is mostly empty (sparsity ~ 48,000/14_662_360_160 ~ 0.2%).\n", "\n", "## We can do better using a sparse array\n", "\n", "Since the results are very sparse, we can instruct flox to construct dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n", "\n", "```python\n", "ReindexStrategy(\n", " # do not reindex to the full output grid at the blockwise aggregation stage\n", " blockwise=False,\n", " # when combining intermediate results after blockwise aggregation, reindex to the\n", " # common grid using a sparse.COO array type\n", " array_type=ReindexArrayType.SPARSE_COO\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "from flox import ReindexArrayType, ReindexStrategy\n", "\n", "result = xarray_reduce(\n", " ds.areas,\n", " *by,\n", " expected_groups=expected_groups,\n", " func=\"sum\",\n", " reindex=ReindexStrategy(\n", " blockwise=False,\n", " array_type=ReindexArrayType.SPARSE_COO,\n", " ),\n", " fill_value=0,\n", ")\n", "result" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.\n", "\n", "The computation runs smoothly with low memory." ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## Why\n", "\n", "To understand why you might do this, here is how flox runs reductions. In the images below, the `areas` array on the left has 5 2D chunks. Each color represents a group, each square represents a value of the array; clearly there are different groups in each chunk. \n", "\n", "\n", "### reindex = True\n", "\n", "\n", "\n", "First, the grouped-reduction is run on each chunk independently, and the results are constructed as _dense_ arrays on the full 23 x 5 x 7 x 248 x 86 x 854 output grid. This means that every chunk balloons to ~50GB. This method cannot work well.\n", "\n", "### reindex = False with sparse intermediates\n", "\n", "\n", "\n", "First, the grouped-reduction is run on each chunk independently. Conceptually the result after this step is an array with differently sized chunks. \n", "\n", "Next results from neighbouring blocks are concatenated and a reduction is run again. These results are first aligned or reindexed to a common grid of group labels, termed \"reindexing\". At this stage, we instruct flox to construct a _sparse array_ during reindexing, otherwise we will eventually end up constructing _dense_ reindexed arrays of shape 23 x 5 x 7 x 248 x 86 x 854.\n", "\n", "\n", "## Can we do better?\n", "\n", "Yes. \n", "\n", "1. Using the reindexing machinery to convert intermediates to sparse is a little bit hacky. A better option would be to aggregate directly to sparse arrays, potentially using a new `engine=\"sparse\"` ([issue](https://github.com/xarray-contrib/flox/issues/346)).\n", "2. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000. A much more sensible solution would be to allow grouping by these _geometries_ directly. This would allow us to be smart about the reduction, by exploiting the ideas underlying the [`method=\"cohorts\"` strategy](../implementation.md#method-cohorts).\n", "\n", "Regardless, the ability to do such reindexing allows flox to scale to much larger grouper arrays than previously possible.\n", "\n" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }