{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Large Raster Zonal Statistics\n",
    "\n",
    "\"Zonal statistics\" spans a large range of problems. \n",
    "\n",
    "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has a global extent on a 30m grid with shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
    "\n",
    "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "## Example dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "Here is a representative version of the dataset (in terms of size and chunk sizes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import dask.array\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "\n",
    "from flox.xarray import xarray_reduce\n",
    "\n",
    "sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
    "chunksizes = {\"y\": 10_000, \"x\": 10_000}\n",
    "dims = (\"y\", \"x\")\n",
    "shape = tuple(sizes[d] for d in dims)\n",
    "chunks = tuple(chunksizes[d] for d in dims)\n",
    "\n",
    "ds = xr.Dataset(\n",
    "    {\n",
    "        \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
    "        \"tcl_year\": (\n",
    "            dims,\n",
    "            1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
    "        ),\n",
    "        \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
    "        \"tcd_thresholds\": (\n",
    "            dims,\n",
    "            3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
    "        ),\n",
    "        \"adm0\": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
    "        \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
    "        \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
    "    }\n",
    ")\n",
    "ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "## Zonal Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "Next define the grouper arrays and expected group labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n",
    "expected_groups = (\n",
    "    np.arange(23),\n",
    "    np.arange(1, 6),\n",
    "    np.arange(1, 8),\n",
    "    np.arange(248),\n",
    "    np.arange(86),\n",
    "    np.arange(854),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = xarray_reduce(\n",
    "    ds.areas,\n",
    "    *by,\n",
    "    expected_groups=expected_groups,\n",
    "    func=\"sum\",\n",
    ")\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000 which is much smaller than 23 x 5 x 7 x 248 x 86 x 854 = 14_662_360_160.\n",
    "\n",
    "We end up with one humoungous 56GB chunk, that is mostly empty (sparsity ~ 48,000/14_662_360_160 ~ 0.2%).\n",
    "\n",
    "## We can do better using a sparse array\n",
    "\n",
    "Since the results are very sparse, we can instruct flox to construct dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
    "\n",
    "```python\n",
    "ReindexStrategy(\n",
    "    # do not reindex to the full output grid at the blockwise aggregation stage\n",
    "    blockwise=False,\n",
    "    # when combining intermediate results after blockwise aggregation, reindex to the\n",
    "    # common grid using a sparse.COO array type\n",
    "    array_type=ReindexArrayType.SPARSE_COO\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from flox import ReindexArrayType, ReindexStrategy\n",
    "\n",
    "result = xarray_reduce(\n",
    "    ds.areas,\n",
    "    *by,\n",
    "    expected_groups=expected_groups,\n",
    "    func=\"sum\",\n",
    "    reindex=ReindexStrategy(\n",
    "        blockwise=False,\n",
    "        array_type=ReindexArrayType.SPARSE_COO,\n",
    "    ),\n",
    "    fill_value=0,\n",
    ")\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.\n",
    "\n",
    "The computation runs smoothly with low memory."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "## Why\n",
    "\n",
    "To understand why you might do this, here is how flox runs reductions. In the images below, the `areas` array on the left has 5 2D chunks. Each color represents a group, each square represents a value of the array; clearly there are different groups in each chunk. \n",
    "\n",
    "\n",
    "### reindex = True\n",
    "\n",
    "<img src=\"../_images/new-map-reduce-reindex-True-annotated.svg\" width=100%>\n",
    "\n",
    "First, the grouped-reduction is run on each chunk independently, and the results are constructed as _dense_ arrays on the full 23 x 5 x 7 x 248 x 86 x 854 output grid. This means that every chunk balloons to ~50GB. This method cannot work well.\n",
    "\n",
    "### reindex = False with sparse intermediates\n",
    "\n",
    "<img src=\"../_images/new-map-reduce-reindex-False-annotated.svg\" width=100%>\n",
    "\n",
    "First, the grouped-reduction is run on each chunk independently. Conceptually the result after this step is an array with differently sized chunks. \n",
    "\n",
    "Next results from neighbouring blocks are concatenated and a reduction is run again. These results are first aligned or reindexed to a common grid of group labels, termed \"reindexing\". At this stage, we instruct flox to construct a _sparse array_ during reindexing, otherwise we will eventually end up constructing _dense_ reindexed arrays of shape 23 x 5 x 7 x 248 x 86 x 854.\n",
    "\n",
    "\n",
    "## Can we do better?\n",
    "\n",
    "Yes. \n",
    "\n",
    "1. Using the reindexing machinery to convert intermediates to sparse is a little bit hacky. A better option would be to aggregate directly to sparse arrays, potentially using a new `engine=\"sparse\"` ([issue](https://github.com/xarray-contrib/flox/issues/346)).\n",
    "2. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000. A much more sensible solution would be to allow grouping by these _geometries_ directly. This would allow us to be smart about the reduction, by exploiting the ideas underlying the [`method=\"cohorts\"` strategy](../implementation.md#method-cohorts).\n",
    "\n",
    "Regardless, the ability to do such reindexing allows flox to scale to much larger grouper arrays than previously possible.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}