{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Statistical Downscaling and Bias-Adjustment\n",
    "\n",
    "`xsdba` provides tools and utilities to ease the bias-adjustment process. Almost all adjustment algorithms conform to the `train` - `adjust` scheme, formalized within `TrainAdjust` classes. Given a reference time series (`ref`), historical simulations (`hist`) and simulations to be adjusted (`sim`), any bias-adjustment method would be applied by first estimating the adjustment factors between the historical simulation and the observation series, and then applying these factors to `sim`, which could be a future simulation.\n",
    "\n",
    "This presents examples, while a bit more info and the API are given on [this page](../xsdba.rst).\n",
    "\n",
    "## Simple Quantile Mapping\n",
    "\n",
    "A very simple \"Quantile Mapping\" approach is available through the `EmpiricalQuantileMapping` object. The object is created through the `.train` method of the class, and the simulation is adjusted with `.adjust`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import cftime  # noqa\n",
    "import matplotlib.pyplot as plt\n",
    "import nc_time_axis  # noqa\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "\n",
    "%matplotlib inline\n",
    "plt.style.use(\"seaborn-v0_8\")\n",
    "plt.rcParams[\"figure.figsize\"] = (11, 5)\n",
    "\n",
    "# Create toy data to explore bias adjustment, here fake temperature timeseries\n",
    "t = xr.cftime_range(\"2000-01-01\", \"2030-12-31\", freq=\"D\", calendar=\"noleap\")\n",
    "\n",
    "ref = xr.DataArray(\n",
    "    (\n",
    "        -20 * np.cos(2 * np.pi * t.dayofyear / 365)\n",
    "        + 2 * np.random.random_sample((t.size,))\n",
    "        + 273.15\n",
    "        + 0.1 * (t - t[0]).days / 365\n",
    "    ),  # \"warming\" of 1K per decade,\n",
    "    dims=(\"time\",),\n",
    "    coords={\"time\": t},\n",
    "    attrs={\"units\": \"K\"},\n",
    ")\n",
    "sim = xr.DataArray(\n",
    "    (\n",
    "        -18 * np.cos(2 * np.pi * t.dayofyear / 365)\n",
    "        + 2 * np.random.random_sample((t.size,))\n",
    "        + 273.15\n",
    "        + 0.11 * (t - t[0]).days / 365\n",
    "    ),  # \"warming\" of 1.1K per decade\n",
    "    dims=(\"time\",),\n",
    "    coords={\"time\": t},\n",
    "    attrs={\"units\": \"K\"},\n",
    ")\n",
    "\n",
    "ref = ref.sel(time=slice(None, \"2015-01-01\"))\n",
    "hist = sim.sel(time=slice(None, \"2015-01-01\"))\n",
    "\n",
    "ref.plot(label=\"Reference\")\n",
    "sim.plot(label=\"Model\")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xsdba\n",
    "\n",
    "QM = xsdba.EmpiricalQuantileMapping.train(\n",
    "    ref, hist, nquantiles=15, group=\"time\", kind=\"+\"\n",
    ")\n",
    "scen = QM.adjust(sim, extrapolation=\"constant\", interp=\"nearest\")\n",
    "\n",
    "ref.groupby(\"time.dayofyear\").mean().plot(label=\"Reference\")\n",
    "hist.groupby(\"time.dayofyear\").mean().plot(label=\"Model - biased\")\n",
    "scen.sel(time=slice(\"2000\", \"2015\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2000-15\", linestyle=\"--\"\n",
    ")\n",
    "scen.sel(time=slice(\"2015\", \"2030\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2015-30\", linestyle=\"--\"\n",
    ")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous example, a simple Quantile Mapping algorithm was used with 15 quantiles and one group of values. The model performs well, but our toy data is also quite smooth and well-behaved so this is not surprising.\n",
    "\n",
    "A more complex example could have bias distribution varying strongly across months. To perform the adjustment with different factors for each month, one can pass `group='time.month'`. Moreover, to reduce the risk of drastic changes in the adjustments at the interface of months, `interp='linear'` can be passed to `.adjust` and the adjustment factors will be interpolated linearly (e.g.: the factors for the 1st of May will be the average of those for both April and May)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QM_mo = xsdba.EmpiricalQuantileMapping.train(\n",
    "    ref, hist, nquantiles=15, group=\"time.month\", kind=\"+\"\n",
    ")\n",
    "scen = QM_mo.adjust(sim, extrapolation=\"constant\", interp=\"linear\")\n",
    "\n",
    "ref.groupby(\"time.dayofyear\").mean().plot(label=\"Reference\")\n",
    "hist.groupby(\"time.dayofyear\").mean().plot(label=\"Model - biased\")\n",
    "scen.sel(time=slice(\"2000\", \"2015\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2000-15\", linestyle=\"--\"\n",
    ")\n",
    "scen.sel(time=slice(\"2015\", \"2030\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2015-30\", linestyle=\"--\"\n",
    ")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training data (here the adjustment factors) is available for inspection in the `ds` attribute of the adjustment object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QM_mo.ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QM_mo.ds.af.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping\n",
    "\n",
    "For basic time period grouping (months, day of year, season), passing a string to the methods needing it is sufficient. Most methods acting on grouped data also accept a `window` int argument to pad the groups with data from adjacent ones. Units of `window` are the sampling frequency of the main grouping dimension (usually `time`). For more complex grouping, or simply for clarity, one can pass a `xsdba.base.Grouper` directly.\n",
    "\n",
    "Another example of a simpler, adjustment method is below; Here we want `sim` to be scaled so that its mean fits the one of `ref`. Scaling factors are to be computed separately for each day of the year, but including 15 days on either side of the day. This means that the factor for the 1st of May is computed including all values from the 16th of April to the 15th of May (of all years)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "group = xsdba.Grouper(\"time.dayofyear\", window=31)\n",
    "QM_doy = xsdba.Scaling.train(ref, hist, group=group, kind=\"+\")\n",
    "scen = QM_doy.adjust(sim)\n",
    "\n",
    "ref.groupby(\"time.dayofyear\").mean().plot(label=\"Reference\")\n",
    "hist.groupby(\"time.dayofyear\").mean().plot(label=\"Model - biased\")\n",
    "scen.sel(time=slice(\"2000\", \"2015\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2000-15\", linestyle=\"--\"\n",
    ")\n",
    "scen.sel(time=slice(\"2015\", \"2030\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2015-30\", linestyle=\"--\"\n",
    ")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QM_doy.ds.af.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modular approach\n",
    "\n",
    "The `xsdba` module adopts a modular approach instead of implementing published and named methods directly.\n",
    "A generic bias adjustment process is laid out as follows:\n",
    "\n",
    "- preprocessing on `ref`, `hist` and `sim` (using methods in `xsdba.processing` or `xsdba.detrending`)\n",
    "- creating and training the adjustment object `Adj = Adjustment.train(obs, hist, **kwargs)` (from `xsdba.adjustment`)\n",
    "- adjustment `scen = Adj.adjust(sim, **kwargs)`\n",
    "- post-processing on `scen` (for example: re-trending)\n",
    "\n",
    "The train-adjust approach allows us to inspect the trained adjustment object. The training information is stored in the underlying `Adj.ds` dataset and often has a `af` variable with the adjustment factors. Its layout and the other available variables vary between the different algorithm, refer to their part of the API docs.\n",
    "\n",
    "For heavy processing, this separation allows the computation and writing to disk of the training dataset before performing the adjustment(s). See the [advanced notebook](advanced_example.ipynb).\n",
    "\n",
    "Parameters needed by the training and the adjustment are saved to the `Adj.ds` dataset as a  `adj_params` attribute. For other parameters, those only needed by the adjustment are passed in the `adjust` call and written to the history attribute in the output scenario DataArray.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### First example : pr and frequency adaptation\n",
    "\n",
    "The next example generates fake precipitation data and adjusts the `sim` timeseries, but also adds a step where the dry-day frequency of `hist` is adapted so that it fits that of `ref`. This ensures well-behaved adjustment factors for the smaller quantiles. Note also that we are passing `kind='*'` to use the multiplicative mode. Adjustment factors will be multiplied/divided instead of being added/subtracted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vals = np.random.randint(0, 1000, size=(t.size,)) / 100\n",
    "vals_ref = (4 ** np.where(vals < 9, vals / 100, vals)) / 3e6\n",
    "vals_sim = (\n",
    "    (1 + 0.1 * np.random.random_sample((t.size,)))\n",
    "    * (4 ** np.where(vals < 9.5, vals / 100, vals))\n",
    "    / 3e6\n",
    ")\n",
    "\n",
    "pr_ref = xr.DataArray(\n",
    "    vals_ref, coords={\"time\": t}, dims=(\"time\",), attrs={\"units\": \"mm/day\"}\n",
    ")\n",
    "pr_ref = pr_ref.sel(time=slice(\"2000\", \"2015\"))\n",
    "pr_sim = xr.DataArray(\n",
    "    vals_sim, coords={\"time\": t}, dims=(\"time\",), attrs={\"units\": \"mm/day\"}\n",
    ")\n",
    "pr_hist = pr_sim.sel(time=slice(\"2000\", \"2015\"))\n",
    "\n",
    "pr_ref.plot(alpha=0.9, label=\"Reference\")\n",
    "pr_sim.plot(alpha=0.7, label=\"Model\")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1st try without adapt_freq\n",
    "QM = xsdba.EmpiricalQuantileMapping.train(\n",
    "    pr_ref, pr_hist, nquantiles=15, kind=\"*\", group=\"time\"\n",
    ")\n",
    "scen = QM.adjust(pr_sim)\n",
    "\n",
    "pr_ref.sel(time=\"2010\").plot(alpha=0.9, label=\"Reference\")\n",
    "pr_hist.sel(time=\"2010\").plot(alpha=0.7, label=\"Model - biased\")\n",
    "scen.sel(time=\"2010\").plot(alpha=0.6, label=\"Model - adjusted\")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the figure above, `scen` has small peaks where `sim` is 0. This problem originates from the fact that there are more \"dry days\" (days with almost no precipitation) in `hist` than in `ref`. The next example works around the problem using frequency-adaptation, as described in [Themeßl et al. (2012)](https://doi.org/10.1007/s10584-011-0224-4)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2nd try with adapt_freq\n",
    "hist_ad, pth, dP0 = xsdba.processing.adapt_freq(\n",
    "    pr_ref, pr_hist, thresh=\"0.05 mm d-1\", group=\"time\"\n",
    ")\n",
    "QM_ad = xsdba.EmpiricalQuantileMapping.train(\n",
    "    pr_ref, hist_ad, nquantiles=15, kind=\"*\", group=\"time\"\n",
    ")\n",
    "scen_ad = QM_ad.adjust(pr_sim)\n",
    "\n",
    "pr_ref.sel(time=\"2010\").plot(alpha=0.9, label=\"Reference\")\n",
    "pr_sim.sel(time=\"2010\").plot(alpha=0.7, label=\"Model - biased\")\n",
    "scen_ad.sel(time=\"2010\").plot(alpha=0.6, label=\"Model - adjusted\")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Second example: tas and detrending\n",
    "\n",
    "The next example reuses the fake temperature timeseries generated at the beginning and applies the same QM adjustment method. However, for a better adjustment, we will scale sim to ref and then \"detrend\" the series, assuming the trend is linear. When `sim` (or `sim_scl`) is detrended, its values are now anomalies, so we need to normalize `ref` and `hist` so we can compare similar values.\n",
    "\n",
    "This process is detailed here to show how the `xsdba` module should be used in custom adjustment processes, but this specific method also exists as `xsdba.DetrendedQuantileMapping` and is based on [Cannon et al. 2015](https://doi.org/10.1175/JCLI-D-14-00754.1). However, `DetrendedQuantileMapping` normalizes over a `time.dayofyear` group, regardless of what is passed in the  `group` argument. As done here, it is anyway recommended to use `dayofyear` groups when normalizing, especially for variables with strong seasonal variations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doy_win31 = xsdba.Grouper(\"time.dayofyear\", window=15)\n",
    "Sca = xsdba.Scaling.train(ref, hist, group=doy_win31, kind=\"+\")\n",
    "sim_scl = Sca.adjust(sim)\n",
    "\n",
    "detrender = xsdba.detrending.PolyDetrend(degree=1, group=\"time.dayofyear\", kind=\"+\")\n",
    "sim_fit = detrender.fit(sim_scl)\n",
    "sim_detrended = sim_fit.detrend(sim_scl)\n",
    "\n",
    "ref_n, _ = xsdba.processing.normalize(ref, group=doy_win31, kind=\"+\")\n",
    "hist_n, _ = xsdba.processing.normalize(hist, group=doy_win31, kind=\"+\")\n",
    "\n",
    "QM = xsdba.EmpiricalQuantileMapping.train(\n",
    "    ref_n, hist_n, nquantiles=15, group=\"time.month\", kind=\"+\"\n",
    ")\n",
    "scen_detrended = QM.adjust(sim_detrended, extrapolation=\"constant\", interp=\"nearest\")\n",
    "scen = sim_fit.retrend(scen_detrended)\n",
    "\n",
    "\n",
    "ref.groupby(\"time.dayofyear\").mean().plot(label=\"Reference\")\n",
    "sim.groupby(\"time.dayofyear\").mean().plot(label=\"Model - biased\")\n",
    "scen.sel(time=slice(\"2000\", \"2015\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2000-15\", linestyle=\"--\"\n",
    ")\n",
    "scen.sel(time=slice(\"2015\", \"2030\")).groupby(\"time.dayofyear\").mean().plot(\n",
    "    label=\"Model - adjusted - 2015-30\", linestyle=\"--\"\n",
    ")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Third example : Multi-method protocol - Hnilica et al. 2017\n",
    "In [their paper of 2017](https://doi.org/10.1002/joc.4890), Hnilica, Hanel and Puš present a bias-adjustment method based on the principles of Principal Components Analysis.\n",
    "\n",
    "The idea is simple: use principal components to define coordinates on the reference and on the simulation, and then transform the simulation data from the latter to the former. Spatial correlation can thus be conserved by taking different points as the dimensions of the transform space. The method was demonstrated in the article by bias-adjusting precipitation over different drainage basins.\n",
    "\n",
    "The same method could be used for multivariate adjustment. The principle would be the same, concatenating the different variables into a single dataset along a new dimension. An example is given in the [advanced notebook](advanced_example.ipynb).\n",
    "\n",
    "Here we show how the modularity of `xsdba` can be used to construct a quite complex adjustment protocol involving two adjustment methods : quantile mapping and principal components. Evidently, as this example uses only 2 years of data, it is not complete. It is meant to show how the adjustment functions and how the API can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We are using xarray's \"air_temperature\" dataset\n",
    "ds = xr.tutorial.load_dataset(\"air_temperature\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To get an exaggerated example we select different points\n",
    "# here \"lon\" will be our dimension of two \"spatially correlated\" points\n",
    "reft = ds.air.isel(lat=21, lon=[40, 52]).drop_vars([\"lon\", \"lat\"])\n",
    "simt = ds.air.isel(lat=18, lon=[17, 35]).drop_vars([\"lon\", \"lat\"])\n",
    "\n",
    "# Principal Components Adj, no grouping and use \"lon\" as the space dimensions\n",
    "PCA = xsdba.PrincipalComponents.train(reft, simt, group=\"time\", crd_dim=\"lon\")\n",
    "scen1 = PCA.adjust(simt)\n",
    "\n",
    "# QM, no grouping, 20 quantiles and additive adjustment\n",
    "EQM = xsdba.EmpiricalQuantileMapping.train(\n",
    "    reft, scen1, group=\"time\", nquantiles=50, kind=\"+\"\n",
    ")\n",
    "scen2 = EQM.adjust(scen1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# some Analysis figures\n",
    "fig = plt.figure(figsize=(12, 16))\n",
    "gs = plt.matplotlib.gridspec.GridSpec(3, 2, fig)\n",
    "\n",
    "axPCA = plt.subplot(gs[0, :])\n",
    "axPCA.scatter(reft.isel(lon=0), reft.isel(lon=1), s=20, label=\"Reference\")\n",
    "axPCA.scatter(simt.isel(lon=0), simt.isel(lon=1), s=10, label=\"Simulation\")\n",
    "axPCA.scatter(scen2.isel(lon=0), scen2.isel(lon=1), s=3, label=\"Adjusted - PCA+EQM\")\n",
    "axPCA.set_xlabel(\"Point 1\")\n",
    "axPCA.set_ylabel(\"Point 2\")\n",
    "axPCA.set_title(\"PC-space\")\n",
    "axPCA.legend()\n",
    "\n",
    "refQ = reft.quantile(EQM.ds.quantiles, dim=\"time\")\n",
    "simQ = simt.quantile(EQM.ds.quantiles, dim=\"time\")\n",
    "scen1Q = scen1.quantile(EQM.ds.quantiles, dim=\"time\")\n",
    "scen2Q = scen2.quantile(EQM.ds.quantiles, dim=\"time\")\n",
    "\n",
    "axQM = None\n",
    "for i in range(2):\n",
    "    if not axQM:\n",
    "        axQM = plt.subplot(gs[1, 0])\n",
    "    else:\n",
    "        axQM = plt.subplot(gs[1, 1], sharey=axQM)\n",
    "    axQM.plot(refQ.isel(lon=i), simQ.isel(lon=i), label=\"No adj\")\n",
    "    axQM.plot(refQ.isel(lon=i), scen1Q.isel(lon=i), label=\"PCA\")\n",
    "    axQM.plot(refQ.isel(lon=i), scen2Q.isel(lon=i), label=\"PCA+EQM\")\n",
    "    axQM.plot(\n",
    "        refQ.isel(lon=i), refQ.isel(lon=i), color=\"k\", linestyle=\":\", label=\"Ideal\"\n",
    "    )\n",
    "    axQM.set_title(f\"QQ plot - Point {i + 1}\")\n",
    "    axQM.set_xlabel(\"Reference\")\n",
    "    axQM.set_xlabel(\"Model\")\n",
    "    axQM.legend()\n",
    "\n",
    "axT = plt.subplot(gs[2, :])\n",
    "reft.isel(lon=0).plot(ax=axT, label=\"Reference\")\n",
    "simt.isel(lon=0).plot(ax=axT, label=\"Unadjusted sim\")\n",
    "# scen1.isel(lon=0).plot(ax=axT, label='PCA only')\n",
    "scen2.isel(lon=0).plot(ax=axT, label=\"PCA+EQM\")\n",
    "axT.legend()\n",
    "axT.set_title(\"Timeseries - Point 1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fourth example : Multivariate bias-adjustment (Cannon, 2018)\n",
    "\n",
    "This section replicates the \"MBCn\" algorithm described by [Cannon (2018)](https://doi.org/10.1007/s00382-017-3580-6). The method relies on some univariate algorithm, an adaption of the N-pdf transform of [Pitié et al. (2005)](https://ieeexplore.ieee.org/document/1544887/) and a final reordering step.\n",
    "\n",
    "In the following, we use the Adjusted and Homogenized Canadian Climate Dataset ([AHCCD](https://open.canada.ca/data/en/dataset/9c4ebc00-3ea4-4fe0-8bf2-66cfe1cddd1d)) and CanESM2 data as reference and simulation, respectively, and correct both `pr` and `tasmax` together.\n",
    "\n",
    "> **NOTE** This is a heavy computation. Some users report that using a manual parallelization rather than relying on dask was necessary for large datasets as the computation stalled/failed because of the large number of dask tasks. This probably is also true for the method `dOTC` presented in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Perform the multivariate adjustment (MBCn)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from xsdba.units import convert_units_to\n",
    "from xclim.testing import open_dataset\n",
    "import xclim\n",
    "\n",
    "dref = open_dataset(\"sdba/ahccd_1950-2013.nc\", drop_variables=[\"lat\", \"lon\"]).sel(\n",
    "    time=slice(\"1981\", \"2010\")\n",
    ")\n",
    "\n",
    "# Fix the standard name of the `pr` variable.\n",
    "# This allows the convert_units_to below to infer the correct CF transformation (precip rate to flux)\n",
    "# see the \"Unit handling\" notebook\n",
    "dref.pr.attrs[\"standard_name\"] = \"lwe_precipitation_rate\"\n",
    "\n",
    "# \"hydro\" context from xclim allows to convert precipitations from mm d-1 -> kg m-2 s-1\n",
    "with xclim.core.units.units.context(\"hydro\"):\n",
    "    dref[\"tasmax\"] = convert_units_to(dref.tasmax, \"K\")\n",
    "    dref[\"pr\"] = convert_units_to(dref.pr, \"kg m-2 s-1\")\n",
    "\n",
    "dsim = open_dataset(\"sdba/CanESM2_1950-2100.nc\", drop_variables=[\"lat\", \"lon\"])\n",
    "\n",
    "dhist = dsim.sel(time=slice(\"1981\", \"2010\"))\n",
    "dsim = dsim.sel(time=slice(\"2041\", \"2070\"))\n",
    "\n",
    "ref = xsdba.processing.stack_variables(dref)\n",
    "hist = xsdba.processing.stack_variables(dhist)\n",
    "sim = xsdba.processing.stack_variables(dsim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ADJ = xsdba.MBCn.train(\n",
    "    ref,\n",
    "    hist,\n",
    "    base_kws={\"nquantiles\": 20, \"group\": \"time\"},\n",
    "    adj_kws={\"interp\": \"nearest\", \"extrapolation\": \"constant\"},\n",
    "    n_iter=20,  # perform 20 iteration\n",
    "    n_escore=1000,  # only send 1000 points to the escore metric\n",
    ")\n",
    "\n",
    "scenh, scens = (\n",
    "    ADJ.adjust(\n",
    "        sim=ds,\n",
    "        ref=ref,\n",
    "        hist=hist,\n",
    "        base=xsdba.QuantileDeltaMapping,\n",
    "        base_kws_vars={\n",
    "            \"pr\": {\n",
    "                \"kind\": \"*\",\n",
    "                \"jitter_under_thresh_value\": \"0.01 kg m-2 d-1\",\n",
    "                \"adapt_freq_thresh\": \"0.1 kg m-2 d-1\",\n",
    "            },\n",
    "            \"tasmax\": {\"kind\": \"+\"},\n",
    "        },\n",
    "        adj_kws={\"interp\": \"nearest\", \"extrapolation\": \"constant\"},\n",
    "    )\n",
    "    for ds in (hist, sim)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Let's trigger all the computations.\n",
    "\n",
    "The use of `dask.compute` allows the three DataArrays to be computed at the same time, avoiding repeating the common steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dask import compute\n",
    "from dask.diagnostics import ProgressBar\n",
    "\n",
    "with ProgressBar():\n",
    "    scenh, scens, escores = compute(scenh, scens, ADJ.ds.escores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare the series and look at the distance scores to see how well the N-pdf transform has converged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(1, 2, figsize=(16, 4))\n",
    "for da, label in zip((ref, scenh, hist), (\"Reference\", \"Adjusted\", \"Simulated\")):\n",
    "    ds = xsdba.unstack_variables(da).isel(location=2)\n",
    "    # time series - tasmax\n",
    "    ds.tasmax.plot(ax=axs[0], label=label, alpha=0.65 if label == \"Adjusted\" else 1)\n",
    "    # scatter plot\n",
    "    ds.plot.scatter(x=\"pr\", y=\"tasmax\", ax=axs[1], label=label)\n",
    "axs[0].legend()\n",
    "axs[1].legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "escores.isel(location=2).plot()\n",
    "plt.title(\"E-scores for each iteration.\")\n",
    "plt.xlabel(\"iteration\")\n",
    "plt.ylabel(\"E-score\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tutorial continues in the [advanced notebook](advanced_example.ipynb) with more on optimization with dask, other fancier detrending algorithms, and an example pipeline for heavy processing.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fifth example : Dynamical Optimal Transport Correction - Robin et al. 2019\n",
    "Robin, Vrac, Naveau and Yiou presented the dOTC multivariate bias correction method in a [2019 paper](https://hess.copernicus.org/articles/23/773/2019/).\n",
    "\n",
    "Here, we use optimal transport to find mappings between reference, simulated historical and simulated future data. Following these mappings, future simulation is corrected by applying the temporal evolution of model data to the reference.\n",
    "\n",
    "In the following, we use the Adjusted and Homogenized Canadian Climate Dataset ([AHCCD](https://open.canada.ca/data/en/dataset/9c4ebc00-3ea4-4fe0-8bf2-66cfe1cddd1d)) and CanESM2 data as reference and simulation, respectively, and correct both `pr` and `tasmax` together."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we are going to correct the precipitations multiplicatively to make sure they don't become negative. In this context, small precipitation values can lead to huge aberrations. This problem can be mitigated with `adapt_freq_thresh`. We also need to stack our variables into a `dataArray` before feeding them to `dOTC`.\n",
    "\n",
    "Since the precipitations are treated multiplicatively, we have no choice but to use \"std\" for the `cov_factor` argument (the default), which means the rescaling of model data to the observed data scale is done independently for every variable. In the situation where one only has additive variables, it is recommended to use the \"cholesky\" `cov_factor`, in which case the rescaling is done in a multivariate fashion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This function has random components\n",
    "np.random.seed(0)\n",
    "\n",
    "# Contrary to most algorithms in sdba, dOTC has no `train` method\n",
    "scen = xsdba.adjustment.dOTC.adjust(\n",
    "    ref,\n",
    "    hist,\n",
    "    sim,\n",
    "    kind={\n",
    "        \"pr\": \"*\"\n",
    "    },  # Since this bias correction method is multivariate, `kind` must be specified per variable\n",
    "    adapt_freq_thresh={\"pr\": \"3.5e-4 kg m-2 s-1\"},  # Idem\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some analysis figures\n",
    "\n",
    "# Unstack variables and select a location\n",
    "ref = xsdba.processing.unstack_variables(ref).isel(location=2)\n",
    "hist = xsdba.processing.unstack_variables(hist).isel(location=2)\n",
    "sim = xsdba.processing.unstack_variables(sim).isel(location=2)\n",
    "scen = xsdba.processing.unstack_variables(scen).isel(location=2)\n",
    "\n",
    "fig = plt.figure(figsize=(10, 10))\n",
    "gs = plt.matplotlib.gridspec.GridSpec(2, 2, fig)\n",
    "ax_pr = plt.subplot(gs[0, 0])\n",
    "ax_tasmax = plt.subplot(gs[0, 1])\n",
    "ax_scatter = plt.subplot(gs[1, :])\n",
    "\n",
    "# Precipitation\n",
    "hist.pr.plot(ax=ax_pr, color=\"c\", label=\"Simulation (past)\")\n",
    "ref.pr.plot(ax=ax_pr, color=\"b\", label=\"Reference\", alpha=0.5)\n",
    "sim.pr.plot(ax=ax_pr, color=\"y\", label=\"Simulation (future)\")\n",
    "scen.pr.plot(ax=ax_pr, color=\"r\", label=\"Corrected\", alpha=0.5)\n",
    "ax_pr.set_title(\"Precipitation\")\n",
    "\n",
    "# Maximum temperature\n",
    "hist.tasmax.plot(ax=ax_tasmax, color=\"c\")\n",
    "ref.tasmax.plot(ax=ax_tasmax, color=\"b\", alpha=0.5)\n",
    "sim.tasmax.plot(ax=ax_tasmax, color=\"y\")\n",
    "scen.tasmax.plot(ax=ax_tasmax, color=\"r\", alpha=0.5)\n",
    "ax_tasmax.set_title(\"Maximum temperature\")\n",
    "\n",
    "# Scatter\n",
    "ref.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"b\", edgecolors=\"k\", s=20)\n",
    "scen.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"r\", edgecolors=\"k\", s=20)\n",
    "sim.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"y\", edgecolors=\"k\", s=20)\n",
    "hist.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"c\", edgecolors=\"k\", s=20)\n",
    "ax_scatter.set_title(\"Variables distribution\")\n",
    "\n",
    "# Example mapping\n",
    "max_time = scen.pr.idxmax().data\n",
    "max_idx = np.where(scen.time.data == max_time)[0][0]\n",
    "\n",
    "scen_x = scen.tasmax.isel(time=max_idx)\n",
    "scen_y = scen.pr.isel(time=max_idx)\n",
    "sim_x = sim.tasmax.isel(time=max_idx)\n",
    "sim_y = sim.pr.isel(time=max_idx)\n",
    "\n",
    "ax_scatter.scatter(scen_x, scen_y, color=\"r\", edgecolors=\"k\", s=30, linewidth=1)\n",
    "ax_scatter.scatter(sim_x, sim_y, color=\"y\", edgecolors=\"k\", s=30, linewidth=1)\n",
    "\n",
    "prop = dict(arrowstyle=\"-|>,head_width=0.3,head_length=0.8\", facecolor=\"black\", lw=1)\n",
    "ax_scatter.annotate(\"\", xy=(scen_x, scen_y), xytext=(sim_x, sim_y), arrowprops=prop)\n",
    "\n",
    "ax_pr.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This last plot shows the correlation between input and output per variable. Here we see a relatively strong correlation for all variables, meaning they are all taken into account when finding the optimal transport mappings. This is because we're using the (by default) `normalization = 'max_distance'` argument. Were the data not normalized, the distances along the precipitation dimension would be very small relative to the temperature distances. Precipitation values would then be spread around at very low cost and have virtually no effect on the result. See this in action with `normalization = None`.\n",
    "\n",
    "The chunks we see in the tasmax data are artefacts of the `bin_width`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import gaussian_kde\n",
    "\n",
    "fig = plt.figure(figsize=(10, 5))\n",
    "gs = plt.matplotlib.gridspec.GridSpec(1, 2, fig)\n",
    "\n",
    "tasmax = plt.subplot(gs[0, 0])\n",
    "pr = plt.subplot(gs[0, 1])\n",
    "\n",
    "sim_t = sim.tasmax.to_numpy()\n",
    "scen_t = scen.tasmax.to_numpy()\n",
    "stack = np.vstack([sim_t, scen_t])\n",
    "z = gaussian_kde(stack)(stack)\n",
    "idx = z.argsort()\n",
    "sim_t, scen_t, z = sim_t[idx], scen_t[idx], z[idx]\n",
    "tasmax.scatter(sim_t, scen_t, c=z, s=1, cmap=\"viridis\")\n",
    "tasmax.set_title(\"Tasmax\")\n",
    "tasmax.set_ylabel(\"scen tasmax\")\n",
    "tasmax.set_xlabel(\"sim tasmax\")\n",
    "\n",
    "sim_p = sim.pr.to_numpy()\n",
    "scen_p = scen.pr.to_numpy()\n",
    "stack = np.vstack([sim_p, scen_p])\n",
    "z = gaussian_kde(stack)(stack)\n",
    "idx = z.argsort()\n",
    "sim_p, scen_p, z = sim_p[idx], scen_p[idx], z[idx]\n",
    "pr.scatter(sim_p, scen_p, c=z, s=1, cmap=\"viridis\")\n",
    "pr.set_title(\"Pr\")\n",
    "pr.set_ylabel(\"scen pr\")\n",
    "pr.set_xlabel(\"sim pr\")\n",
    "\n",
    "fig.suptitle(\"Correlations between input and output per variable\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sixth example : Pooling multiple members together for quantile mapping\n",
    "Here, we perform a very simple `EmpiricalQuantileMapping` adjustment, but the adjustment factors are computed by pulling all realizations of an ensemble together. The adjustment is done over the whole ensemble, using the same factors for all realizations. This behaviour is controlled through the \"grouping\", simply by passing `add_dims=['realization']` to the constructor of the `Grouper` object. These dimensions will be reduced at the same time as the main grouping dimension (usually `time`) in the `train` step.\n",
    "\n",
    "We'll create a fake ensemble by spreading out the toy data used in example 1.3.4 and we'll only adjust the historical series to see how pooling the members changed the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create fake ensemble data by taking the same tasmax data than in example 1.3.4 and adding random noise\n",
    "ref = dref.tasmax.isel(location=0, drop=True)\n",
    "\n",
    "N_members = 10\n",
    "with xr.set_options(keep_attrs=True):  # to preserve units in the addition\n",
    "    fake_ensemble_noise = xr.DataArray(\n",
    "        20 * np.random.random_sample((N_members,)) - 10, dims=(\"realization\",)\n",
    "    )\n",
    "    hist = dhist.tasmax.isel(location=0, drop=True) + fake_ensemble_noise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot it to see what it looks like\n",
    "hist.sel(time=slice(\"2000\", \"2005\")).plot(\n",
    "    hue=\"realization\", alpha=0.8, add_legend=False\n",
    ")\n",
    "ref.sel(time=slice(\"2000\", \"2005\")).plot(color=\"k\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the QDM:\n",
    "group = xsdba.Grouper(\"time\", add_dims=[\"realization\"])\n",
    "EQM = xsdba.EmpiricalQuantileMapping.train(ref, hist, group=group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we told `xsdba` to include the `realization` dimension in the reducing operations, we can see that it was in fact included in the quantile computation as it doesn't appear in the trained dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "EQM.ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scen = EQM.adjust(hist)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scen.sel(time=slice(\"2000\", \"2005\")).plot(hue=\"realization\", alpha=0.4)\n",
    "ref.sel(time=slice(\"2000\", \"2005\")).plot(color=\"k\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to see the effect of pooling the realizations together, let's redo the same adjustment without the additional argument : "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "group2 = xsdba.Grouper(\"time\")\n",
    "EQM2 = xsdba.EmpiricalQuantileMapping.train(ref, hist, group=group2)\n",
    "scen2 = EQM2.adjust(hist)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scen2.sel(time=slice(\"2000\", \"2005\")).plot(hue=\"realization\", alpha=0.4)\n",
    "ref.sel(time=slice(\"2000\", \"2005\")).plot(color=\"k\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can easily see that when the adjustment is done independently on each realization, the ensemble variability is squashed as each member is more precisely adjusted towards the reference. Pooling the members has the benefit of preserving a good part of the ensemble variability, which is sometimes interesting. Of course, in this example the ensemble has an artificial spread which doesn't look very realistic."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}