{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced tools\n", "\n", "The previous notebook covered the most common utilities of `xsdba` for conventional cases. Here, we explore more advanced usage of `xsdba` tools.\n", "\n", "## Optimization with dask\n", "\n", "Adjustment processes can be very heavy when we need to compute them over large regions and long timeseries. Using small groupings (like `time.dayofyear`) adds precision and robustness, but also decouples the load and computing complexity. Fortunately, unlike the heroic pioneers of scientific computing who managed to write parallelized FORTRAN, we now have [dask](https://dask.org/). With only a few parameters, we can magically distribute the computing load to multiple workers and threads.\n", "\n", "A good first read on the use of dask within xarray are the latter's [Optimization tips](https://xarray.pydata.org/en/stable/user-guide/dask.html#optimization-tips).\n", "\n", "Some `xsdba`-specific tips:\n", "\n", "* Most adjustment method will need to perform operation on the whole `time` coordinate, so it is best to optimize chunking along the other dimensions. This is often different from how public data is shared, where more universal 3D chunks are used.\n", "\n", " Chunking of outputs can be controlled in xarray's [to_netcdf](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html?highlight=to_netcdf#xarray.Dataset.to_netcdf). We also suggest using [Zarr](https://zarr.readthedocs.io/en/stable/) files. According to [its creators](https://ui.adsabs.harvard.edu/abs/2018AGUFMIN33A..06A/abstract), `zarr` stores should give better performances, especially because of their better ability for parallel I/O. See [Dataset.to_zarr](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html?highlight=to_zarr#xarray.Dataset.to_zarr) and this useful [rechunking package](https://rechunker.readthedocs.io).\n", "\n", "\n", "* One of the main bottleneck for adjustments with small groups is that dask needs to build and optimize an enormous task graph. This issue is alleviated with the use of `map_blocks` in the adjustment methods. However, not all adjustment methods use this optimized syntax.\n", "\n", " In order to help dask, one can split the processing in parts. For splitting training and adjustment, see [the section below](#Initializing-an-Adjustment-object-from-a-training-dataset).\n", "\n", "\n", "* Another massive bottleneck of parallelization of `xarray` is the thread-locking behaviour of some methods. It is quite difficult to isolate and avoid these locking instances, so one of the best workarounds is to use `dask` configurations with many _processes_ and few _threads_. The former do not share memory and thus are not impacted when a lock is activated from a thread in another worker. However, this adds many memory transfer operations and, by experience, reduces dask's ability to parallelize some pipelines. Such a dask Client is usually created with a large `n_workers` and a small `threads_per_worker`.\n", "\n", "\n", "* Sometimes, datasets have auxiliary coordinates (for example : lat / lon in a rotated pole dataset). Xarray handles these variables as data variables and will **not** load them if dask is used. However, in some operations, `xsdba` or `xarray` will trigger access to those variables, triggering computations each time, since they are `dask`-based. To avoid this behaviour, one can load the coordinates, or simply remove them from the inputs.\n", "\n", "\n", "## LOESS smoothing and detrending\n", "\n", "As described in Cleveland (1979), locally weighted linear regressions are multiple regression methods using a nearest-neighbour approach. Instead of using all data points to compute a linear or polynomial regression, LOESS algorithms compute a local regression for each point in the dataset, using only the k-nearest neighbours as selected by a weighting function. This weighting function must fulfill some strict requirements, see the doc of `xsdba.loess.loess_smoothing` for more details.\n", "\n", "In `xsdba`'s implementation, the user can choose between local _constancy_ ($d=0$, local estimates are weighted averages) and local _linearity_ ($d=1$, local estimates are taken from linear regressions). Two weighting functions are currently implemented : \"tricube\" ($w(x) = (1 - x^3)^3$) and \"gaussian\" ($w(x) = e^{-x^2 / 2\\sigma^2}$). Finally, the number of Cleveland's _robustifying iterations_ is controllable through `niter`. After computing an estimate of $y(x)$, the weights are modulated by a function of the distance between the estimate and the points and the procedure is started over. These iterations are made to weaken the effect of outliers on the estimate.\n", "\n", "The next example shows the application of the LOESS to daily temperature data. The black line and dot are the estimated $y$, outputs of the `xsdba.loess.loess_smoothing` function, using local linear regression (passing $d = 1$), a window spanning 20% ($f = 0.2$) of the domain, the \"tricube\" weighting function and only one iteration. The red curve illustrates the weighting function on January 1st 2014, where the red circles are the nearest-neighbours used in the estimation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "from tempfile import TemporaryDirectory\n", "\n", "import matplotlib.pyplot as plt\n", "import nc_time_axis\n", "import numpy as np\n", "import xarray as xr\n", "\n", "from xsdba import loess\n", "\n", "%matplotlib inline\n", "plt.style.use(\"seaborn-v0_8\")\n", "plt.rcParams[\"figure.figsize\"] = (11, 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Daily temperature data from xarray's tutorials\n", "ds = xr.tutorial.open_dataset(\"air_temperature\").resample(time=\"D\").mean()\n", "tas = ds.isel(lat=0, lon=0).air\n", "\n", "# Compute the smoothed series\n", "f = 0.2\n", "ys = loess.loess_smoothing(tas, d=1, weights=\"tricube\", f=f, niter=1)\n", "\n", "# Plot data points and smoothed series\n", "fig, ax = plt.subplots()\n", "ax.plot(tas.time, tas, \"o\", fillstyle=\"none\")\n", "ax.plot(tas.time, ys, \"k\")\n", "ax.set_xlabel(\"Time\")\n", "ax.set_ylabel(\"Temperature [K]\")\n", "\n", "## The code below calls internal functions to demonstrate how the weights are computed.\n", "\n", "# LOESS algorithms as implemented here use scaled coordinates.\n", "x = tas.time\n", "x = (x - x[0]) / (x[-1] - x[0])\n", "xi = x[366]\n", "ti = tas.time[366]\n", "\n", "# Weighting function take the distance with all neighbors scaled by the r parameter as input\n", "r = int(f * tas.time.size)\n", "h = np.sort(np.abs(x - xi))[r]\n", "weights = loess._tricube_weighting(np.abs(x - xi).values / h)\n", "\n", "# Plot nearest neighbors and weighing function\n", "wax = ax.twinx()\n", "wax.plot(tas.time, weights, color=\"indianred\")\n", "ax.plot(tas.time, tas.where(tas * weights > 0), \"o\", color=\"lightcoral\", alpha=0.5)\n", "\n", "ax.plot(ti, ys[366], \"ko\")\n", "wax.set_ylabel(\"Weights\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LOESS smoothing can suffer from heavy boundary effects. On the previous graph, we can associate the strange bend on the left end of the line to them. The next example shows a stronger case. Usually, $\\frac{f}{2}N$ points on each side should be discarded. On the other hand, LOESS has the advantage of always staying within the bounds of the data.\n", "\n", "\n", "### LOESS Detrending\n", "\n", "In climate science, it can be used in the detrending process. `xsdba` provides `xsdba.detrending.LoessDetrend` in order to compute trend with the LOESS smoothing and remove them from timeseries.\n", "\n", "First we create some toy data with a sinusoidal annual cycle, random noise and a linear temperature increase." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "time = xr.cftime_range(\"1990-01-01\", \"2049-12-31\", calendar=\"noleap\")\n", "tas = xr.DataArray(\n", " (\n", " 10 * np.sin(time.dayofyear * 2 * np.pi / 365)\n", " + 5 * (np.random.random_sample(time.size) - 0.5) # Annual variability\n", " + np.linspace(0, 1.5, num=time.size) # Random noise\n", " ), # 1.5 degC increase in 60 years\n", " dims=(\"time\",),\n", " coords={\"time\": time},\n", " attrs={\"units\": \"degC\"},\n", " name=\"temperature\",\n", ")\n", "tas.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we compute the trend on the data. Here, we compute on the whole timeseries (`group='time'`) with the parameters suggested above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from xsdba.detrending import LoessDetrend\n", "\n", "# Create the detrending object\n", "det = LoessDetrend(group=\"time\", d=0, niter=2, f=0.2)\n", "# Fitting returns a new object and computes the trend.\n", "fit = det.fit(tas)\n", "# Get the detrended series\n", "tas_det = fit.detrend(tas)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "fit.ds.trend.plot(ax=ax, label=\"Computed trend\")\n", "ax.plot(time, np.linspace(0, 1.5, num=time.size), label=\"Expected tred\")\n", "ax.plot([time[0], time[int(0.1 * time.size)]], [0.4, 0.4], linewidth=6, color=\"gray\")\n", "ax.plot([time[-int(0.1 * time.size)], time[-1]], [1.1, 1.1], linewidth=6, color=\"gray\")\n", "ax.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As said earlier, this example shows how the Loess has strong boundary effects. It is recommended to remove the $\\frac{f}{2}\\cdot N$ outermost points on each side, as shown by the gray bars in the graph above.\n", "\n", "## Initializing an Adjustment object from a training dataset\n", "\n", "For large scale uses, when the training step deserves its own computation and write to disk, or simply when there are multiples `sim` to be adjusted with the same training, it is helpful to be able to instantiate the Adjustment objects from the training dataset itself. This trick relies on a global attribute \"adj_params\" set on the training dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr\n", "\n", "# Create toy data for the example, here fake temperature timeseries\n", "t = xr.cftime_range(\"2000-01-01\", \"2030-12-31\", freq=\"D\", calendar=\"noleap\")\n", "ref = xr.DataArray(\n", " (\n", " -20 * np.cos(2 * np.pi * t.dayofyear / 365)\n", " + 2 * np.random.random_sample((t.size,))\n", " + 273.15\n", " + 0.1 * (t - t[0]).days / 365\n", " ), # \"warming\" of 1K per decade,\n", " dims=(\"time\",),\n", " coords={\"time\": t},\n", " attrs={\"units\": \"K\"},\n", ")\n", "sim = xr.DataArray(\n", " (\n", " -18 * np.cos(2 * np.pi * t.dayofyear / 365)\n", " + 2 * np.random.random_sample((t.size,))\n", " + 273.15\n", " + 0.11 * (t - t[0]).days / 365\n", " ), # \"warming\" of 1.1K per decade\n", " dims=(\"time\",),\n", " coords={\"time\": t},\n", " attrs={\"units\": \"K\"},\n", ")\n", "\n", "ref = ref.sel(time=slice(None, \"2015-01-01\"))\n", "hist = sim.sel(time=slice(None, \"2015-01-01\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from xsdba.adjustment import QuantileDeltaMapping\n", "\n", "QDM = QuantileDeltaMapping.train(\n", " ref, hist, nquantiles=15, kind=\"+\", group=\"time.dayofyear\"\n", ")\n", "QDM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The trained `QDM` exposes the training data in the `ds` attribute, Here, we will write it to disk, read it back and initialize a new object from it. Notice the `adj_params` in the dataset, that has the same value as the repr string printed just above. Also, notice the `_xsdba_adjustment` attribute that contains a JSON string, so we can rebuild the adjustment object later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "QDM.ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The engine keyword is only needed if netCDF4 is not available\n", "with TemporaryDirectory() as tmpdir:\n", " QDM.ds.to_netcdf(f\"{tmpdir}/QDM_training.nc\", engine=\"h5netcdf\")\n", " ds = xr.open_dataset(f\"{tmpdir}/QDM_training.nc\", engine=\"h5netcdf\").load()\n", " QDM2 = QuantileDeltaMapping.from_dataset(ds)\n", " ds.close()\n", "QDM2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case above, creating a full object from the dataset doesn't make the most sense since we are in the same python session, with the \"old\" object still available. This method effective when we reload the training data in a different python session, say on another computer. **However, take note that there is no retro-compatibility insurance.** If the `QuantileDeltaMapping` class was to change in a new `xsdba` version, one would not be able to create the new object from a dataset saved with the old one.\n", "\n", "For the case where we stay in the same Python session, it is still useful to trigger the dask computations. For small datasets, that could mean a simple `QDM.ds.load()`, but sometimes even the training data is too large to be full loaded in memory. In that case, we could also do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The engine keyword is only needed if netCDF4 is not available\n", "with TemporaryDirectory(delete=False) as tmpdir:\n", " QDM.ds.to_netcdf(f\"{tmpdir}/QDM_training2.nc\", engine=\"h5netcdf\")\n", " ds = xr.open_dataset(f\"{tmpdir}/QDM_training2.nc\", engine=\"h5netcdf\")\n", " ds.attrs[\"title\"] = \"This is the dataset, but read from disk.\"\n", " QDM.set_dataset(ds)\n", "QDM.ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "QDM2.adjust(sim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieving extra output diagnostics\n", "\n", "\n", "\n", "To fully understand what is happening during the bias-adjustment process, `xsdba` can output _diagnostic_ variables, giving more visibility to what the adjustment is doing behind the scene. This behaviour, a `verbose` option, is controlled by the `extra_output` option, set with `xsdba.set_options`. When `True`, `train` calls are instructed to include additional variables to the training datasets. In addition, the `adjust` calls will always output a dataset, with `scen` and, depending on the algorithm, other diagnostics variables. See the documentation of each `Adjustment` objects to see what extra variables are available.\n", "\n", "For the moment, this feature is still under construction and only a few `Adjustment` actually provide these extra outputs. Please open issues on the GitHub repo if you have needs or ideas of interesting diagnostic variables.\n", "\n", "For example, `QDM.adjust` adds `sim_q`, which gives the quantile of each element of `sim` within its group." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from xsdba import set_options\n", "\n", "with set_options(extra_output=True):\n", " QDM = QuantileDeltaMapping.train(\n", " ref, hist, nquantiles=15, kind=\"+\", group=\"time.dayofyear\"\n", " )\n", " out = QDM.adjust(sim)\n", "\n", "out.sim_q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Moving window for adjustments\n", "\n", "Some Adjustment methods require that the adjusted data (`sim`) be of the same length (same number of points) than the training data (`ref` and `hist`). These requirements often ensure conservation of statistical properties and a better representation of the climate change signal over the long adjusted timeseries.\n", "\n", "In opposition to a conventional \"rolling window\", here it is the _years_ that are the base units of the window, not the elements themselves. `xsdba` implements `xsdba.stack_periods` and `xsdba.unstack_periods` to manipulate data in that goal. The \"stack\" function cuts the data in overlapping windows of a certain length and stacks them along a new `\"period\"` dimension, alike to xarray's `da.rolling(time=win).construct('period')`, but with yearly steps. The stride (or step) between each window can also be controlled. This argument is an indicator of how many years overlap between each window. With a value of `1`, a window will have `window - 1` years overlapping with the previous one. The default (`None`) is to have `stride = window` will result in no overlap at all. The default units in which `window` and `stride` are given is a year (\"YS\"), but can be changed with argument `freq`.\n", "\n", "By chunking the result along this `'period'` dimension, it is expected to be more computationally efficient (when using `dask`) than looping over the windows with a for-loop (or a `GroupyBy`)\n", "\n", "Note that this results in two restrictions:\n", "\n", "1. The constructed array has the same \"time\" axis for all windows. This is a problem if the actual _year_ is of importance for the adjustment, but this is not the case for any of `xsdba`'s current adjustment methods.\n", "2. The input timeseries must be in a calendar with uniform year lengths. For daily data, this means only the \"360_day\", \"noleap\" and \"all_leap\" calendars are supported.\n", "\n", "The \"unstack\" function does the opposite: it concatenates the windows together to recreate the original timeseries. It only works for the no-overlap case where `stride = window` and for the non-ambiguous one where `stride` divides `window` into an odd number (N) of parts. In that latter situation, the middle parts of each period are kept when reconstructing the timeseries, in addition to the first (last) parts of the first (last) period needed to get a full timeseries.\n", "\n", "Quantile Delta Mapping requires that the adjustment period should be of a length similar to the training one. As our `ref` and `hist` cover 15 years but `sim` covers 31 years, we will transform `sim` by stacking windows of 15 years. With a stride of five (5) years, this means the first window goes from 2000 to 2014 (inclusive). Then 2005-2019, 2010-2024 and 2015-2029. The last year will be dropped as it can't be included in any complete window.\n", "\n", "
\n", "\n", "In the following example, `QDM` is configured with `group=\"time.dayofyear\"` which will perform the adjustment for each day of year (doy) separately. When using `stack_periods` the extracted windows are all concatenated along the new `period` axis, and they all share the same time coordinate. As such, for the `doy` information to make sense, we must use a calendar with uniform year lengths. Otherwise, the `doy` values would shift one day at each leap year.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "QDM = QuantileDeltaMapping.train(\n", " ref, hist, nquantiles=15, kind=\"+\", group=\"time.dayofyear\"\n", ")\n", "\n", "scen_nowin = QDM.adjust(sim)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from xsdba.base import stack_periods, unstack_periods\n", "\n", "sim_win = stack_periods(sim, window=15, stride=5)\n", "sim_win" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we retrieve the full timeseries (minus the lasy year that couldn't fit in any window)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scen_win = unstack_periods(QDM.adjust(sim_win))\n", "scen_win" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full example: Multivariate adjustment in the additive space\n", "\n", "The following example shows a complete bias-adjustment workflow using the `PrincipalComponents` method in a multi-variate configuration. Moreover, it uses the trick showed by [Alavoine et Grenier (2022)](https://doi.org/10.31223/X5C34C) to transform \"multiplicative\" variable to the \"additive\" space using log and logit transformations. This way, we can perform multi-variate adjustment with variables that couldn't be used in the same _kind_ of adjustment, like \"tas\" and \"hurs\".\n", "\n", "We will transform the variables that need it to the additive space, adding some jitter in the process to avoid $log(0)$ situations. Then, we will stack the different variables into a single `DataArray`, allowing us to use `PrincipalComponents` in a multi-variate way. Following the PCA, a simple quantile-mapping method is used, both adjustment acting on the residuals, while the mean of the simulated trend is adjusted on its own. Each step will be explained.\n", "\n", "First, open the data, convert the calendar and the units. Because we will perform adjustments on \"dayofyear\" groups (with a window), keeping standard calendars results in an extra \"dayofyear\" with only a quarter of the data. It's usual to transform to a \"noleap\" calendar, which drops the 29th of February, as it only has a small impact on the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xsdba\n", "from xsdba.units import convert_units_to, pint_multiply\n", "from xclim.testing import open_dataset\n", "\n", "group = xsdba.Grouper(\"time.dayofyear\", window=31)\n", "\n", "dref = (\n", " open_dataset(\"sdba/ahccd_1950-2013.nc\")\n", " .convert_calendar(\"noleap\")\n", " .sel(time=slice(\"1981\", \"2010\"))\n", ")\n", "dsim = open_dataset(\"sdba/CanESM2_1950-2100.nc\")\n", "\n", "dref = dref.assign(\n", " tasmax=convert_units_to(dref.tasmax, \"K\"),\n", ")\n", "inverse_water_density = \"1e-03 m^3/kg\"\n", "dsim = dsim.assign(\n", " pr=convert_units_to(pint_multiply(dsim.pr, inverse_water_density), \"mm/d\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Jitter, additive space transformation and variable stacking\n", "Here, `tasmax` is already ready to be adjusted in an additive way, because all data points are far from the physical zero (0 K). This is not the case for `pr`, which is why we want to transform that variable to the additive space, to avoid splitting our workflow in two. For `pr` the \"log\" transformation is simply:\n", "\n", "$$ pr' = \\ln\\left(pr - b\\right) $$\n", "\n", "Where $b$ is the lower bound, here 0 mm/d. However, we could have exact zeros (0 mm/d) in the datasets, which will translate into $-\\infty$. To avoid this, we simply replace the smallest values by a random distribution of very small, but not problematic, values. In the following, all values below 0.1 mm/d are replaced by a uniform random distribution of values within the range (0, 0.1) mm/d (bounds excluded).\n", "\n", "Finally, the variables are stacked together into a single `DataArray`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dref_as = dref.assign(\n", " pr=xsdba.processing.to_additive_space(\n", " xsdba.processing.jitter(dref.pr, lower=\"0.1 mm/d\", minimum=\"0 mm/d\"),\n", " lower_bound=\"0 mm/d\",\n", " trans=\"log\",\n", " )\n", ")\n", "ref = xsdba.stack_variables(dref_as)\n", "\n", "dsim_as = dsim.assign(\n", " pr=xsdba.processing.to_additive_space(\n", " xsdba.processing.jitter(dsim.pr, lower=\"0.1 mm/d\", minimum=\"0 mm/d\"),\n", " lower_bound=\"0 mm/d\",\n", " trans=\"log\",\n", " )\n", ")\n", "sim = xsdba.stack_variables(dsim_as)\n", "sim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Get residuals and trends\n", "The adjustment will be performed on residuals only. The adjusted timeseries `sim` will be detrended with the LOESS routine described above. Because of the short length of `ref` and `hist` and the potential boundary effects of using LOESS with them, we compute the 30-year mean. In other words, instead of _detrending_ inputs, we are _normalizing_ those inputs.\n", "\n", "While the residuals are adjusted with `PrincipalComponents` and `EmpiricalQuantileMapping`, the trend of `sim` still needs to be offset according to the means of `ref` and `hist`. This is similar to what `DetrendedQuantileMapping` does. The offset step could have been done on the trend itself or at the end on `scen`, it doesn't really matter. We do it here because it keeps it close to where the `scaling` is computed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ref_res, ref_norm = xsdba.processing.normalize(ref, group=group, kind=\"+\")\n", "hist_res, hist_norm = xsdba.processing.normalize(\n", " sim.sel(time=slice(\"1981\", \"2010\")), group=group, kind=\"+\"\n", ")\n", "scaling = xsdba.utils.get_correction(hist_norm, ref_norm, kind=\"+\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sim_scaled = xsdba.utils.apply_correction(\n", " sim, xsdba.utils.broadcast(scaling, sim, group=group), kind=\"+\"\n", ")\n", "\n", "loess = xsdba.detrending.LoessDetrend(group=group, f=0.2, d=0, kind=\"+\", niter=1)\n", "simfit = loess.fit(sim_scaled)\n", "sim_res = simfit.detrend(sim_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Adjustments\n", "Following, Alavoine et Grenier (2022), we decided to perform the multivariate Principal Components adjustment first and then re-adjust with the simple Quantile-Mapping." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PCA = xsdba.adjustment.PrincipalComponents.train(\n", " ref_res, hist_res, group=group, crd_dim=\"multivar\", best_orientation=\"simple\"\n", ")\n", "\n", "scen1_res = PCA.adjust(sim_res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EQM = xsdba.adjustment.EmpiricalQuantileMapping.train(\n", " ref_res,\n", " scen1_res.sel(time=slice(\"1981\", \"2010\")),\n", " group=group,\n", " nquantiles=50,\n", " kind=\"+\",\n", ")\n", "\n", "scen2_res = EQM.adjust(scen1_res, interp=\"linear\", extrapolation=\"constant\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Re-trend and transform back to the physical space\n", "Add back the trend (which includes the scaling), unstack the variables to a dataset and transform `pr` back to the physical space. All functions have conserved and handled the attributes, so we don't need to repeat the additive space bounds. The annual cycle of both variables on the reference period in Vancouver is plotted to confirm the adjustment adds a positive effect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scen = simfit.retrend(scen2_res)\n", "dscen_as = xsdba.unstack_variables(scen)\n", "dscen = dscen_as.assign(pr=xsdba.processing.from_additive_space(dscen_as.pr))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# gather data together for plotting\n", "dsl = [\n", " ds.assign_coords({\"data_type\": lab})\n", " for ds, lab in zip([dref, dsim, dscen], [\"obs\", \"raw\", \"scen\"])\n", "]\n", "dsout = xr.concat(dsl, dim=\"data_type\")\n", "\n", "fig, axs = plt.subplots(1, 2, figsize=(12, 4))\n", "fg = (\n", " dsout.tasmax.sel(time=slice(\"1981\", \"2010\"), location=\"Vancouver\")\n", " .groupby(\"time.dayofyear\")\n", " .mean()\n", " .plot(hue=\"data_type\", ax=axs[0])\n", ")\n", "axs[0].get_legend().set_title(\"\")\n", "dsout.pr.sel(time=slice(\"1981\", \"2010\"), location=\"Vancouver\").groupby(\n", " \"time.dayofyear\"\n", ").mean().plot(hue=\"data_type\", ax=axs[1])\n", "axs[1].get_legend().set_title(\"\")" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Frequency adaption with a rolling window\n", "\n", "In the previous example, we performed bias adjustment with a rolling window. Here we show how to include frequency adaptation (see `example.ipynb` for the simple case `group=\"time\"`). We first generate the same precipitation dataset used in `example.ipynb`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr\n", "\n", "t = xr.cftime_range(\"2000-01-01\", \"2030-12-31\", freq=\"D\", calendar=\"noleap\")\n", "\n", "vals = np.random.randint(0, 1000, size=(t.size,)) / 100\n", "vals_ref = (4 ** np.where(vals < 9, vals / 100, vals)) / 3e6\n", "vals_sim = (\n", " (1 + 0.1 * np.random.random_sample((t.size,)))\n", " * (4 ** np.where(vals < 9.5, vals / 100, vals))\n", " / 3e6\n", ")\n", "\n", "pr_ref = xr.DataArray(\n", " vals_ref, coords={\"time\": t}, dims=(\"time\",), attrs={\"units\": \"mm/day\"}\n", ")\n", "pr_ref = pr_ref.sel(time=slice(\"2000\", \"2015\"))\n", "pr_sim = xr.DataArray(\n", " vals_sim, coords={\"time\": t}, dims=(\"time\",), attrs={\"units\": \"mm/day\"}\n", ")\n", "pr_hist = pr_sim.sel(time=slice(\"2000\", \"2015\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bias adjustment on a rolling window can be performed in the same way as shown in `example.ipynb`, but instead of being a single string precising the time grouping (e.g. `time.month`), the `group` argument is built with `sdba.Grouper` function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# adapt_freq with a xsdba.Grouper\n", "import xsdba\n", "\n", "group = xsdba.Grouper(\"time.dayofyear\", window=31)\n", "hist_ad, pth, dP0 = xsdba.processing.adapt_freq(\n", " pr_ref, pr_hist, thresh=\"0.05 mm d-1\", group=group\n", ")\n", "QM_ad = xsdba.EmpiricalQuantileMapping.train(\n", " pr_ref, hist_ad, nquantiles=15, kind=\"*\", group=group\n", ")\n", "scen_ad = QM_ad.adjust(pr_sim)\n", "\n", "pr_ref.sel(time=\"2010\").plot(alpha=0.9, label=\"Reference\")\n", "pr_sim.sel(time=\"2010\").plot(alpha=0.7, label=\"Model - biased\")\n", "scen_ad.sel(time=\"2010\").plot(alpha=0.6, label=\"Model - adjusted\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the figure above, `scen` occasionally has small peaks where `sim` is 0, indicating that there are more \"dry days\" (days with almost no precipitation) in `hist` than in `ref`. The frequency-adaptation [Themeßl et al. (2010)](https://doi.org/10.1007/s10584-011-0224-4) performed in the step above only worked partially.\n", "\n", "The reason for this is the following. The first step above combines precipitations in 365 overlapping blocks of 31 days * Y years, one block for each day of the year. Each block is adapted, and the 16th day-of-year slice (at the center of the block) is assigned to the corresponding day-of-year in the adapted dataset `hist_ad`. As we proceed to the training, we re-form those 31 days * Y years blocks, but this step does not invert the last one: There can still be more zeroes in the simulation than in the reference.\n", "\n", "To alleviate this issue, another way of proceeding is to perform a frequency adaptation on the blocks, and then use the same blocks in the training step, as we show below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# adapt_freq directly in the training step\n", "group = xsdba.Grouper(\"time.dayofyear\", window=31)\n", "\n", "QM_ad = xsdba.EmpiricalQuantileMapping.train(\n", " pr_ref,\n", " pr_hist,\n", " nquantiles=15,\n", " kind=\"*\",\n", " group=group,\n", " adapt_freq_thresh=\"0.05 mm d-1\",\n", ")\n", "scen_ad = QM_ad.adjust(pr_sim)\n", "\n", "pr_ref.sel(time=\"2010\").plot(alpha=0.9, label=\"Reference\")\n", "pr_sim.sel(time=\"2010\").plot(alpha=0.7, label=\"Model - biased\")\n", "scen_ad.sel(time=\"2010\").plot(alpha=0.6, label=\"Model - adjusted\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tests for sdba\n", "\n", "It can be useful to perform diagnostic tests on adjusted simulations to assess if the bias correction method is working properly, or to compare two different bias correction techniques.\n", "\n", "A diagnostic test includes calculations of a property (mean, 20-year return value, annual cycle amplitude, ...) on the simulation and on the scenario (adjusted simulation), then a measure (bias, relative bias, ratio, ...) of the difference. Usually, the property collapse the time dimension of the simulation/scenario and returns one value by grid point.\n", "\n", "You'll find those in ``xsdba.properties`` and ``xsdba.measures``, where they are implemented as classes similar to those of `xclim`'s ``Indicator``. `xclim` needs to be installed to use these modules. This can be done simply with \n", "\n", "```shell\n", "pip install xclim\n", "```\n", "\n", "or through `xsdba`'s development environment\n", "\n", "```shell\n", "pip install xsdba['dev']\n", "\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "\n", "import xsdba\n", "from xclim.testing import open_dataset\n", "\n", "# load test data\n", "hist = open_dataset(\"sdba/CanESM2_1950-2100.nc\").sel(time=slice(\"1950\", \"1980\")).tasmax\n", "ref = open_dataset(\"sdba/nrcan_1950-2013.nc\").sel(time=slice(\"1950\", \"1980\")).tasmax\n", "sim = (\n", " open_dataset(\"sdba/CanESM2_1950-2100.nc\").sel(time=slice(\"1980\", \"2010\")).tasmax\n", ") # biased\n", "\n", "# learn the bias in historical simulation compared to reference\n", "QM = xsdba.EmpiricalQuantileMapping.train(\n", " ref, hist, nquantiles=50, group=\"time\", kind=\"+\"\n", ")\n", "\n", "# correct the bias in the future\n", "scen = QM.adjust(sim, extrapolation=\"constant\", interp=\"nearest\")\n", "ref_future = (\n", " open_dataset(\"sdba/nrcan_1950-2013.nc\").sel(time=slice(\"1980\", \"2010\")).tasmax\n", ") # truth\n", "\n", "plt.figure(figsize=(15, 5))\n", "lw = 0.3\n", "sim.isel(location=1).plot(label=\"sim\", linewidth=lw)\n", "scen.isel(location=1).plot(label=\"scen\", linewidth=lw)\n", "hist.isel(location=1).plot(label=\"hist\", linewidth=lw)\n", "ref.isel(location=1).plot(label=\"ref\", linewidth=lw)\n", "ref_future.isel(location=1).plot(label=\"ref_future\", linewidth=lw)\n", "leg = plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate the mean warm Spell Length Distribution\n", "sim_prop = xsdba.properties.spell_length_distribution(\n", " da=sim, thresh=\"28 degC\", op=\">\", stat=\"mean\", group=\"time\"\n", ")\n", "\n", "\n", "scen_prop = xsdba.properties.spell_length_distribution(\n", " da=scen, thresh=\"28 degC\", op=\">\", stat=\"mean\", group=\"time\"\n", ")\n", "\n", "ref_prop = xsdba.properties.spell_length_distribution(\n", " da=ref_future, thresh=\"28 degC\", op=\">\", stat=\"mean\", group=\"time\"\n", ")\n", "# measure the difference between the prediction and the reference with an absolute bias of the properties\n", "measure_sim = xsdba.measures.bias(sim_prop, ref_prop)\n", "measure_scen = xsdba.measures.bias(scen_prop, ref_prop)\n", "\n", "plt.figure(figsize=(5, 3))\n", "plt.plot(measure_sim.location, measure_sim.values, \".\", label=\"biased model (sim)\")\n", "plt.plot(measure_scen.location, measure_scen.values, \".\", label=\"adjusted model (scen)\")\n", "plt.title(\n", " \"Bias of the mean of the warm spell\\nlength distribution compared to observations\"\n", ")\n", "plt.legend()\n", "plt.ylim(-2.5, 2.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible the change the 'group' of the property from 'time' to 'time.season' or 'time.month'.\n", " This will return 4 or 12 values per grid point, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate the mean warm Spell Length Distribution\n", "sim_prop, scen_prop, ref_prop = (\n", " xsdba.properties.spell_length_distribution(\n", " da=ds0, thresh=\"28 degC\", op=\">\", stat=\"mean\", group=\"time.season\"\n", " )\n", " for ds0 in [sim, scen, ref_future]\n", ")\n", "\n", "# Properties are often associated with the same measures. This correspondence is implemented in xsdba:\n", "measure = xsdba.properties.spell_length_distribution.get_measure()\n", "measure_sim = measure(sim_prop, ref_prop)\n", "measure_scen = measure(scen_prop, ref_prop)\n", "\n", "# Gather data together and plot\n", "measl = [\n", " meas.assign_coords(data_type=lab)\n", " for meas, lab in zip(\n", " [measure_sim, measure_scen], [\"biased model (sim)\", \"biased model (scen)\"]\n", " )\n", "]\n", "measure_all = xr.concat(measl, dim=\"data_type\")\n", "fg = measure_all.plot(col=\"season\", col_wrap=2, hue=\"data_type\", marker=\"o\", ls=\"\")\n", "fg.fig.subplots_adjust(top=0.9)\n", "fg.fig.suptitle(\n", " \"Bias of the mean of the warm spell length distribution compared to observations\"\n", ")\n", "fg.figlegend.set_title(\"\")\n", "for ax in fg.fig.axes:\n", " ax.set_ylim(-2.5, 2.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Spectral variance \n", "\n", "Given a field on a bi-dimensional lattice $f_{i,j}$, with $i \\in \\{1,2,\\dots, N_i\\}$ and $j \\in \\{1,2,\\dots, N_j\\}$, one can obtain its discrete cosine transform (DCT) as $F_{m,n} = \\mathcal{F}_{\\text{DCT}}(f_{i,j})$, where $m,n$ denote wavenumber for cosinus eigenmodes. This is done using `scipy`, with `scipy.fft.dctn(fij, norm='ortho')`. The variance of this field $\\sigma^2 = \\langle f_{i,j}^2 \\rangle - \\langle f_{i,j} \\rangle^2$ can be obtained using this spectral representation \n", "$$\n", "\\sigma^2 = \\frac{1}{N_i N_j} \\underset{(m,n) \\neq (0,0)}{\\sum_{m=0}^{N_i}\\sum_{n=0}^{N_j}} F_{m,n}^2 \\equiv \\underset{(m,n) \\neq (0,0)}{\\sum_{m=0}^{N_i}\\sum_{n=0}^{N_j}} \\sigma^2_{m,n}\n", "$$\n", "\n", "Defining a normalized radial wavenumber \n", "$$\n", "\\alpha_{m,n}^{0}=\\sqrt{\\left(\\frac{n}{N_i}\\right)^{2}+\\left(\\frac{m}{N_j}\\right)^{2}},\n", "$$\n", "By defining radial bands that increase in integer steps determined by the lattice size $1/\\min(N_i, N_j)$\n", "$$\n", "\\alpha(k) = \\frac{k}{\\min(N_i, N_j)}, \\quad \\alpha(k+1) = \\alpha(k) + \\Delta \\alpha(k) = \\frac{k+1}{\\min(N_i, N_j)},\n", "$$\n", "the variance spectral coefficients $\\sigma^2_{m,n}$ can be organized as\n", "$$\n", "\\sigma^2(\\alpha) = {\\sum_{m,n | \\alpha \\leq \\alpha_{m,n}^{0} \\leq \\alpha + \\Delta \\alpha}} \\sigma^2_{m,n}. \n", "$$\n", "\n", "The method `xsdba.properties.spectral_variance` takes as an input the spatial field $f_{i,j}$ and gives as an output its spectral variance $\\sigma^2(\\alpha)$.\n", "\n", "If a length scale $\\Delta$ (`delta`) is given, the wavenumber is converted to a dimensionful wavelength through the relation \n", "$$\n", "\\lambda = \\frac{2 \\Delta}{\\alpha}\n", "$$\n", "e.g. $\\Delta$ can be chosen as the nominal resolution of the model specific by the grid resolution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xsdba\n", "tas = xr.tutorial.load_dataset('air_temperature').isel(time=0).air\n", "\n", "# spectral variance plot\n", "var_wavenumber = xsdba.properties.spectral_variance(tas, dims=['lon','lat'])\n", "fig, axs = plt.subplots(1,2, figsize=(16,4))\n", "ax = axs[0]\n", "var_wavenumber.plot(xscale='log', yscale='log', label=\"tas\",ax=ax)\n", "\n", "# spectral variance plot — using `delta` to compute a dimensionful wavelength\n", "\n", "# Approximation of length scale from latitude np.abs(tas.lat.diff(dim='lat')[0].values)*111.2 gives 278 km\n", "# There is also a helper function \n", "delta = xsdba.processing.estimate_delta_from_cf(tas) # 278.0 km\n", "var_wavelength = xsdba.properties.spectral_variance(tas, dims=['lon','lat'], delta=delta)\n", "ax = axs[1]\n", "var_wavelength.plot(xscale='log', yscale='log', label=\"tas\", ax=ax)\n", "ax.xaxis.set_inverted(True) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DCT filter\n", "\n", "Following the DCT transform, a filter can be applied on the Fourier coefficients $F_{m,n}$ to remove contributions at a small length scale\n", "\n", "$$\n", "\\begin{align}\n", "F_{m,n}\\to T(\\alpha^{0}_{m,n})F_{m,n}\\equiv F'_{m,n}, \\quad T \\in [0,1].\n", "\\end{align}\n", "$$\n", "\n", "By default, a cosine-squared profile in $\\alpha^{0}_{m,n}$ is used for $T$. We can work directly with the normalized wavenumber to specify the bounds of the filter, e.g. $\\alpha_{\\rm low} = 0.2$ up to $\\alpha_{\\rm high} = 0.5$ used below. The bounds can also be specified with length units by specifying `lam_long, lam_short, delta`.\n", "\n", "NOTE: This cannot be used if the dataset contains nan values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cutoff bounds for the normalized wavenumber \n", "alpha_low, alpha_high = 0.2, 0.5\n", "\n", "# applying the filter and computing variances before and after the filter\n", "tas_filtered = xsdba.processing.spectral_filter(tas, dims=['lat', 'lon'], lam_long=None, lam_short=None, delta = None, alpha_low_high=(alpha_low, alpha_high))\n", "var = xsdba.properties.spectral_variance(tas, dims=['lon','lat'])\n", "var_filtered = xsdba.properties.spectral_variance(tas_filtered, dims=['lon','lat'])\n", "\n", "# spectral variance plots\n", "var.plot(xscale='log', yscale='log', label=\"tas\")\n", "var_filtered.plot(xscale='log', yscale='log', label=\"tas (filtered)\")\n", "plt.axvline(x=alpha_low, ls=\":\", color=\"grey\", label=\"alpha-low\")\n", "plt.axvline(x=alpha_high, ls=\"--\", color=\"grey\", label=\"alpha-high\")\n", "plt.ylim(10**(-10),None)\n", "plt.title('Spectral variance of surface temperature field with and without filtering')\n", "plt.legend()\n", "\n", "# spatial plot\n", "combination = xr.concat([tas.expand_dims(type=['original']),tas_filtered.expand_dims(type=['filtered'])], dim='type')\n", "combination.plot(col='type')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned previously, quantities with units (wavelengths and a length scale) can also be used to perform the same computation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delta = \"278 km\"\n", "lam_long = \"2780 km\" # 2 Delta / alpha_low = 2 * 278 / 0.2\n", "lam_short = \"1110 km\" # 2 Delta / alpha_high = 2 * 278 / 0.5\n", "tas_filtered = xsdba.processing.spectral_filter(tas, dims=['lat', 'lon'], lam_long=lam_long, lam_short=lam_short, delta = delta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter extremes in DQM\n", "Sometimes a simulation can have a very large precipitation extreme that was completely unseen in the training data. If it is an order of magnitude (or more) larger than the last quantile of the reference period. It might be reasonable to leave it as is, rather than adjusting it based on a factor learned for a smaller amount of rain. This can help avoid having very large and unrealistic adjusted values.\n", "\n", "In the example below, our extreme 0.006 kg m-2 s-1 is larger than `max_tail_factor*last_quantile = 10*0.0001761 =0.001761 kg m-2 s-1`. When the filter is one, we keep the value unadjusted.\n", "\n", "Note that the last quantile is calculated on the values (not the anomalies) and before the adapt_freq is applied. In the adjustment step, the filtering is done after the adapt_freq step.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xsdba\n", "from xclim.testing import open_dataset\n", "import xclim as xc\n", "\n", "dref = (\n", " open_dataset(\"sdba/ahccd_1950-2013.nc\")\n", " .convert_calendar(\"noleap\")\n", " .sel(time=slice(\"1981\", \"2010\"))\n", ").pr\n", "dref=xc.core.units.convert_units_to(dref, 'kg m-2 s-1', context='hydro')\n", "dsim = open_dataset(\"sdba/CanESM2_1950-2100.nc\").pr\n", "dhist=dsim.sel(time=slice('1981','2010'))\n", "# add an extreme in dsim\n", "dsim.loc[\"2050-01-01\", 'Vancouver'] = 0.006\n", "\n", "print('without filter')\n", "\n", "dqm = xsdba.DetrendedQuantileMapping.train(\n", " dref,\n", " dhist,\n", " kind=\"*\",\n", " group='time',\n", " adapt_freq_thresh='1.157 kg m-2 s-1'\n", ")\n", "dscen = dqm.adjust(dsim)\n", "print(dscen.sel(location='Vancouver', time=\"2050-01-01\").values)\n", "\n", "\n", "\n", "\n", "\n", "\n", "dqm = xsdba.DetrendedQuantileMapping.train(\n", " dref,\n", " dhist,\n", " kind=\"*\",\n", " group='time',\n", " adapt_freq_thresh='1.157 kg m-2 s-1',\n", " max_tail_factor=10\n", ")\n", "print('last quantile')\n", "print(dqm.ds.hist_q_raw.sel(location='Vancouver').isel(quantiles=-1).values)\n", "print('with filter')\n", "dscen = dqm.adjust(dsim)\n", "print(dscen.sel(location='Vancouver', time=\"2050-01-01\").values)" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 4 }