Notebooks for data registering, control and pre-processing¶
These notebooks are stored in the directory select_data_versions.
Creating a data versions dictionary¶
Notebook data_selection determines, for each variable and each experiment of interest, which models, variants and versions are available at the host computing/data center.
This is in line with what is described in CAMMAC principles for managing multi-model data.
It provides the computing notebooks with a list of all available datasets, from which they can build an ensemble. The analysis includes checking that the data period is consistent with the definition of the experiment (or with a minimum duration for the control experiment).
The result is stored as a JSON file; such a file is provided with the software (see Reference data versions dictionary). It is named after the pattern Data_versions_selection_<some_tag>.json, and is actually a dictionary of dictionaries of …, organized as follows:
>>> data_versions[experiment][variable][table][model][variant]=(grid,version,data_period)
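As a minimal sketch of how such a nested dictionary can be traversed to list candidate datasets for an ensemble, consider the example below. The model names, versions, period format and the helper `list_datasets` are hypothetical and only mimic the layout shown above; they are not taken from CAMMAC. Note that a dictionary reloaded from JSON holds lists rather than tuples as leaf values.

```python
# Hypothetical content, following the layout
# data_versions[experiment][variable][table][model][variant] = (grid, version, data_period)
data_versions = {
    "ssp585": {
        "pr": {
            "Amon": {
                "MODEL-A": {
                    "r1i1p1f1": ("gn", "v20190101", "2015-2100"),
                    "r2i1p1f1": ("gn", "v20190101", "2015-2100"),
                },
                "MODEL-B": {
                    "r1i1p1f1": ("gr", "v20200315", "2015-2100"),
                },
            }
        }
    }
}


def list_datasets(versions, experiment, variable, table):
    """Return (model, variant, grid, version, data_period) tuples,
    sorted by model then variant; empty list if nothing is registered."""
    models = versions.get(experiment, {}).get(variable, {}).get(table, {})
    datasets = []
    for model in sorted(models):
        for variant in sorted(models[model]):
            grid, version, period = models[model][variant]
            datasets.append((model, variant, grid, version, period))
    return datasets


ensemble = list_datasets(data_versions, "ssp585", "pr", "Amon")
for entry in ensemble:
    print(entry)
```

Looking up an experiment that is absent from the dictionary simply yields an empty list, which lets a calling notebook decide how to handle missing data.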
In its present version, and only for performance reasons, the notebook code depends slightly on the data organization used on the ESPRI platform; however, its data inspection mechanism mainly relies on CliMAF data management and should work anywhere after a slight adaptation (in the first few cells: search for ‘/bdd’).
Creating a ‘hand-made’ data versions dictionary¶
Notebook handmade_data_selection is an example of how to create a data versions dictionary in a hard-coded mode. However, that example is based on an old structure of such dictionaries and would have to be slightly reworked to match the current structure.
Checking ESGF errata¶
Notebook Check_errata is intended to automatically verify a subset of the datasets registered in a data versions dictionary against the ESGF errata system.
It uses a service point of this system which, at the time of CAMMAC development, was not yet fully stabilized; it may have changed since then, which would break the notebook's logic. The errata system provides messages which have to be interpreted manually. To help with that, the notebook organizes its output (in printed and JSON format) by grouping the error messages by variable, then by severity, then by error message text.
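The grouping described above can be sketched as follows. This is only an illustration: the record fields (`variable`, `severity`, `message`, `dataset`), the sample values and the function `group_issues` are hypothetical, and do not reflect the actual response format of the ESGF errata service.

```python
import json
from collections import defaultdict

# Hypothetical errata records, one per affected dataset
issues = [
    {"variable": "pr", "severity": "high", "message": "Wrong units",
     "dataset": "MODEL-A.r1i1p1f1"},
    {"variable": "pr", "severity": "high", "message": "Wrong units",
     "dataset": "MODEL-B.r1i1p1f1"},
    {"variable": "pr", "severity": "low", "message": "Metadata typo",
     "dataset": "MODEL-A.r1i1p1f1"},
    {"variable": "tas", "severity": "high", "message": "Missing time steps",
     "dataset": "MODEL-C.r1i1p1f1"},
]


def group_issues(records):
    """Group dataset names by variable, then severity, then message text."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        grouped[rec["variable"]][rec["severity"]][rec["message"]].append(rec["dataset"])
    return grouped


grouped = group_issues(issues)
# The same nested structure serves both printed and JSON output
print(json.dumps(grouped, indent=2, sort_keys=True))
```

Grouping by message text last keeps identical messages together, so each distinct message only needs to be read and interpreted once.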
Checking data ranges¶
Notebook Check_ranges prints user-chosen field statistics of user-chosen time statistics for a series of variables and experiments.
It makes it possible to detect, for example, models which do not use the common units. The field and time statistics are specified in CDO argument syntax, so complex operations can be built using the CDO operator piping syntax.
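To illustrate CDO operator piping, the sketch below assembles a command line that applies a field statistic to a time statistic. The helper `cdo_command`, the file names and the choice of operators are assumptions made for this example; the way the notebook actually passes these settings to CDO is not shown here. In a CDO chain, the rightmost operator is applied first.

```python
def cdo_command(field_ops, time_ops, infile, outfile):
    """Build a CDO command as an argument list.

    field_ops and time_ops are lists of CDO operators, outermost first;
    the time operators are applied to the data before the field operators.
    """
    operators = ["-" + op for op in list(field_ops) + list(time_ops)]
    return ["cdo"] + operators + [infile, outfile]


# Field maximum of the time mean of precipitation scaled by 86400
# (a scaling chosen here only as an example of an operator with an argument)
cmd = cdo_command(["fldmax"], ["timmean", "mulc,86400"], "pr_in.nc", "pr_stat.nc")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` on a system where CDO is installed; the sketch itself only builds the command and does not run it.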
Checking that data available on ESGF is locally available¶
Notebook Check_errata queries the ESGF for the latest version of the data for a series of variables and experiments, and checks its availability on the local file system.
At the time of writing, this notebook is tuned for the ESPRI computing system and makes use of the file hierarchy known on this system. It can be run automatically, e.g. using job datasets_stat.sh. It prints its results and sends them to a list of email addresses.
Pre-processing data for generating derived variables¶
Notebook create_derived_variable creates data files for some derived variables, for a selection of experiments (using the datasets described in a data versions dictionary). These derived variables are defined using the CDO operator piping syntax and can have any frequency.
Note
There are other, on-the-fly, ways to create derived variables; see Variable derivation
The default settings derive the annual number of dry days and the average daily rain amount (for non-dry days) from the daily precipitation data.
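As a plain-Python illustration of these two quantities (the notebook itself defines them through CDO operators), the sketch below counts dry days and averages precipitation over the remaining days. The 1 mm/day threshold, the function name and the sample values are assumptions made for this example, not CAMMAC settings.

```python
DRY_THRESHOLD = 1.0  # mm/day; assumed threshold below which a day counts as dry


def dry_day_stats(daily_precip, threshold=DRY_THRESHOLD):
    """Return (number of dry days, mean precipitation over non-dry days).

    daily_precip is a sequence of daily precipitation amounts in mm/day.
    The mean is 0.0 when there is no non-dry day.
    """
    wet = [p for p in daily_precip if p >= threshold]
    dry_days = len(daily_precip) - len(wet)
    mean_wet = sum(wet) / len(wet) if wet else 0.0
    return dry_days, mean_wet


# Hypothetical six-day sample
sample = [0.0, 0.2, 5.0, 12.0, 0.9, 3.0]
dry_days, mean_wet = dry_day_stats(sample)
print(dry_days, mean_wet)
```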
To allow incremental processing of numerous datasets, a setting makes it possible to skip the computation of derived data that already exists.
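A minimal sketch of such incremental processing, assuming that the presence of the output file is the criterion for skipping: the function `derive_if_missing`, its `overwrite` flag and the placeholder computation are hypothetical and do not reproduce the notebook's actual setting.

```python
import os
import tempfile


def derive_if_missing(outfile, compute, overwrite=False):
    """Run compute(outfile) unless outfile already exists.

    Returns True if the computation was run, False if it was skipped.
    """
    if os.path.exists(outfile) and not overwrite:
        return False
    compute(outfile)
    return True


def fake_compute(path):
    # Stand-in for the real derivation; it just writes a placeholder file
    with open(path, "w") as f:
        f.write("derived data placeholder\n")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "derived_example.nc")
first = derive_if_missing(target, fake_compute)   # file absent: computed
second = derive_if_missing(target, fake_compute)  # file present: skipped
print(first, second)
```

A check based on file existence alone cannot detect a file left incomplete by an interrupted run; a more careful version could write to a temporary name and rename on success.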
The notebook produces a version of the data versions dictionary which is extended with the description of the derived variables. It stores the output data at a location, and with a file naming convention, which are fully configurable. This information on the location and organization of derived variables can be provided to CAMMAC by a CliMAF call such as
>>> derived_variables_pattern = "/data/ssenesi/CMIP6_derived_variables/${variable}"
>>> derived_variables_pattern += "/${variable}_${table}_${model}_${experiment}_${realization}_${grid}_${version}_${PERIOD}.nc"
>>> derived_variable_table='yr'
>>> climaf.dataloc.dataloc(project='CMIP6', organization='generic', url=derived_variables_pattern, table=derived_variable_table)
This is actually what happens by default: the first three commands are included in the parameter-setting cell of the relevant notebooks, and the last one in all notebooks, as described there.
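To see what kind of file name the pattern above describes, the sketch below expands it with Python's standard `string.Template`. This is only an illustration: CliMAF performs its own pattern handling, and the facet values used here (including the variable name `dry`) are hypothetical.

```python
from string import Template

derived_variables_pattern = "/data/ssenesi/CMIP6_derived_variables/${variable}"
derived_variables_pattern += "/${variable}_${table}_${model}_${experiment}_${realization}_${grid}_${version}_${PERIOD}.nc"

# Hypothetical facet values, chosen only to show the expansion
facets = {
    "variable": "dry",
    "table": "yr",
    "model": "MODEL-A",
    "experiment": "ssp585",
    "realization": "r1i1p1f1",
    "grid": "gn",
    "version": "v20190101",
    "PERIOD": "2015-2100",
}

path = Template(derived_variables_pattern).substitute(facets)
print(path)
```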