ISIMIP Quality Checking Tool for new output data


Posted by Martin Park on Oct. 21, 2020

Dear ISIMIP Modellers,

In anticipation of your new simulations, we have developed a quality checking tool [3] that allows us to test your newly generated files against the definitions, patterns and schemas from our machine-readable protocols for ISIMIP2 [1] and ISIMIP3 [2]. We hope that this will substantially improve the quality of the growing ISIMIP data archive and make it easier for you to submit well-prepared files. The idea is that you apply the tool to your own files before submitting them to DKRZ, which lowers the chance of time-consuming email conversations about inconsistencies in the submitted NetCDF files. However, if you are not able to use it, we will of course proceed as before and test the files ourselves.

The tool mainly checks the NetCDF headers, including the dimension variables, the data variable and the requested global attributes, and verifies that the number of time steps is consistent with the specifiers given in the file name.
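To illustrate the time-step check, here is a minimal sketch and not the tool's actual implementation. It assumes daily resolution, the proleptic Gregorian calendar, and file names ending in "_<startyear>_<endyear>.nc". The function names and the pattern are hypothetical.

```python
import re
from datetime import date

# Hypothetical pattern: file names ending in "_<startyear>_<endyear>.nc".
YEAR_RANGE = re.compile(r"_(\d{4})_(\d{4})\.nc$")


def expected_daily_steps(file_name):
    """Return the number of days covered by the year range in file_name.

    Assumes daily resolution and the proleptic Gregorian calendar.
    """
    match = YEAR_RANGE.search(file_name)
    if match is None:
        raise ValueError(f"no year range found in {file_name!r}")
    start_year, end_year = int(match.group(1)), int(match.group(2))
    if end_year < start_year:
        raise ValueError("end year precedes start year")
    # Days from 1 January of the start year up to and including
    # 31 December of the end year.
    return (date(end_year + 1, 1, 1) - date(start_year, 1, 1)).days


def time_steps_match(file_name, n_time_steps):
    """Compare the length of the time dimension with the expected count."""
    return n_time_steps == expected_daily_steps(file_name)
```

For a file covering 2011 to 2020, the expected count includes the three leap days of 2012, 2016 and 2020. Files using a 365-day or 360-day calendar, or a monthly or annual resolution, would need a different count.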

The new tool is also able to test whether all values of a variable are valid, based on valid-value ranges that still need to be added to the ISIMIP3 protocol. For example, the tool will be able to find negative values where there should not be any and catch outliers. We will approach the sectoral coordinators soon to organize the process of collecting these ranges.
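A range check of this kind could look roughly like the sketch below. It is an illustration only: the function name, the example limits and the fill value of 1e20 are assumptions, and the actual valid ranges are still to be collected.

```python
import math


def find_invalid_values(values, valid_min, valid_max, fill_value=1e20):
    """Return (index, value) pairs that lie outside [valid_min, valid_max].

    Fill values and NaN are treated as missing data and skipped.
    The default fill value is an assumption made for this sketch.
    """
    invalid = []
    for index, value in enumerate(values):
        if value == fill_value or math.isnan(value):
            continue  # missing data is not a range violation
        if not valid_min <= value <= valid_max:
            invalid.append((index, value))
    return invalid


# Hypothetical limits for a non-negative variable.
sample = [0.0, 2.5, -0.1, 1e20, 900.0]
print(find_invalid_values(sample, valid_min=0.0, valid_max=500.0))
```

With these made-up limits, the negative value and the outlier are reported together with their positions, while the fill value is ignored.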

Currently the tool is not able to check non-global files and some special cases like the time unit "growing season since ..." in the agricultural sector. Also, the tool is not able to fix wrong data types of variables.

The tool has already been tested on a substantial part of the ISIMIP2b data. However, we do not yet expect it to work in all cases. If you experience crashes or unexpected behaviour, please let us know by filing an issue on GitHub [3] or by writing an email to isimip-data@pik-potsdam.de. Please include test files that reproduce the issue, as this will help us identify bugs. Your input is very much appreciated and will help stabilize the code for all of us. You will find some more technical aspects at the end of this e-mail.

Best regards,

The ISIMIP Data Team

--

[1] https://github.com/ISI-MIP/isimip-protocol-2 (this is only a bare collection of all ISIMIP2b definitions, patterns and schemas in JSON format without any experiment descriptions)

[2] https://github.com/ISI-MIP/isimip-protocol-3

[3] https://github.com/ISI-MIP/isimip-qc

--

Some technical aspects:

The tool runs on Windows, macOS and Linux machines with Python 3.7 or later installed. Please find installation and usage instructions on the GitHub page [3]. The only mandatory argument is the schema_path, e.g. ISIMIP3b/OutputData/water_global. This is not a directory in your file system. It identifies the schema for the protocol and sector your data were generated for, and the files are checked against that schema.

The path of the files to be checked is given by the optional argument --unchecked-path. If it is not given, the checks are performed on the files below your current working directory. The --fix option will try to repair wrongly set attributes, or add missing ones, in the NetCDF header directly in your original data. The --minmax option will come into play once we have collected and implemented the valid range limits. The default log level is WARN, but it can be lowered with --log-level INFO to also be notified about successful tests. Logs can also be written to one log file per input file by passing the --log-path option along with a valid directory, below which the subdirectory structure of the input path is recreated.
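The way a log directory can mirror the structure of the input path is sketched below. This is an assumption-laden illustration, not the tool's code: the function name, the example paths and the ".log" file extension are hypothetical.

```python
from pathlib import Path


def mirrored_log_path(input_file, unchecked_path, log_path):
    """Map an input file to a log file under log_path.

    The subdirectories below unchecked_path are reproduced below
    log_path. The ".log" suffix is an assumption for this sketch.
    """
    relative = Path(input_file).relative_to(unchecked_path)
    return Path(log_path) / relative.with_suffix(".log")


# Hypothetical paths, for illustration only.
print(mirrored_log_path(
    "/data/unchecked/water_global/model/file.nc",
    "/data/unchecked",
    "/data/logs",
))
```

Here a file found two levels below the input directory gets its log file two levels below the log directory, so logs are easy to match to the files they describe.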