Lessons Learned from NARCCAP About Data Archiving and Quality Control
Author: Seth McGinnis
I am the Data Manager for NARCCAP. It has been my responsibility to coordinate the submission of model output from multiple sources, to quality-check it, and to publish it through the ESG data portal. Overall, our archiving workflow was pretty good, but there were some pitfalls, and I had to construct the QC process on the fly. I learned a lot, and I want to share that hard-won knowledge with the community. The NARCCAP specification documents are a good starting point for organizing the output from a collaborative climate modeling project. In particular, the choice to archive data in netCDF format conforming to the CF metadata standard was absolutely key. This document builds on those specs and is organized around one question: if we were going to do it all over, knowing what I know now, what would we do differently?
The only way to deal with these volumes of data is to automate everything possible in the QC and publishing process. Ideally, the automated steps can be exposed to the modelers, taking the QC team out of the loop entirely and speeding up the refinement process.
The more uniform the files are, the easier it is to automate QC and analysis. Capturing the details of the different source models already requires enough variation that you want to eliminate every other iota of difference that you possibly can. If some piece of data or metadata is common to more than one file, it should be identical in absolutely every instance.
The CF metadata standard solves a whole lot of interoperability problems, but not if you're sloppy about it. Data must conform stringently to the CF standard. This means that someone on the QC team will need to develop a thorough understanding of the CF spec.
Aim for a one-to-one correspondence between files and output fields. Splitting variables across files creates opportunities for inconsistency, while including multiple variables is confusing and muddles the metadata. Subsetting the data into manageable pieces should be a function handled by data portal software, not a structural feature of the archive itself.
Post-processing, QC, and publication are not a single-pass operation. There will be errors that will not be detected until after the data has been published. The odds are very good that any given modeling group will need to re-run at least one of its simulations, and possibly all of them. Keep this in mind while planning the high-level organization of the project.
Support for Non-Specialist End-Users
Users who aren't climate scientists (e.g., users doing impacts analysis) get a lot of value from this data, but don't have the necessary knowledge or expertise to use the raw model data. They need the data in user-friendly formats in addition to netCDF, and with familiar units, and, in some cases, with data analytic processing like bias correction already applied.
There are two main reasons for grouping variables into "tables" within the archive. For QC and publication, it's much easier to work on batches of structurally similar files than on a bunch of heterogeneous files. And for end-user convenience, it makes sense to cluster together variables with similar uses. Grouping similar variables together also lets you prioritize important outputs to make sure they get post-processed and submitted first.
The table organization from the CMIP archive is a good conceptual starting point, but there's no need to follow it too slavishly; only a minority of users will be familiar with it, and there are some significant differences in the outputs being stored.
Variables should be grouped into tables primarily by structure: time-varying data separate from static data, 2-D separate from 3-D, etc. Tables should then be ordered by priority, based on importance to research and on complexity. If data is submitted in table order, the simplest, most important variables will be archived first, and the variables that are slowest to reach the archive will be the most complicated and least important ones.
In addition, larger groupings of variables should be broken up thematically, so that each table contains no more than ten or twelve variables. This is an easily-manageable size for the QC and archiving workflow, and will make it easier for end users to find and acquire the datasets they are interested in. I would rearrange the NARCCAP tables something like this:
- Static variables*
- Primary impacts variables**
- Hydrology variables***
- Radiation variables***
- Soil variables ****
- Other 2-D variables (may need further grouping)
- 3-D state variables for further nesting
- Other 3-D variables (if any)
* These variables are all one-offs, so you need to ask for them first or you may never get them. There may be some structural inhomogeneity in this table; that's okay. This is also a good proving ground for making sure that everyone involved in the data pipeline is clear on the correct file formatting and structure.
** Daily minimum and maximum temperature need to be included in this group, which means this table is daily data. Sub-daily versions of the other variables in this table should be included elsewhere.
*** Keep in mind the fact that the variables in these tables will be used to study the balance of energy and water budgets.
**** Don't collapse soil variables to a single layer; leave them on their native levels. Include the definitions of the layers in each file as ancillary data.
Construct filenames from components in the following order. (Components with an obvious default, because they are uniform across the entire dataset, may be omitted.)
Period/scenario should be something like "future" or "historical", not "1969010103-20011203103". The filename should make it easy to identify the correct file, but does not need to act as metadata. (And indeed, the filenaming convention should be robust against minor irregularities in file content.)
- Use . not _ in filenames.
- Avoid variable names that are subsets of other variable names (e.g., pr vs prw).
- Use filename elements with consistent lengths (i.e., all RCMs have a four-letter abbreviation, all drivers have a six-letter abbreviation, etc.)
These may seem like needlessly fine details, but they have a disproportionately large impact on ease of automation.
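As a rough illustration of why these details matter for automation, here is a minimal sketch of building and parsing dot-delimited filenames. The component order, the abbreviations `wrfx` and `gcmxyz`, and the helper names are all hypothetical; the real order and vocabulary are project decisions.

```python
import fnmatch

# Hypothetical component order; the real order is a project decision.
COMPONENTS = ("variable", "rcm", "driver", "period", "version")

def build_filename(variable, rcm, driver, period, version):
    """Join components with '.' so each one is unambiguously delimited."""
    return ".".join((variable, rcm, driver, period, "v" + version)) + ".nc"

def parse_filename(name):
    """Split a filename back into its named components."""
    stem = name[:-len(".nc")] if name.endswith(".nc") else name
    # maxsplit keeps the dot inside a "major.minor" version intact.
    values = stem.split(".", len(COMPONENTS) - 1)
    return dict(zip(COMPONENTS, values))

names = [
    build_filename("pr", "wrfx", "gcmxyz", "historical", "1.0"),
    build_filename("prw", "wrfx", "gcmxyz", "historical", "1.0"),
]

# A bare prefix pattern picks up both pr and prw ...
both = fnmatch.filter(names, "pr*")
# ... while including the delimiter selects pr only.
only_pr = fnmatch.filter(names, "pr.*")
```

Because every component has a fixed position and a fixed length, scripts can select files with simple patterns instead of opening each file to inspect its metadata.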
Assume that multiple versions of each file will be published, and include version information in both the filename and the file metadata.
Split file versions into two groups: major version changes involve changes to the primary data in the file. Minor version changes are changes to metadata or ancillary data. Number versions as major.minor.
Use a version number of 0 for preliminary data. It is likely that some data will be made available before it is first published officially, and assigning it a version number of 0 will minimize confusion.
It is desirable to maintain a copy of each major version of a file in the archive. This will increase the total archive size, but perhaps not as much as one might expect, because if there is a data update, usually the old runs will be superseded before the 3-D data is published. It is not necessary, nor is it particularly desirable, to keep copies of the minor versions of a file.
Document differences between versions assiduously. In the case of minor version changes, this documentation should be sufficient to recreate any particular version of a file should the need arise. One approach to doing this would be to dump headers, coordinates, and ancillary data for each file to text and store the result in a Subversion repository.
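The versioning scheme above might be handled in scripts along these lines. This is only a sketch: it assumes that "version 0" means a major version of 0, and that the copy kept for each major version is the one with the highest minor number. The function names are hypothetical.

```python
def parse_version(text):
    """Split a 'major.minor' string into a tuple of ints, e.g. '1.2' -> (1, 2)."""
    major, minor = text.split(".")
    return int(major), int(minor)

def is_preliminary(text):
    """Assumption: preliminary data carries a major version of 0."""
    return parse_version(text)[0] == 0

def versions_to_keep(published):
    """Keep one copy per major version: the one with the highest minor number."""
    latest = {}
    for text in published:
        major, minor = parse_version(text)
        if major not in latest or minor > latest[major]:
            latest[major] = minor
    return [f"{major}.{minor}" for major, minor in sorted(latest.items())]

kept = versions_to_keep(["0.0", "1.0", "1.1", "2.0", "1.2"])
```

Comparing versions as integer pairs rather than strings avoids ordering surprises such as "1.10" sorting before "1.2".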
DEALING WITH TIME
Don't split files up by time. There should be only one file per variable per simulation. (In NARCCAP, we split variables into five-year chunks to keep file sizes below the 2 GB limit imposed by older versions of the netCDF library. This was the cause of more problems than any other single issue, and if I could retroactively change one thing about the specs, it would be this.)
Specify the simulation start date, spinup length, end of coverage period, and a rundown to the end of the simulation period. These dates should be chosen based on an evaluation of the proposed models and boundary conditions to ensure that every simulation can actually cover the entire official coverage period.
Before publication, remove the spinup and rundown periods, which may have ragged ends. Publish only data from the official coverage period. Like so:
sim start      pub start               pub end         sim end
|--- spinup ---|--- coverage period ---|--- rundown ---|
Published data must cover the entire coverage period; fill any gaps with missing_value.
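A minimal sketch of trimming a series to the coverage period and filling gaps, assuming the model output is available as a mapping from timestamp to value. The function name and the in-memory representation are hypothetical; a real workflow would operate on netCDF files.

```python
from datetime import datetime, timedelta

MISSING = 1e20  # the missing_value NARCCAP used

def publish_series(records, pub_start, pub_end, step):
    """Trim to [pub_start, pub_end) and fill absent timesteps with MISSING.

    records maps datetime -> value for whatever the model produced,
    including spinup, rundown, and possibly gaps.
    """
    times, values = [], []
    t = pub_start
    while t < pub_end:
        times.append(t)
        values.append(records.get(t, MISSING))
        t += step
    return times, values

step = timedelta(hours=3)
start = datetime(1971, 1, 1)
raw = {
    start - step: 1.0,      # spinup, dropped
    start: 2.0,
    start + 2 * step: 4.0,  # the timestep before this one is a gap
    start + 4 * step: 9.0,  # rundown, dropped
}
times, values = publish_series(raw, start, start + 4 * step, step)
```

Generating the published time axis from the specification, rather than from the model output, is what guarantees that every file covers the same period.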
Use the same base date for the time coordinate for ALL runs.
The time coordinate for averaged variables goes at the midpoint of the averaging period.
The time dimension and coordinate variable should both be named time.
The bounds dimension should be named bounds; the auxiliary coordinate variable should be named time_bnds.
Instantaneous variables should NOT have a time_bnds variable or a cell_methods attribute.
Check cell_methods syntax carefully. Some programs are very sensitive to whitespace and punctuation in this attribute. Be sure this attribute passes a CF-checker.
Use 00Z (midnight UTC) as the boundary between days, regardless of time zone, for daily aggregated data.
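The rules for the time coordinate of daily averages can be sketched as follows. The base date is hypothetical, and the code assumes a standard Gregorian calendar with timestamps in UTC; models that run on 360-day or 365-day calendars would need calendar-aware date arithmetic instead.

```python
from datetime import datetime, timedelta

BASE = datetime(1950, 1, 1)  # hypothetical base date shared by ALL runs

def days_since(t, base=BASE):
    """Convert a UTC datetime to fractional days since the base date."""
    return (t - base).total_seconds() / 86400.0

def daily_mean_axis(first_day, ndays, base=BASE):
    """Return (time, time_bnds) in days since base for daily averages.

    Each day runs from 00Z to 00Z, and each time value sits at the
    midpoint of its averaging period.
    """
    time, bnds = [], []
    for i in range(ndays):
        lo = days_since(first_day + timedelta(days=i), base)
        hi = lo + 1.0
        bnds.append((lo, hi))
        time.append((lo + hi) / 2)
    return time, bnds

time, time_bnds = daily_mean_axis(datetime(1950, 1, 2), 2)
```

Generating the axis once and reusing it as a template for every run is one way to keep the time coordinate identical across the archive.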
DEALING WITH SPACE
For rectilinear coordinate systems, the spatial dimensions and coordinate variables should be named lat and lon.
For curvilinear coordinate systems, the spatial dimensions and coordinate variables should be named x and y. There should be auxiliary 2-D coordinate variables lat and lon, and the data variable should have a coordinates attribute that lists them as (part of) its value.
Longitudes should run 0:360, NOT -180:+180. (This is the best choice for CORDEX-North America; in Europe & Africa, the other convention may be more convenient. Regardless, all files in the archive should follow the same convention.)
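Converting longitudes from the -180:+180 convention is a one-line operation; a minimal sketch (the function name is hypothetical):

```python
def to_0_360(lon):
    """Map a longitude in degrees east onto the half-open range [0, 360)."""
    return lon % 360.0

converted = [to_0_360(x) for x in (-180.0, -97.5, 0.0, 10.0, 360.0)]
```

Note that this maps both 0 and 360 to 0, so a grid that spans the full circle needs care at the wrap-around point.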
Make an enumerated list of all variables that have a height associated with them (surface air temperature at 2 meters, etc.). Record height as a scalar coordinate variable named height, and add height to the coordinates attribute of the data variable. Make sure that all other variables do not have a height coordinate.
Make sure that height is checked during QC.
Name the dummy variable that holds metadata about the map projection gridmap, regardless of the name of the map projection used.
Check the CF spec very carefully to be sure the projection metadata is correct. Load a file into ArcGIS to check it.
Leave 3-D variables on their native vertical levels. This is better for further downscaling (which is one of the major values of 3D data). Interpolation to standard levels consumes a lot of resources, can inflate data volumes, and introduces a double interpolation when the data is used for further downscaling.
Include all the vertical levels in a single file. Splitting out pressure levels was useful in NARCCAP, but is less than ideal for downscaling, much less useful on non-standardized levels, and should be handled by portal software anyway.
DEALING WITH DATA VARIABLES
Don't pack data, and don't define add_offset or scale_factor attributes on the data variable.
To conserve space, use float precision, not double, for the main data variable. For simplicity, use double for everything else. Note that map projection parameters in particular need to be defined as doubles, or else ArcGIS will fail to read them correctly.
Always set both _FillValue and missing_value attributes, and set them to the same value. NARCCAP used 1e+20, but there are good arguments for setting it to 9.96921e+36, the netCDF library's default fill value for floats.
Use units appropriate to the relevant user community instead of canonical MKS units. Record surface air temperatures with units of degC, not K. Record precipitation and hydrology variables in LWE thickness units of mm, not mass units of kg/m^2.
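The two conversions mentioned above are simple enough to sketch. The water density of 1000 kg/m^3 is the usual assumption behind liquid-water-equivalent units; the function names are hypothetical.

```python
WATER_DENSITY = 1000.0  # kg/m^3, assumed for the LWE conversion

def kelvin_to_degc(t_kelvin):
    """Convert a temperature from K to degC."""
    return t_kelvin - 273.15

def mass_to_lwe_mm(kg_per_m2):
    """Convert kg/m^2 of water to mm of liquid water equivalent."""
    metres = kg_per_m2 / WATER_DENSITY
    return metres * 1000.0
```

Under that density assumption, 1 kg/m^2 is numerically equal to 1 mm, so for precipitation the change is mostly one of labeling: the units attribute changes while the stored values do not.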
Make a list of variables that need to have a floor and/or ceiling applied to them, and apply it.
Floor precipitation at zero, not at a trace threshold. (Sub-trace amounts may be useful in bias correction.)
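Applying floors and ceilings could look something like the sketch below. The limits table is purely illustrative (which variables get which limits is a project decision), and the function name is hypothetical. Missing values are passed through untouched.

```python
MISSING = 1e20

# Illustrative limits table: variable -> (floor, ceiling); None means no limit.
LIMITS = {"pr": (0.0, None), "hurs": (0.0, 100.0)}

def apply_limits(values, floor=None, ceiling=None, missing=MISSING):
    """Clamp values to [floor, ceiling], leaving missing values alone."""
    out = []
    for v in values:
        if v == missing:
            out.append(v)
            continue
        if floor is not None and v < floor:
            v = floor
        if ceiling is not None and v > ceiling:
            v = ceiling
        out.append(v)
    return out

floor, ceiling = LIMITS["pr"]
pr = apply_limits([-1e-7, 0.0, 3e-6, MISSING], floor, ceiling)
```

Small negative precipitation values are raised to zero, while the sub-trace positive amount is kept as it is.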
Define a standard string for the long_name attribute for each variable.
Don't allow extra, non-standard attributes on variables.
Include documentation in metadata for each variable indicating how it was produced from model outputs. This may be trivial if the saved variable is a model output, in which case an original_name attribute will suffice, but there can be significant differences between models in the definition of things like snow and runoff.
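The attribute rules for data variables lend themselves to an automated check. The sketch below works on a plain dictionary of attributes; the whitelist shown is an assumption for illustration only, since the project specification defines the real one.

```python
# Hypothetical whitelist; the project specification defines the real one.
ALLOWED = {"standard_name", "long_name", "units", "_FillValue", "missing_value",
           "cell_methods", "coordinates", "grid_mapping", "original_name"}
PACKING = {"scale_factor", "add_offset"}

def check_attributes(attrs):
    """Return a list of problems found in one data variable's attributes."""
    names = set(attrs)
    problems = []
    for name in sorted(names & PACKING):
        problems.append(f"packing attribute present: {name}")
    for name in sorted(names - ALLOWED - PACKING):
        problems.append(f"non-standard attribute: {name}")
    fill = attrs.get("_FillValue")
    miss = attrs.get("missing_value")
    if fill is None or miss is None:
        problems.append("both _FillValue and missing_value must be set")
    elif fill != miss:
        problems.append("_FillValue and missing_value differ")
    return problems

good = {"long_name": "Precipitation", "units": "mm",
        "_FillValue": 1e20, "missing_value": 1e20}
bad = dict(good, scale_factor=0.01, comment="x", missing_value=-999.0)
```

A check like this can be handed to the modelers so that problems are caught before submission.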
coordinates (if appropriate)
Forbidden Attributes: Nearly everything else, in particular
QUALITY CONTROL PROCEDURES
Get a copy of Rosalyn Hatcher's cf-checker program and run it on each file.
To check metadata, dump the headers and compare them to a template file. A file fails QC if the header does not match the template exactly. Corollary: someone will need to create a template header file for each variable, and provide these files to the modelers.
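A minimal sketch of the header comparison, assuming the header has already been dumped to text (for example with `ncdump -h`). The function name is hypothetical. One assumption is that the first line of the dump can be skipped, because it embeds the file name; any attribute that legitimately varies from file to file would need similar handling.

```python
import difflib

def header_diff(header_text, template_text, skip_first_line=True):
    """Return unified-diff lines; an empty list means the header passes QC."""
    header = header_text.splitlines()
    template = template_text.splitlines()
    if skip_first_line:
        # The first line of an ncdump header is "netcdf <name> {", which
        # differs for every file, so it is assumed safe to ignore.
        header, template = header[1:], template[1:]
    return list(difflib.unified_diff(template, header,
                                     "template", "file", lineterm=""))

def passes_header_qc(header_text, template_text):
    """A file passes only if its header matches the template exactly."""
    return not header_diff(header_text, template_text)
```

Returning the diff rather than a bare pass/fail gives the modeler something concrete to fix.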
To check time, extract the time variable (and time_bnds, if applicable), delete the global attributes, and dump it to text. If the result is not 100% identical to a template time file for the same period, the file fails QC.
Dump any coordinate variables that are constant across models/experiments (e.g., height, time_bnds) to file and make sure they're identical in every instance. Corollary: set things up so that as many things as possible are identical over each dataset.
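One way to check that dumped coordinate variables are identical in every instance is to hash each dump and group files by digest. This sketch assumes the dumps are already in memory as text, keyed by file name; the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def group_by_content(dumps):
    """Group file names by the content of their text dump.

    dumps maps file name -> text dump of a coordinate variable.
    Returns a list of groups, largest first, one per distinct content.
    """
    groups = defaultdict(list)
    for name, text in sorted(dumps.items()):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return sorted(groups.values(), key=len, reverse=True)

def outliers(dumps):
    """Files whose dump differs from the most common version."""
    groups = group_by_content(dumps)
    return sorted(name for group in groups[1:] for name in group)

dumps = {"a.nc": "height = 2 ;", "b.nc": "height = 2 ;", "c.nc": "height = 10 ;"}
suspect = outliers(dumps)
```

The most common version is treated as the reference here only for convenience; comparing against a known-good template is stricter.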
Apply min, max, and mean operators over every dimension and combination of dimensions and plot the results. Check that the resulting values are reasonable. (This step requires a human to look at the plots.) Publish the results on a website as part of the submission process. Do the same with 1-D and 2-D transects through the middle of the domain along each dimension.
Plot a histogram of values for the whole file, plus the absolute minimum and maximum value in the file, and check that the results are reasonable.
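The numbers behind such a histogram can be computed without any plotting library. The sketch below is a simplified, hypothetical helper operating on a flat list of values; bin edges are derived from the data itself rather than from a preset range.

```python
MISSING = 1e20

def summarize(values, nbins=10, missing=MISSING):
    """Return min, max, mean, and histogram counts, ignoring missing values."""
    valid = [v for v in values if v != missing]
    if not valid:
        raise ValueError("no valid data")
    lo, hi = min(valid), max(valid)
    width = (hi - lo) / nbins or 1.0  # avoid zero width for constant fields
    counts = [0] * nbins
    for v in valid:
        index = min(int((v - lo) / width), nbins - 1)  # max goes in the last bin
        counts[index] += 1
    return {"min": lo, "max": hi,
            "mean": sum(valid) / len(valid), "counts": counts}

stats = summarize([0.0, 1.0, 2.0, 3.0, 4.0, MISSING], nbins=4)
```

Gross errors such as wrong units or unmasked fill values tend to show up immediately in the minimum and maximum.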
When plotting data, use ranges based on the data values. (I.e., let the plotting software pick the range automatically.) Plotting the data using a pre-specified range can mask errors.
Verify the map projection metadata and check the contents of the static variables by loading them into ArcGIS.
The plan for NARCCAP was to use CMOR to standardize output data. This didn't work because it was too much work to port the CMOR code to interface with the different regional climate models, so nobody did it.
Contributors should commit to:
- spending at least as much time on post-processing as on simulation
- running the entire simulation suite twice*
* This is not hyperbole. In NARCCAP, 6 out of 8 groups had to do (or should have done) re-runs. The average re-run size was about half of the entire set of runs from that group, but there were cases that were more extensive.
Use a professional bug-tracking system to keep track of data submission, QC, and publication.
Use version control on all the scripts you develop for QC/publication workflow.
Discuss and reach consensus on how lake temperatures will be specified before any modeling starts. If an RCM has a lake model component, be sure to turn it on.
Pay extra attention to snow in the model output; it is a very important variable for impacts that often gets short shrift in model analysis.
Electronic data transfer is highly convenient when things go well, but if they don't, don't underestimate the bandwidth of a hard drive shipped via FedEx (or carried in luggage to a meeting). If you are mailing hard drives back and forth, be sure they're formatted in a way that the recipient can read them.
End users will need to aggregate, subset, summarize, and, in some cases, interpolate the data. These operations sometimes involve subtleties that non-specialists will not have the necessary disciplinary background to handle properly. Ideally, the archive's data portal should provide or enable data services to handle these operations and insulate users from the tricky details, but data providers should be prepared to make their expertise available by creating how-to documents, software, and/or derived data products that deal with the problems correctly.
The following recommendations are somewhat "down in the weeds", but are worth mentioning nonetheless.
Variables to Include
Save geopotential height (zg). Calculating it from the hypsometric equation gives poor results, and is a waste of resources when zg is already available from most models.
Save surface relative humidity (hurs) in addition to specific humidity.
Save snow-water-equivalent (swe); snow depth (snd) is secondary and optional. The terms "snow depth" and "snow amount" can be the source of some confusion, so it is important to save the more useful variable (swe) under an unambiguous name.
Save soil temperature; it is needed for further downscaling.
Save soil moisture and temperature on their native vertical levels, not integrated over the column.
Variables to Omit
Don't save surface wind stresses (tauu and tauv); they are used for ocean-atmosphere interactions, and have little relevance to regional climate model end uses.
Don't save cloud ice and cloud water (cli and clw) as 3-D variables; they don't provide enough value to warrant the data volume. If more cloud variables are needed than just 2-D coverage (clt), adding depth and cloud base height should be sufficient, or perhaps clt at low, medium, and high levels.
Rotate all winds to true N-S and E-W directions.
Relocate/interpolate 2-D surface winds (uas and vas) to the same gridpoints used for all the other surface variables.
Keep 3-D winds on their native Arakawa grids, but be sure to note clearly in metadata which grid that is.
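Rotating grid-relative winds to true directions is a plain 2-D rotation at each gridpoint. The sketch below assumes the rotation angle is defined as the counterclockwise angle from true east to the grid's x-axis; models differ in how they define and store this angle, so the sign convention must be checked against each model's documentation. The function name is hypothetical.

```python
import math

def rotate_to_true(u_grid, v_grid, angle_deg):
    """Rotate grid-relative wind components to eastward/northward.

    angle_deg is assumed to be the angle from true east to the grid's
    x-axis, measured counterclockwise, at this gridpoint.
    """
    a = math.radians(angle_deg)
    u_east = u_grid * math.cos(a) - v_grid * math.sin(a)
    v_north = u_grid * math.sin(a) + v_grid * math.cos(a)
    return u_east, v_north

u, v = rotate_to_true(3.0, 4.0, 30.0)
```

A useful QC check is that the rotation leaves wind speed unchanged at every point.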
Make sure that the land-sea mask has been applied to sea-ice.
It is not necessary to have separate files for 2-D lat and lon, as they are included in every other file, but it is useful during QC to have a file that contains a pristine copy of lat, lon, the map projection information, and the non-varying global metadata. (This can be used as a template to check against, and can be pasted into other files to correct errors.)
Note that the variable landtyp will likely be irregular; there are enough differences in how different land models represent land use that the files from different RCMs may not match up in standard_name or dimensionality. This file may also have a time-varying component. It may require some ad-hoc formatting decisions to fill in holes in the CF spec.
Represent sftlf (land-sea mask) as numeric 0:1, not 1/missing. Modelers should make sure they submit the land-sea mask, not the open-water/not-water mask, which will have sea ice as part of the "land".
Don't mask out elevation (orog) over oceans. Give it a value of 0 if it is unspecified in the model.