A dataset is a partitioned collection of files. To create a dataset, the files must be scanned to produce a text representation of the dataset. CDMS represents a dataset as an ASCII metafile written in the CDML markup language. The metafile contains all of the dataset's metadata, together with information describing how the dataset is partitioned into files. (Note: CDMS also provides a direct interface to individual files. It is not necessary to scan an individual file in order to access it.)
For CDMS applications to work correctly, it is important that the CDML metafile be valid. The cdscan utility generates a metafile from a collection of data files.
CDMS assumes that there is some regularity in how datasets are partitioned:
Otherwise, there is considerable flexibility in how a dataset can be partitioned:
The syntax of the cdscan command is:
cdscan [options] file1 file2 ...
cdscan [options] -f file_list
Output is written to standard output by default. Use the -x option to specify an output filename. For example:
cdscan -c noleap -d test -x test.xml [uv]*.nc
cdscan -d pcmdi_6h -i 0.25 -r 'days since 1979-1-1' *6h*.ctl
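When cdscan is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of CDMS: the helper name build_cdscan_command and the file names are hypothetical, and only the -x and -f options described above are handled directly. Any other options are passed through unchanged. The sketch builds the argument list but does not run cdscan.

```python
import shlex


def build_cdscan_command(files=None, file_list=None, output=None, extra_options=()):
    """Assemble a cdscan argument list (hypothetical helper, not part of CDMS).

    Exactly one of `files` (explicit data file names) or `file_list`
    (a file naming the data files, passed with -f) must be given.
    """
    if (files is None) == (file_list is None):
        raise ValueError("give exactly one of files or file_list")
    cmd = ["cdscan", *extra_options]
    if output is not None:
        cmd += ["-x", output]      # write the metafile here instead of stdout
    if file_list is not None:
        cmd += ["-f", file_list]   # second syntax form: cdscan [options] -f file_list
    else:
        cmd += list(files)         # first syntax form: cdscan [options] file1 file2 ...
    return cmd


# Hypothetical file names, used for illustration only.
cmd = build_cdscan_command(
    files=["u_2000.nc", "v_2000.nc"],
    output="test.xml",
    extra_options=["-d", "test"],
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a system where cdscan is installed. Keeping the arguments as a list avoids shell quoting problems with unusual file names.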
A problem can occur when variables in different files are defined on different grids but the axis names are the same. CDMS requires that axis and variable IDs (names) be unique within a dataset, so two different longitude axes, for example, cannot both keep the same ID. The solution is to allow the CDMS ID of an object to differ from its name in the file.
If a variable or axis has a CDMS ID that differs from its name in the file, it is said to have an alias. The actual name of the object in the file is stored in the attribute name_in_file. cdscan uses this mechanism (with the -a and -s options) to resolve name conflicts: a new axis or variable ID is generated, and name_in_file is set to the name of the axis or variable in the file.
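The alias mechanism can be illustrated with a small sketch. This is not cdscan's actual algorithm: the function resolve_axis_ids, the numeric-suffix scheme for generated IDs, and the sample coordinate values are all assumptions made for illustration. It only shows the idea that axes with the same name but different definitions receive distinct IDs, while name_in_file keeps the original name.

```python
def resolve_axis_ids(axes):
    """Assign a unique ID to each distinct axis definition.

    `axes` is an iterable of (name_in_file, coordinate_values) pairs, as
    they might be collected while scanning several files. Identical
    definitions are merged. A definition whose name is already taken
    gets a generated ID (suffix scheme is an assumption, not cdscan's).
    """
    resolved = []   # one entry per distinct axis definition, in first-seen order
    seen = set()    # (name, values) definitions already handled
    used = set()    # IDs already assigned within the dataset
    for name, values in axes:
        key = (name, tuple(values))
        if key in seen:
            continue  # same axis seen again in another file
        seen.add(key)
        axis_id, n = name, 0
        while axis_id in used:      # name conflict: generate a new ID
            n += 1
            axis_id = f"{name}_{n}"
        used.add(axis_id)
        resolved.append(
            {"id": axis_id, "name_in_file": name, "values": tuple(values)}
        )
    return resolved


# Two files share one longitude axis; a third defines a coarser one.
scanned = [
    ("longitude", (0.0, 2.5, 5.0)),
    ("longitude", (0.0, 2.5, 5.0)),
    ("longitude", (0.0, 5.0)),
]
for axis in resolve_axis_ids(scanned):
    is_alias = axis["id"] != axis["name_in_file"]
    print(axis["id"], axis["name_in_file"], "alias" if is_alias else "")
```

In this sketch an axis is an alias exactly when its id differs from its name_in_file, which matches the definition given above.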
Name aliases can also be used to enforce naming standards. When data is received from an outside organization, existing applications may not recognize its variable names. It is often simpler and safer to add an alias to the metafile than to rewrite the data.
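Adding an alias by editing the metafile might look like the following sketch. It assumes that each variable appears in the metafile as an XML element named variable with an id attribute. That layout is an assumption here and should be checked against a real cdscan-generated metafile. The sample fragment, the helper add_alias, and the names TMP2M and tas are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy fragment standing in for a metafile (assumed layout; real output is richer).
SAMPLE = """<dataset id="test">
  <variable id="TMP2M" units="K"/>
</dataset>"""


def add_alias(root, current_id, new_id, tag="variable"):
    """Give the element with id `current_id` the new ID `new_id`.

    The original name is recorded in name_in_file. If the element already
    has a name_in_file attribute, that value is kept, since it is the
    name actually stored in the data file.
    """
    for elem in root.iter(tag):
        if elem.get("id") == current_id:
            elem.set("name_in_file", elem.get("name_in_file", current_id))
            elem.set("id", new_id)
            return elem
    raise KeyError(f"no <{tag}> element with id {current_id!r}")


root = ET.fromstring(SAMPLE)
add_alias(root, "TMP2M", "tas")
print(ET.tostring(root, encoding="unicode"))
```

The data files themselves are untouched. Only the ID seen by applications changes, while name_in_file still points to the name stored in the file.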