CHAPTER 7 CDMS Utilities

cdscan: Importing datasets into CDMS

Overview

A dataset is a partitioned collection of files. To create a dataset, the files must be scanned to produce a text representation of the dataset. CDMS represents datasets as an ASCII metafile in the CDML markup language. The file contains all metadata, together with information describing how the dataset is partitioned into files. (Note: CDMS provides a direct interface to individual files as well. It is not necessary to scan an individual file in order to access it.)

For CDMS applications to work correctly, it is important that the CDML metafile be valid. The cdscan utility generates a metafile from a collection of data files.

CDMS assumes that there is some regularity in how datasets are partitioned:

Otherwise, there is considerable flexibility in how a dataset can be partitioned:

cdscan Syntax

The syntax of the cdscan command is

cdscan [options] file1 file2 ...

or

cdscan [options] -f file_list

where

Output is written to standard output by default. Use the -x option to specify an output filename.
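
When a dataset has many files, the -f form avoids a very long command line. The following is a minimal Python sketch of building such a file list, with one absolute path per line as the -f option expects. The helper name write_file_list, the glob pattern, and the list file name are illustrative assumptions and are not part of cdscan.

```python
import glob
import os


def write_file_list(pattern, list_path):
    """Write the absolute paths matching `pattern` to `list_path`, one per line.

    Hypothetical helper for illustration only; cdscan just needs the
    resulting file to hold one absolute data file name per line.
    """
    paths = sorted(os.path.abspath(p) for p in glob.glob(pattern))
    with open(list_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

For instance, after calling write_file_list("data/*.nc", "files.txt"), the list could be passed to cdscan as: cdscan -x test.xml -f files.txt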

 

cdscan command options

Option

Description

-a alias_file

Change variable names to the aliases defined in an alias file. Each line of the alias file consists of two blank-separated fields: variable_id alias. Here variable_id is the ID of the variable in the file, and alias is the name that will be substituted for it in the output dataset. Only variables with entries in the alias file are renamed.

-c calendar

Specify the dataset calendar attribute. One of "gregorian" (default), "julian", "noleap", "proleptic_gregorian", "standard", or "360_day".

-d dataset_id

String identifier of the dataset. Should not contain blanks or non-printing characters. Default: "none"

-f file_list

File containing a list of absolute data file names, one per line.

-h

Print a help message.

-i time_delta

Represent the time dimension as linear, producing a more compact representation. This is useful if the time dimension is very long. time_delta is a float or integer. For example, if the time delta is 6 hours and the reference units are 'hours since xxxx', set the time delta to 6. See the -r option. See Note 2.

-j

Scan time as a vector dimension. Time values are listed individually. Turns off the -i option.

-l levels

Specify that the files are partitioned by vertical level. That is, data for different vertical levels may appear in different files. levels is a comma-separated list of levels containing no blanks. See Note 3.

-m levelid

Name of the vertical level dimension. The default is the vertical dimension as determined by CDMS. See Note 3.

-p template

Add a file template string, for compatibility with pre-V3.0 datasets. 'cdimport -h' describes template strings.

-q

Quiet mode.

-r time_units

Time units of the form "units since yyyy-mm-dd hh:mi:ss", where units is one of "year", "month", "day", "hour", "minute", "second".

-s suffix_file

Append a suffix to variable names, depending on the directory containing the data file. This can be used to distinguish variables having the same name but generated by different models or ensemble runs. suffix_file is the name of a file describing a mapping between directories and suffixes. Each line consists of two blank-separated fields: directory suffix. Each file path is compared to the directories in the suffix file. If the file path is in that directory or a subdirectory, the corresponding suffix is appended to the variable IDs in the file. If more than one such directory is found, the first directory found is used. If no match is made, the variable IDs are not altered. Regular expressions can be used: see the example in the Notes section.

-t timeid

ID of the partitioned time dimension. The default is the name of the time dimension as determined by CDMS. See Note 1.

-x xmlfile

Output file name. By default, output is written to standard output.

Notes:

  1. Files can be in netCDF, GrADS/GRIB, HDF, or DRS format, and can be listed in any order. Most commonly, the files are the result of a single experiment, and the 'partitioned' dimension is time. The time dimension of a variable is the coordinate variable having a name that starts with 'time' or having an attribute axis='T'. If this is not the case, specify the time dimension with the -t option. The time units should be in the form supported by cdtime. If they are not (or to override them), use the -r option.
  2. By default, the time values are listed explicitly in the output XML. This can cause a problem if the time dimension is very long, say for 6-hourly data. To handle this, use the form 'cdscan -i delta <files>'. This generates a compact time representation of the form <start, length, delta>. An exception is raised if the time dimension for a given file is not linear.
  3. Another form of the command is 'cdscan -l lev1,lev2,..,levn <files>'. This asserts that the dataset is partitioned in both the time and vertical level dimensions. The level dimension of a variable is the dimension having a name that starts with 'lev', or having an attribute axis='Z'. If this is not the case, set the level name with the -m option.
  4. An example of a suffix file:

    /exp/pr/ncar-a _ncar-a
    /exp/pr/ecm-a _ecm-a
    /exp/ta/ncar-a _ncar-a
    /exp/ta/ecm-a _ecm-a

    For all files in directory /exp/pr/ncar-a or a subdirectory, the suffix '_ncar-a' is appended to the corresponding variable IDs. Regular expressions can be used, as defined in the Python 're' module. For example, the previous suffix file can be replaced with the single line:

    /exp/[^/]*/([^/]*) _\g<1>

    Note the use of parentheses to delimit a group. The syntax \g<n> refers to the n-th group matched in the regular expression, with the first group being n=1. The string [^/]* matches any sequence of characters other than a forward slash.
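
The regular-expression form in Note 4 can be tried out with the Python 're' module directly. The sketch below is only an illustration of the matching rule as described above (first matching directory wins, no match leaves the ID unchanged); the function name suffix_for and the sample file paths are assumptions, and this is not cdscan's own implementation.

```python
import re


def suffix_for(path, rules):
    """Return the suffix for `path` from the first matching rule, else ''.

    `rules` is a list of (directory_pattern, suffix_template) pairs in the
    order they appear in the suffix file. Matching is anchored at the start
    of the path, so a file in a subdirectory of the pattern also matches.
    Illustrative sketch only.
    """
    for pattern, template in rules:
        match = re.match(pattern, path)
        if match:
            # expand() substitutes group references such as \g<1>.
            return match.expand(template)
    return ""


# The single-line suffix file from Note 4, as one (pattern, template) pair.
rules = [(r"/exp/[^/]*/([^/]*)", r"_\g<1>")]
print(suffix_for("/exp/pr/ncar-a/pr_1979.nc", rules))  # prints "_ncar-a"
print(suffix_for("/exp/ta/ecm-a/run1/ta.nc", rules))   # prints "_ecm-a"
```
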

Examples

cdscan -c noleap -d test -x test.xml [uv]*.nc

cdscan -d pcmdi_6h -i 0.25 -r 'days since 1979-1-1' *6h*.ctl
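
In the second example, a delta of 0.25 with units of days corresponds to 6-hourly data. The compact <start, length, delta> representation described in Note 2 can be sketched in Python as follows. The function name linear_time and the tolerance parameter are assumptions made for illustration; the sketch mirrors the documented behaviour that a non-linear time dimension is an error, but it is not the cdscan implementation.

```python
def linear_time(values, tolerance=1e-9):
    """Return (start, length, delta) if `values` are evenly spaced.

    Raises ValueError otherwise. Illustrative sketch only; the name and
    the tolerance are assumptions.
    """
    if len(values) < 2:
        raise ValueError("need at least two time values")
    delta = values[1] - values[0]
    for previous, current in zip(values, values[1:]):
        if abs((current - previous) - delta) > tolerance:
            raise ValueError("time dimension is not linear")
    return values[0], len(values), delta


# Four 6-hourly steps expressed in days since the reference time.
print(linear_time([0.0, 0.25, 0.5, 0.75]))  # prints (0.0, 4, 0.25)
```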

File Formats

Data may be represented in a variety of self-describing binary file formats, including netCDF, GrADS/GRIB, HDF, and DRS (see Note 1 above).

Name Aliasing

A problem can occur if variables in different files are defined on different grids but the axis names are the same. CDMS requires that, within a dataset, axis and variable IDs (names) be unique. What should the longitude axes be named in CDMS to ensure uniqueness? The answer is to allow a CDMS ID to differ from the name of the object in the file.

If a variable or axis has a CDMS ID that differs from its name in the file, it is said to have an alias. The actual name of the object in the file is stored in the attribute name_in_file. cdscan uses this mechanism (with the -a and -s options) to resolve name conflicts: a new axis or variable ID is generated, and name_in_file is set to the original axis or variable name in the file.

Name aliases also can be used to enforce naming standards. For data received from an outside organization, variable names may not be recognized by existing applications. Often it is simpler and safer to add an alias to the metafile rather than rewrite the data.
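
As a rough illustration of the aliasing idea, the following Python sketch parses alias-file text (two blank-separated fields per line, as described for the -a option) and pairs each new ID with the original name. The function names, the variable names tas, pr, and clt, and the aliases are all hypothetical, and recording name_in_file for every variable is a simplification made here, not a statement about how cdscan writes the metafile.

```python
def parse_alias_file(text):
    """Map variable_id -> alias from alias-file text (two fields per line)."""
    aliases = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            variable_id, alias = fields
            aliases[variable_id] = alias
    return aliases


def apply_aliases(names_in_file, aliases):
    """Return one record per variable with its new ID and original name.

    Variables without an alias entry keep their file name as their ID.
    Simplification: name_in_file is recorded for every variable.
    """
    return [
        {"id": aliases.get(name, name), "name_in_file": name}
        for name in names_in_file
    ]


# Hypothetical alias file contents and variable names.
aliases = parse_alias_file("tas ta_surface\npr precip\n")
print(apply_aliases(["tas", "pr", "clt"], aliases))
```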

Generating Metadata for a File

A single file can be accessed directly in CDMS, without ingesting. However, frequently it is useful to generate an ASCII description of the metadata in the file. To do this, use the filename as the template argument:

cdimport . clt.nc sample
