Here are some more common tasks I’ve come across when needing to edit netCDF files. This is usually when they need to be ingested into different models or post-processing scripts that require the netCDF files to be in a certain format.

Deleting a global attribute

You want to delete a single global attribute from a netCDF file.

This can be done using ncatted, e.g.:

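ncatted -a global_attr_name,global,d,, infile.nc outfile.nc

The -a option takes five arguments separated by commas: attribute name, variable name, mode, attribute type, and attribute value. For a global attribute the variable name is global. Since we are specifying deletion (d), only the first three arguments are needed, but the remaining commas must be typed in.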

Convert a variable type

A variable is of incorrect type and you need to change it. You can use ncap2 (nc arithmetic processing).

ncap2 -s 'time=float(time)' infile.nc outfile.nc

This assumes the variable already exists in the file. The -s option specifies that we are providing an inline script, within the quote marks.

Add a variable mapped over a certain dimension

You want to add a variable that iterates over a given dimension, such as time. The variable should increase monotonically (i.e. increase by n at each step until the end of the dimension is reached). I often find I need to do this after merging netCDF files that were single time slices from model output, satellite data, or similar. ncap2 is used for this.

ncap2 -s 'time[$time]=array(54760,30,$time)' infile.nc outfile.nc

We are assigning the current time variable (assuming we have already added this) an array of values, specified by the (start_point, step, dimension). In this case, we get an array of values starting at 54760, increasing by 30 each point, as long as the time dimension. The -s option simply means we are giving an inline script as the input to the ncap2 program.
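To make the (start_point, step, dimension) arguments concrete, here is a minimal Python 3 sketch of the values such a call would generate. The helper name ncap2_array and the dimension length of 4 are illustrative assumptions; it only mimics the arithmetic and does not call NCO.

```python
def ncap2_array(start, step, length):
    """Mimic ncap2's array(start, step, $dim): one value per point along the dimension."""
    return [start + step * i for i in range(length)]


# Pretend the time dimension has length 4
time_values = ncap2_array(54760, 30, 4)
print(time_values)  # [54760, 54790, 54820, 54850]
```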

Add an attribute one at a time

You want to add an attribute to a variable (i.e. metadata such as units). We can use ncatted (the netCDF attribute editor) for this.

ncatted -a attribute,variable,a,c,"Attribute Value" infile.nc

The -a option introduces the attribute edit, and the a in the third field selects append mode. Because no output file is given, infile.nc is edited in place. The value of the attribute is given in the quotation marks. The NCO documentation also suggests putting single quotation marks around the whole comma-separated argument, but I found this produced unexpected results where the double quotes were escaped and inserted into the actual attribute value as well. Could possibly be a unix thing though…
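If the quoting gets fiddly, one option is to assemble the argument in a script and pass it to ncatted as a single list element, so the shell never reinterprets the quotes. The following Python 3 sketch only builds and prints the command. The helper names, the variable name temperature and the attribute value are made up for illustration, and the field order (attribute name, variable name, mode, type, value) is taken from the ncatted usage shown above.

```python
import shlex


def ncatted_arg(att_name, var_name, mode, att_type="", value=""):
    """Join the five ncatted -a fields with commas; empty fields keep their comma."""
    return ",".join([att_name, var_name, mode, att_type, str(value)])


def ncatted_command(arg, infile, outfile=None):
    """Return the argument list; without an outfile, ncatted edits infile in place."""
    cmd = ["ncatted", "-a", arg, infile]
    if outfile is not None:
        cmd.append(outfile)
    return cmd


append_units = ncatted_command(
    ncatted_arg("units", "temperature", "a", "c", "degrees K"), "infile.nc"
)
print(shlex.join(append_units))  # shell-quoted form, for inspection only
```

To actually run it, the list could be handed to subprocess.run(append_units, check=True), assuming NCO is installed.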

A netcdf bug to watch out for…


NCO (NetCDF Operators)

You can merge netcdf files with the nco package. NCO is a set of Linux command line utilities for performing common operations on netcdf files. This is useful for merging a set of files such as:

ModelRun_Jan.nc
ModelRun_Feb.nc
ModelRun_Mar.nc
...

Concatenating files, creating a new dimension in the process

To concatenate files, use ncecat:

ncecat -O *.nc merged.nc

This will merge all the netcdf files in a folder, creating a new record dimension if one does not exist. The record dimension is often the time dimension, for example if you have a set of netCDF files, with each one representing some spatial field at a given timestep. If appropriate, you can rename this record dimension to something more useful using the ncrename utility (another utility in the NCO package).

Renaming dimensions

ncrename -d record,time merged.nc

The -d flag specifies that we are going to rename a dimension in the netcdf file, from “record” to “time”. There are also flags to rename variables and attributes; see the ncrename manual page.

Removing degenerate dimensions

To remove degenerate dimensions (by averaging over the dimension to be removed):

ncwa -a dim_name input.nc output.nc

ncwa (“Weighted average”) will average variables over the specified dimension. If our dimension is degenerate (size 1), then this is effectively a way to remove that dimension without changing any of the variable data (since it is averaging each variable over a single value).
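The claim that averaging over a size-one dimension leaves the data unchanged can be checked with a small pure-Python sketch. The function name and the sample values below are made up for illustration; ncwa itself is not involved.

```python
def average_over_first_axis(field):
    """Average a 2-D nested list over its first axis, returning a 1-D list."""
    n = len(field)
    return [sum(column) / n for column in zip(*field)]


# Shape (1, 3): the first dimension is degenerate
degenerate = [[280.5, 281.0, 279.25]]
collapsed = average_over_first_axis(degenerate)
print(collapsed)  # [280.5, 281.0, 279.25]
```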

Adding a new variable

If we need to add a new variable, this can be done with ncap2 (ncap is deprecated).

ncap2 -s 'new_dim[$new_dim]=1234' infile.nc outfile.nc

Note that this will add a variable holding a single value: 1234.

Changing a variable to vary over a newly added dimension

If we add a new dimension, the existing variables will not automatically be functions of this new dimension. So if we were to add a time dimension, we need to recreate our variables to remap over this new dimension (assuming this is correct and appropriate for that particular variable/dimension combination.)

ncap2 -s 'Var_new[$dim1, $dim2, $new_dim3]=Var_old' input.nc output.nc

Further nco utilities

The available utilities with nco are:


  • ncap2 - arithmetic processor
  • ncatted - attribute editor
  • ncbo - binary operator
  • ncdiff - differencer
  • ncea - ensemble averager
  • ncecat - ensemble concatenator
  • ncflint - file interpolator
  • ncks - kitchen sink (extract, cut, paste, print data)
  • ncpdq - permute dimensions quickly
  • ncra - record averager
  • ncrcat - record concatenator
  • ncrename - renamer
  • ncwa - weighted averager

CDO (Climate Data Operators)

This is an equally capable set of netCDF tools written by the Max Planck Institute for Meteorology.

CDO tools page


Here are some notes on the various ways to check disk space usage on Linux:

Disk usage of all (non-hidden) files and folders

Use the du command (disk usage) with the -s (summarize) and -h (human-readable) options:

du -sh *

Prints a list of all files and folders in the current directory. The sizes given for folders include all their subdirectories and any files they contain.

BONUS: You can add the -c option to get the total size of all the files that this command lists. (I.e. du -sch *)

Total disk usage including hidden files

This is useful, particularly if you are trying to get comparable outputs from the quota command (see below). Hidden files are not included by default, so we have to use wildcard matching like so:

du -sch .[!.]* * 

The leading . matches names beginning with a dot, but we want to exclude .., since this refers to the directory above (as in cd ..) and we don’t want to include that. To exclude it we add [!.], which matches any single character except a dot, so the pattern matches a dot followed by a non-dot. The final asterisk is to match all the non-hidden files as before.
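Python's fnmatch module understands the same bracket syntax, so the pattern can be tried out on a list of names without touching the file system. This is only a sketch with invented file names. Note that a name starting with two dots (such as ..odd) is skipped by this pattern as well.

```python
from fnmatch import fnmatchcase

# Invented directory listing, including the special entries . and ..
names = [".", "..", ".bashrc", ".config", "..odd", "notes.txt"]

hidden = [name for name in names if fnmatchcase(name, ".[!.]*")]
print(hidden)  # ['.bashrc', '.config']
```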

Another way to do this is to specify the -ahd1 set of options:

du -ahd1

However, this also includes the file ., which refers to the current directory. If you add the -c option to this

Sorted disk usage

The easiest way to do this is to just pipe the results into the sort command. Because the -h option of du prints sizes with unit suffixes (K, M, G), use the -h (human-numeric) flag of GNU sort rather than -n, which would compare only the leading number and ignore the units.

du -sch * | sort -h

This will sort them by size in ascending order. Have a look at the sort manual pages for more sorting options.
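The unit suffix matters when ordering human-readable sizes. The Python 3 sketch below compares an ordering that looks only at the leading number with one that converts each size to bytes first. The entries are invented, and the suffixes are assumed to be the binary K/M/G/T units that du -h prints.

```python
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a du -h style size such as '4.0K' or '1.5G' to a number of bytes."""
    if size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size)  # bare number: treat as bytes


# Invented du -sh output as (size, name) pairs
entries = [("1.5G", "data"), ("12K", "notes"), ("300M", "plots"), ("4.0K", "todo.txt")]

by_number = sorted(entries, key=lambda e: float(e[0][:-1]))  # ignores the suffix
by_size = sorted(entries, key=lambda e: to_bytes(e[0]))      # respects the suffix

print([name for _, name in by_number])  # ['data', 'todo.txt', 'notes', 'plots']
print([name for _, name in by_size])    # ['todo.txt', 'notes', 'plots', 'data']
```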

Disk usage by file system

df is a slightly different Unix command which lists free disk space by file system. Typing it with no options will give a list of all mounted file systems, their total size, how much space has been used and is available, and their mount points. The -h option gives a more human-readable form.

df -h
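If you need the same numbers inside a script, Python's standard library offers shutil.disk_usage. A minimal sketch, assuming you want the file system that holds / (swap in any path you like):

```python
import shutil

# Named tuple with total, used and free space in bytes
usage = shutil.disk_usage("/")

gib = 1024 ** 3
print(
    f"total {usage.total / gib:.1f} GiB, "
    f"used {usage.used / gib:.1f} GiB, "
    f"free {usage.free / gib:.1f} GiB"
)
```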

Disk quota information

On systems where you have an allocated disk quota, the quota utility tells you how much of your quota has been used and how much you have available. The -s option gives a nice summary of disk quota:

quota -s

Outputs:

Disk quotas for user bob (uid 123456): 
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
mydomain:/disk/someserver/u1234//dvalters
                  2000M  15000M  15000M           12000       0       0  

The space output is sometimes replaced with blocks, which is a slightly less helpful measure. On POSIX systems, 1 block is defined as 512 bytes. space (or blocks) tells you how much you have used, and quota is your total quota allowance. If you are over the quota, an asterisk appears next to the number for space/blocks.

Other utilities

I have only covered the most common GNU/Linux utilities for monitoring/measuring disk usage. There are other utilities available that have nicer outputs by default, such as ncdu, freespace, etc.


Optimisations with python data structure generation

This post explores some of the differences in using list comprehensions and generator expressions to build lists in Python. We are going to look at the memory and CPU performance using various Python profiling tools and packages.

Firstly a list comprehension can be built in python using:

mylist = [x for x in somedata]

which constructs a list with one entry for every item in somedata. The list comprehension may also be built from the items returned by a function, e.g. [x for x in somefunc()]. The important thing to note is that a list comprehension evaluates the whole list at once. A generator expression, in contrast, is lazy: it produces items one at a time, so a consumer that short-circuits, such as any(), can exit early without evaluating the rest. Generators can be a useful alternative where an algorithm is likely to finish early if certain conditions are met. For example:

# Using a list comprehension
mybool = any([x for x in somefunc()])

# Using a generator expression
mybool = any(x for x in somefunc())

In the case of the list comprehension the entire list is evaluated first, and then run through the any() function. In the generator case, any() tests each item as somefunc() produces it, and as soon as one is true it returns early without having to build the entire list.
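The early exit is easy to observe by counting how many items are actually produced. In this Python 3 sketch, somefunc is a stand-in generator (its contents are invented) that bumps a counter for each value it yields:

```python
calls = 0


def somefunc():
    """Stand-in data source that counts how many items it has produced."""
    global calls
    for value in [0, 0, 1, 0, 0]:
        calls += 1
        yield value


calls = 0
any([x for x in somefunc()])
list_calls = calls  # the whole list is built before any() runs

calls = 0
any(x for x in somefunc())
gen_calls = calls   # any() stops at the first truthy value

print(list_calls, gen_calls)  # 5 3
```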

In theory then, generator expressions offer a potential performance benefit compared to their list comprehension counterparts. But how does it play out in practice?

Testing

We’re going to use an example that builds lists of random strings. Here’s a function that returns a random string:

import random
import string

def randomword(length):
    # ascii_lowercase is available in both Python 2 and Python 3
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))

Now we need a function that builds the lists using each method. First the list comprehension way:

def list_strings_comprehension(length):
    list_strings = [randomword(5) for i in range(length)]
    list_strings.sort()
    return list_strings

And now a function that uses the generator approach:

def list_strings_generator(length):
    listgen_strings = sorted(randomword(5) for i in range(length))
    return listgen_strings

Let’s also create some functions for testing ints, just to see if there is any difference with the data type.

def list_ints_comprehension(length):
    list_ints = [i for i in range(length)]
    list_ints.sort()
    return list_ints

def list_ints_generator(length):
    listgen_ints = sorted(i for i in range(length))
    return listgen_ints

timeit

Now we are going to test these methods with timeit, which is part of the Python standard library. Using IPython, this can be run with the magic command: %timeit [FUNCTION_NAME(ARGS)]. Help for this command is accessed with %timeit?.

Using our integer list building methods:

%timeit list_ints_comprehension(100000)
#>> 100000 loops, best of 3: 8.74 ms per loop

%timeit list_ints_generator(100000)
#>> 100000 loops, best of 3: 11.2 ms per loop

So, it would appear at first approximation, the generator approach is slower, at least for building a list of integers this size.
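Outside IPython, a similar comparison can be made with the timeit module directly. This is a sketch written for Python 3; the number of calls and repeats are arbitrary choices, and the timings will of course differ from machine to machine.

```python
import timeit


def list_ints_comprehension(length):
    list_ints = [i for i in range(length)]
    list_ints.sort()
    return list_ints


def list_ints_generator(length):
    return sorted(i for i in range(length))


CALLS = 5  # calls per timing run

# Best total time over 3 repeats, each repeat making CALLS calls
t_comp = min(timeit.repeat(lambda: list_ints_comprehension(100000), number=CALLS, repeat=3))
t_gen = min(timeit.repeat(lambda: list_ints_generator(100000), number=CALLS, repeat=3))

print(f"comprehension: {t_comp / CALLS * 1e3:.2f} ms per call")
print(f"generator:     {t_gen / CALLS * 1e3:.2f} ms per call")
```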

Memory usage

Now let’s investigate the impact on memory use. Memory use is tricky to measure in Python, as objects can have a deeply nested structure, making it difficult to fully trace the memory footprint of objects. The python interpreter also performs garbage collection at certain intervals, meaning it can be difficult to reproduce tests of memory consumption.

First we’re going to use the built in sys.getsizeof()

import sys

# Eight
listy = list_strings_comprehension(8)
print "Eight Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(8)
print "Eight Genny: ", sys.getsizeof(genny)

#>> Eight Listy:  136
#>> Eight Genny:  168

# Ten
listy = list_strings_comprehension(10)
print "Ten Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(10)
print "Ten Genny: ", sys.getsizeof(genny)

#>> Ten Listy:  200
#>> Ten Genny:  168

# 100
listy = list_strings_comprehension(100)
print "Small Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(100)
print "Small Genny: ", sys.getsizeof(genny)

#>> Small Listy:  920
#>> Small Genny:  992

# 1000
listy = list_strings_comprehension(1000)
print "Medium Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(1000)
print "Medium Genny: ", sys.getsizeof(genny)

#>> Medium Listy:  9032
#>> Medium Genny:  8552

# One million
listy = list_strings_comprehension(1000000)
print "Big Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(1000000)
print "Big Genny: ", sys.getsizeof(genny)

#>> Big Listy:  8697472
#>> Big Genny:  8250176

Interestingly, the list built from the generator expression is smaller in most cases, except for the examples with eight and 100 strings. For the larger lists (1000 strings and above), the generator approach consistently outperforms the list comprehension method in terms of its memory footprint, when building lists of strings and measuring with the sys.getsizeof() function. Bear in mind that sys.getsizeof() reports only the size of the list object itself, not of the strings it refers to.
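It is worth remembering what sys.getsizeof() measures for a list: the list object and its array of references, not the objects it points to. The Python 3 sketch below builds two lists of equal length, one of short strings and one of long ones, and compares the shallow size with a simple one-level deep size. The deep_size helper is a hypothetical illustration, not a replacement for a proper tool.

```python
import sys

# Two lists of 1000 distinct strings: 5 characters each versus 500 characters each
short = [str(i).zfill(5) for i in range(1000)]
long_ = [str(i).zfill(500) for i in range(1000)]

# Shallow size: identical, because only the references are counted
print(sys.getsizeof(short), sys.getsizeof(long_))


def deep_size(strings):
    """Shallow size of the list plus the size of each string it refers to."""
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in strings)


# One-level deep size: the list of long strings is much larger
print(deep_size(short), deep_size(long_))
```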

pympler asizeof()

The pympler package is reportedly more accurate at determining the true memory footprint of a Python object. Using the asizeof() method with the same tests as above, we get:

from pympler import asizeof

listy = list_strings_comprehension(1000)
print "Medium Listy: ", asizeof.asizeof(listy)

genny = list_strings_generator(1000)
print "Medium Genny: ", asizeof.asizeof(genny)

#>> Medium Listy:  57032
#>> Medium Genny:  56552

listy = list_strings_comprehension(100000)
print "Big Listy: ", asizeof.asizeof(listy)

genny = list_strings_generator(100000)
print "Big Genny: ", asizeof.asizeof(genny)

#>> Big Listy:  5624472
#>> Big Genny:  5679848

listy = list_strings_comprehension(1000000)
print "Million Listy: ", asizeof.asizeof(listy)

genny = list_strings_generator(1000000)
print "Million Genny: ", asizeof.asizeof(genny)

#>> Million Listy:  56697472
#>> Million Genny:  56250176

memory_profiler

Another option is the memory_profiler package. This provides another IPython magic command: %memit, which can be used like so:

import gc
gc.collect() # Run the garbage collector first.

%memit -i 0.000001 list_strings_comprehension(1000000)
#>> peak memory: 230.33 MiB, increment: 48.00 MiB

gc.collect()
%memit -i 0.000001 list_strings_generator(1000000)
#>> peak memory: 233.61 MiB, increment: 51.27 MiB

Pympler’s asizeof() says the list comprehension is bigger in two of the three tests, whereas memory_profiler says the generator has the bigger memory footprint…TBC
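A further option, from the standard library in Python 3, is tracemalloc, which records the peak of traced allocations while a piece of code runs. The sketch below is one way it might be used here. The two builder functions are simplified stand-ins that use str(i) rather than random strings so the run is repeatable, and peak_bytes is a hypothetical helper.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Return the peak traced allocation (in bytes) while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_with_comprehension(n):
    items = [str(i) for i in range(n)]
    items.sort()
    return items


def build_with_generator(n):
    return sorted(str(i) for i in range(n))


print("comprehension peak:", peak_bytes(build_with_comprehension, 100000))
print("generator peak:    ", peak_bytes(build_with_generator, 100000))
```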


The Weather Research and Forecasting model (WRF) can be initialised with a range of input data sources for simulations. The initialisation step describes the setting of grid parameters within the model domain (pressure, surface variables, etc.) as well as defining the boundary conditions for the model. If you have followed the excellent WRF tutorial and run a few of the case studies with real data, the input data is provided for you and is already tested to ensure it can be pre-processed relatively painlessly by the WPS (WRF pre-processing system). Datasets from North American providers are extensively tested with WRF (e.g. GFS data (global), AWIP (North American continent area)).

I’ve recently begun using ECMWF data to initialise WRF simulations and a few extra steps not shown in the standard tutorials were required to get the data pre-processed correctly by WPS. Certain ECMWF datasets, such as reanalysis data, can be accessed and downloaded freely from their public data portal. Global reanalysis data is available for the 20th century and interim data for the most recent years. In this example I’m using the ERA-20C global reanalysis dataset to set up a case study of some severe storm events over Great Britain in 2005. (The ERA-20C actually extends into the 21st century as well now).

The data come in several different sets; as a minimum for initialising the WRF model you will need some surface variable data, and then pressure level data (pressures at different heights in the atmosphere). You could also use model level data directly, but WRF can interpolate this for you from the pressure level data. There is one other data set you need, which is a land-sea surface mask. This is called invariant data, as it does not change over time, and is found under the ‘invariant’ tab on the ECMWF page. Technically, WRF already has invariant data bundled with it but I found I had to download the ECMWF land-sea mask separately for WPS to work correctly.

To summarise, you need three separate data files from the reanalysis data:

  1. Surface variable data
  2. Pressure level data (or model levels)
  3. Land-sea mask.

Downloading the data

You’ll be required to select which surface variables you want to download, as well as which pressure levels. The land-sea mask is just a single file. Although probably not the most efficient method, I tend to just select all the surface variables for the surface data (you don’t actually need all of them to initialise the model, but I find it easier to just download everything in case it’s required later). Once you’ve selected the date range and variables of interest, you can proceed directly to the download by clicking the GRIB or netCDF download buttons.

Downloading via Python script

ECMWF have provided a very handy python API for downloading data without necessarily having to use the web interface, which can save time if you already know exactly which data fields you want. The details of the Python API are here, it’s well explained so I will only summarise here what you need to do:

  1. Register on the ECMWF site
  2. Download an access key to be placed on the computer or system you are downloading to.
  3. Install the ecmwfapi Python module.
  4. Either write your own python script using the API documentation above, or have the ECMWF web interface generate one for you. (Click the ‘view MARS request’ after making your selections, and then copy the Python script that is displayed.) An example python script looks like this:

#!/usr/bin/env python
from ecmwfapi import ECMWFDataServer
server = ECMWFDataServer()
server.retrieve({
    "class": "e2",
    "dataset": "era20c",
    "date": "2009-06-01/to/2009-06-30",
    "expver": "1",
    "levtype": "sfc",
    "param": "15.128/16.128/17.128/18.128/31.128/32.128/33.128/34.128/35.128/36.128/37.128/38.128/39.128/40.128/41.128/42.128/53.162/54.162/55.162/56.162/57.162/58.162/59.128/59.162/60.162/61.162/62.162/63.162/64.162/65.162/66.128/66.162/67.128/67.162/68.162/69.162/70.162/71.162/72.162/73.162/74.162/75.162/76.162/77.162/78.128/78.162/79.128/79.162/80.162/81.162/82.162/83.162/84.162/85.162/86.162/87.162/88.162/89.162/89.228/90.162/90.228/91.162/92.162/131.228/132.228/134.128/136.128/137.128/139.128/141.128/148.128/151.128/159.128/164.128/165.128/166.128/167.128/168.128/170.128/174.128/183.128/186.128/187.128/188.128/198.128/206.128/229.128/230.128/231.128/232.128/235.128/236.128/238.128/243.128/244.128/245.128/246.228/247.228",
    "stream": "oper",
    "time": "00:00:00",
    "type": "an",
    "target": "CHANGEME",
})

ECMWF provide the data in two different file formats, GRIB (gridded-binary) and netCDF (.nc files). WPS comes with the ungribber tool (ungrib.exe) so I’ve gone for the grib data format here. (Selected by default in the Python download script).

Important: Retrieving the data in Gaussian gridded format

By default, the ECMWF site will download the data on a spherical harmonic grid, and the version of ungrib supplied in WPS v3.8.1 will not be able to decode this properly. (You may get the error Unknown ksec2(4): 50 in the ungrib error log, if you try this). Ungrib expects a regular Gaussian grid, and I couldn’t find a way to easily access this from the web portal interface. However, using the Python script above, you can easily request Gaussian gridded data by adding the parameter:

"grid": "160",

to the python download script, and your grib data will be supplied in Gaussian grid format. The CHANGEME value is the name of the downloaded file and you should probably change it to something meaningful. The python script will download the grib file to the same directory it is run in. For the example in this post, I ended up with three python scripts, one for the surface data, one for pressure levels, and one for the land-sea mask. (You could of course bundle them all in to one script).
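As a sketch of how the request might look once the grid parameter and a meaningful target are in place, the snippet below builds the dictionary without contacting the server. The helper name with_gaussian_grid and the target file name are made up for illustration, and the param list is trimmed to a few of the codes from the script above.

```python
def with_gaussian_grid(request, target, grid="160"):
    """Return a copy of a MARS request dict with "grid" and "target" set."""
    updated = dict(request)
    updated["grid"] = grid
    updated["target"] = target
    return updated


# Shortened version of the surface request shown earlier (param list trimmed)
surface_request = {
    "class": "e2",
    "dataset": "era20c",
    "date": "2009-06-01/to/2009-06-30",
    "expver": "1",
    "levtype": "sfc",
    "param": "134.128/165.128/166.128/167.128",
    "stream": "oper",
    "time": "00:00:00",
    "type": "an",
    "target": "CHANGEME",
}

request = with_gaussian_grid(surface_request, target="era20c_surface_200906.grib")
print(request["grid"], request["target"])

# With ecmwfapi installed and an access key in place, the download would be:
# from ecmwfapi import ECMWFDataServer
# ECMWFDataServer().retrieve(request)
```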

Ungribbing the data

Now we need to ‘ungrib’ the data to convert it into the WPS intermediate file format, before running metgrid. This has to be done in two stages: one for the land-sea mask, and one for the surface and pressure level data. This is because the land-sea mask has a ‘start date’ of 1900-01-01. If you try to run ungrib on it with the date of your case study, ungrib will fail, complaining that the dates specified could not be found. For the land-sea mask, I set the &ungrib section of the namelist.wps file to the following:

&ungrib
  out_format = 'WPS',
  prefix = 'FIX',
/

Link your land-sea mask to the WPS directory with the link_grib.csh script and run ungrib. Repeat for the surface and pressure data but with the following section in the namelist.wps file:

&ungrib
  out_format = 'WPS',
  prefix = 'FILE',
/

To be honest, you can use whatever file prefixes you like, but I like to use this naming convention.

Make sure you have linked the correct Vtable before running ungrib. The ECMWF Vtable supplied with WRF v3.8.1 worked fine without any modifications in this case. The Vtable can be found in WPS/ungrib/Variable_Tables/Vtable.ECMWF

Metgrid

Metgrid interpolates your ungribbed data files over the model domain. (I haven’t gone through the model domain generation stage with geogrid.exe as this blog post is only about prepping the input data.)

If you are using the sea surface temperatures field from the ECMWF data, you’ll need to make a few changes to the METGRID.TBL file for it to correctly interpolate and mask the sea surface temperatures around land. In the METGRID.TBL file (located in WPS/metgrid/), change the entry of the SST field to the following:

SST
  interp_option=sixteen_pt+four_pt+wt_average_4pt+wt_average_16pt+search
  missing_value=-1.E30
  masked=land
  interp_mask=LANDMASK(1)
  fill_missing=0.
  flag_in_output=FLAG_SST 

The changes are to make the interp_mask use the LANDMASK mask instead of LANDSEA (the default), and to change the interpolation option slightly. Without the changes, I found that for my inner domain the sea surface temperatures were incorrectly masked, and had been interpolated over land as well. Although metgrid.exe did not complain when run, the met files generated caused an error when real.exe was run - generating an error message saying:

-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    2970
mismatch_landmask_ivgtyp
-------------------------------------------

The changes to METGRID.TBL should remedy this error message.

Before running metgrid, there is one last change to make to the namelist.wps file:

&metgrid
  fg_name = 'FILE',
  constants_name = 'FIX:1900-01-01_00',
/

This tells metgrid to use the invariant data we downloaded earlier as a constant field (i.e. the land-sea mask doesn’t change over time, so it doesn’t need to be interpolated for each period). Use whichever name you have used for this invariant file.
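The constants_name value follows the naming pattern ungrib uses for its intermediate files, PREFIX:YYYY-MM-DD_HH, which is why the 1900-01-01 start date of the land-sea mask shows up here. A small Python sketch of that pattern (the helper name is hypothetical):

```python
from datetime import datetime


def intermediate_name(prefix, when):
    """Build an intermediate file name of the form PREFIX:YYYY-MM-DD_HH."""
    return f"{prefix}:{when:%Y-%m-%d_%H}"


constants_name = intermediate_name("FIX", datetime(1900, 1, 1, 0))
print(constants_name)  # FIX:1900-01-01_00
```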

Now you can run metgrid. Hopefully it will produce all the input met files needed, and correctly interpolated. It’s a good idea before running real.exe and wrf.exe to check that the fields look reasonable. In particular, check the SST field if you are using sea-surface temperatures. Check the nested domains as well - I found that my SST field had been incorrectly interpolated in the inner domain, which required the change to the METGRID.TBL file above. ncview is a useful utility for checking the met files generated by metgrid.

Real and WRF

You are now set to run real.exe to generate the lateral boundary conditions and initialise the model, followed (finally) by wrf.exe to run the simulation.

Sources

The following discussion boards and mailing list answers were helpful in preparing this post:

Also two useful blogposts from NCAR: Analysis, Forecast, Reanalysis, What’s the difference?.

WRF-able datasets.