These notes record how to mask (set to a no-data or missing-data value) a single netCDF variable, using a landsea mask.
Problem: We have a netCDF file containing a variable mapped over the globe. The variable should not have any values over the sea areas, because that doesn’t make sense for this particular data. (Let’s say it’s output from some post processed satellite data for argument). All the data points over seas and oceans are zero, but we can’t use zero as ‘NoData’ becuase zero is a valid value over land as well.
We want to use the
_FillValue attribute of netCDF variables as a mask, so that when plotting or calculations are done using the variable, the sea values are ignored.
We also have a binary 1/0 land-sea mask in a separate file, at the same resolution as out JULES out put data. The land values have 1, and the seas/oceans have 0.
First copy over the landsea mask variable from landsea.nc to the file to be masked
ncks -A -v lsmask landsea.nc data_to_be_masked.nc
Then make sure that the variable in the other file (gpp_gb) has a _FillValue set correctly (using -9999 here, but any convention will do (not zero though…)
ncatted -a _FillValue,gpp_gb,o,f,-9999 data_to_be_masked.nc
Then use ncap2 to set all the gpp values to _FillValue (the mask) whereever the landsea mask is not land (i.e. not == 1)
This is written to a new file.
``` ncap2 -s 'where(lsmask != 1) gpp_gb=gpp_gb@_FillValue' data_to_be_masked.nc data_after_masking_done.nc ```
The output variable
gpp_gb now has the sea values set to the Nodata FillValue.
Here are some more common tasks I’ve come across when needing to edit netCDF files. This is usually when they need to be ingested into different models or post-processing scripts that require the netCDF files to be in a certain format.
Deleting a global attribute
You want to delete a single global attribute from a netCDF file.
This can be done using
ncatted -a global_attr_name,d,, infile.nc outfile.nc
This command takes for arguments separated by commas. Since we are specifying deletion, (d), only the first two arguments are needed, but the remaining commas bust be typed in.
Convert a variable type
A variable is of incorrect type and you need to change it. You can use
ncap2 (nc arithmetic processing).
ncap2 -s 'time=float(time)'
Assumes you already have the variable defined. The
-s option specifies that we are providing an inline script, within the quote marks.
Add a variable mapped over a certain dimension
You want to a variable that iterates over a given dimension, such as time. The variable should increase montonically (i.e. increase by n each time until the end of the dimension length is reached. I often find I need to do this after having merged netCDF files that were single time slices from a model output or satellite data or otherwise.
ncap2 is used.
ncap2 -s 'time[$time]=array(54760,30,$time)' infile.nc outfile.nc
We are assigning the current time variable (assuming we have already added this) an array of values, specified by the
(start_point, step, dimension). In this case, we get an array of values starting at 54760, increasing by 30 each point, as long as the
time dimension. The
-s option simply means we are giving an inline script as the input to the
Add an attribute one at a time
You want to an attribute to a variable. (I.e. metadata attributes for variables, such as units, etc.). We can use ncatted for this. (netCDF attribute editor).
ncatted -a attribute,variable,a,c,"Atrribute Value" infile.nc
-a option specifies append mode, and so we only need to supply the input file
infile.nc. The value of the attribute is given in the quotation marks. The nco documentation suggested also putting single quotation marks around the comma-separated arguments as well, but I found this produced unexpected results where the double quotes were escaped and inserted into the actual attribute value as well. Could possibly be a unix thing though…
A netcdf bug to watch out for…
NCO (NetCDF Operators)
You can merge netcdf files with the
nco package. NCO is a set of linux command line utilities for performing common operations on netcdf files. This is useful for mergeing a set of files such as:
ModelRun_Jan.nc ModelRun_Feb.nc ModelRun_Mar.nc ...
Concatenating files, creating a new dimension in the process
To concatenate files, use
ncecat *nc -O merged.nc
This will merge all the netcdf files in a folder, creating a new record dimension if one does not exist. The record dimension is often the time dimension, for example if you have a set of netCDF files, with each one representing some spatial field at a given timestep. If appropriate, you can rename this record dimension to something more useful using the
ncrename utility (another utility in the NCO package).
ncrename -d record,time merged.nc
The -d flag specifies that we are going to rename the dimension in the netcdf file., from “record” to “time”. There are also flags to rename other attributes, see the ncrename manual page
Removing degenerate dimensions
To remove degenerate dimensions (by averaging over the dimension to be removed):
ncwa -a dim_name input.nc output.nc
ncwa (“Weighted average”) will average variables over the specified dimension. If our dimension is degenerate (dim = 1), then this is effectively a way to remove that dimension without changing any of the variable data (Since it is averaging the variable over 1).
Adding a new variable
If we need to add a new variable, this can be done with
ncap is deprecated).
Note that this will add a single value of the time variable: 1234.
Changing a variable to vary over a newly added dimension
If we add a new dimension, the existing variables will not automatically be functions of this new dimension. So if we were to add a time dimension, we need to recreate our variables to remap over this new dimension (assuming this is correct and appropriate for that particular variable/dimension combination.)
ncap2 -s 'Var_new[$dim1, $dim2, $new_dim3]=Var_old' input.nc output.nc
Further nco utilities
The available utilities with nco are:
The NCO utilities are
- ncap2 - arithmetic processor
- ncatted - attribute editor
- ncbo - binary operator
- ncdiff - differencer
- ncea - ensemble averager
- ncecat - ensemble concatenator
- ncflint - file interpolator
- ncks - kitchen sink (extract, cut, paste, print data)
- ncpdq - permute dimensions quickly
- ncra - running averager
- ncrcat - record concatenator
- ncrename - renamer
- ncwa - weighted averager
CDO (Climate Data Operators)
This is an equally capable set of netCDF tools written by the Max Planck Institute for Meteorology.
Here are some notes on the various ways to check disk space usage on linux:
Disk usage of all (non-hidden) files and folders
du command (disk use) with the
-s (summarize) and
-h (human-readable) options.
du -sh *
Prints a list of all files and folders in the current directory. Folder/directory sizes given includes all the subdirectories and any files they contain.
BONUS: You can add the
-c option to get the total size of all the files that this command lists. (I.e.
du -sch *)
Total disk usage including hidden files
This is useful, particularly if you are trying to get comparable outputs from the
quota command (see below). Hidden files are not included by default, so we have to use wildcard matching like so:
du -sch .[!.]* *
. matches files beginning with a dot, but we want to exclude
.., since this would match the directory above (as in when you do
cd .. etc.) and we don’t want to include that. To exclude that pattern we add
[!.], in other words, match a single dot but not two in a row. The final asterisk is to match all the non-hidden files as before.
Another way to do this is to specify the
-ahd1 set of options:
However, this also includes the file
., which refers to the current directory. If you add the
-c option to this
Sorted disk usage
The easiest way to do this is to just pipe the results into the sort command. To sort them numerically, we can just add the
-n flag to sort.
du -sch * | sort -n
This will sort them numerically in ascending order. Have a look at the sort manual pages for more sorting opttions.
Disk usage by file system
df is a slightly different unix command which lists disk free space by file system. Typing it with no options will give a list of all mounted file systems, their total size available, how much space has been used, and their linux mount points. The
-h option gives a more human-readable form.
Disk quota information
On systems where you have an allocated disk quota, the
quota utility tells you how much of your quota has been used and how much you have available. The
-s option gives a nice summary of disk quota:
Disk quotas for user bob (uid 123456): Filesystem space quota limit grace files quota limit grace mydomain:/disk/someserver/u1234//dvalters 2000M 15000M 15000M 12000 0 0
space output is sometimes replaced with
blocks, which is a slightly less helpful measure. On POSIX systems, 1 block is befined as 512 bytes.
blocks tells you how much you have used,
quota is your total quota allowance. If you are over the quota, an asterisk appears next to the number for space/blocks.
I have only covered the most common GNU/Linux utilities for monitoring/measuring disk usage. There are othe utilities available that have nicer outputs by default, such as: ncdu, freespace etc.
Optimisations with python data structure generation
This post explores some of the differences in using list comprehensions and generator expressions to build lists in Python. We are going to look at the memory and CPU performance using various Python profiling tools and packages.
Firstly a list comprehension can be built in python using:
which constructs the list for every item in
somedata. The list comprehension may also be built out of a function that returns an item on return, e.g.
[x for x in somefunc()]. The important thing to note in list comprehensions is that the whole list is evaluated at once, this is in contrast to the generator expression which is “short-circuiting” and will exit early if the expression permits it so. Generators can be a useful alternative where an algorithm is likely to finish early if certain conditions are met. For example:
In the case of the list comprehension the entire list is evaluated first, and then run through the
any() function. In the generator case, the
any() test is evaluated every iteration of somefunc(), and if it returns true the any test can return early without having to build the entire list.
In theory then, generator expressions offer a potential performance benefit compared to their list comprehension counterparts. But how does it play out in practice?
We’re going to use an example that builds lists of random strings. Here’s a function that returns a random string:
Now we need a function that builds the lists using each method. First the list comprehension way:
And now a function that uses the generator approach:
Let’s also create some functions for testing ints, just to see if there is any difference with the data type.
Now we are going to test these methods with the
timeit command, built in to the python interpreter. Using IPython, this can be run using the command:
%timit [FUNCTION_NAME(ARGS)]. Help for this command is accessed with
Using our integer list building methods:
So, it would appear at first approximation, the generator approach is slower, at least for building a list of integers this size.
Now let’s investigate the impact on memory use. Memory use is tricky to measure in Python, as objects can have a deeply nested structure, making it diffcult to fully trace the memory footprint of objects. The python interpreter also performs garbage collection atcertain intervals, meaning it can be difficult to reproduce tests of memory consumption.
First we’re going to use the built in
Interestingly, the generator performs better in most cases, execpt for the smallest example with eight strings. With larger lists than this, the generator approach consistently outperforms the list comprehension method in terms of its memory footpring, when building lists of strings and measuring with the
The pympler package is reportedly more accurate at deteriming the true memory footprint of a Python object. USing the
asizeof() method with the same tests as above, we get:
Another option is the
memory_profiler package. This provides another IPython magic command:
%memit, which can be used like so:
asizeof() says the list comprehension is bigger,
memory_profiler says the generator sees the bigger memory footprint…TBC