Optimisations with python data structure generation

This post explores some of the differences in using list comprehensions and generator expressions to build lists in Python. We are going to look at the memory and CPU performance using various Python profiling tools and packages.

Firstly a list comprehension can be built in python using:

mylist = [x for x in somedata]

which builds a list containing every item in somedata. The items can also come from a function that returns an iterable, e.g. [x for x in somefunc()]. The important thing to note about a list comprehension is that the whole list is built at once. A generator expression, in contrast, is evaluated lazily: it produces items one at a time, so a short-circuiting consumer such as any() can stop early without evaluating the rest. Generators can be a useful alternative where an algorithm is likely to finish early if certain conditions are met. For example:

# Using a list comprehension
mybool = any([x for x in somefunc()])

# Using a generator expression
mybool = any(x for x in somefunc())

With the list comprehension, the entire list is built first and then passed to any(). With the generator expression, any() tests each item as somefunc() produces it, and returns as soon as one item is true, without having to build the entire list.
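
To see the difference concretely, here is a minimal sketch (written for Python 3). The somefunc() below is a hypothetical stand-in that records every item it produces, so we can count how much work each version does before any() returns.

```python
def somefunc(produced):
    """Hypothetical data source: yields 0..9 and records each item produced."""
    for value in range(10):
        produced.append(value)
        yield value

# List comprehension: all ten items are produced before any() runs.
list_calls = []
list_result = any([x for x in somefunc(list_calls)])

# Generator expression: any() stops at the first true item (the value 1).
gen_calls = []
gen_result = any(x for x in somefunc(gen_calls))

print(list_result, len(list_calls))  # True 10
print(gen_result, len(gen_calls))    # True 2
```

Both calls give the same answer, but the generator version only pulls two items from somefunc().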

In theory then, generator expressions offer a potential performance benefit compared to their list comprehension counterparts. But how does it play out in practice?

Testing

We’re going to use an example that builds lists of random strings. Here’s a function that returns a random string:

import random
import string

def randomword(length):
    # string.ascii_lowercase is available in both Python 2 and Python 3
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))

Now we need a function that builds the lists using each method. First the list comprehension way:

def list_strings_comprehension(length):
    list_strings = [randomword(5) for i in range(length)]
    list_strings.sort()
    return list_strings

And now a function that uses the generator approach:

def list_strings_generator(length):
    listgen_strings = sorted(randomword(5) for i in range(length))
    return listgen_strings

Let’s also create some functions for testing ints, just to see if there is any difference with the data type.

def list_ints_comprehension(length):
    list_ints = [i for i in range(length)]
    list_ints.sort()
    return list_ints

def list_ints_generator(length):
    listgen_ints = sorted(i for i in range(length))
    return listgen_ints

timeit

Now we are going to time these functions with timeit, a module in the Python standard library. In IPython it is available as a magic command: %timeit FUNCTION_NAME(ARGS). Help for this command is accessed with %timeit?.

Using our integer list building methods:

%timeit list_ints_comprehension(100000)
#>> 100000 loops, best of 3: 8.74 ms per loop

%timeit list_ints_generator(100000)
#>> 100000 loops, best of 3: 11.2 ms per loop

So, it would appear at first approximation, the generator approach is slower, at least for building a list of integers this size.
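
Outside IPython, a similar comparison can be sketched with the standard-library timeit module directly. This is a rough Python 3 sketch, not a careful benchmark: the list length and the number of repetitions are arbitrary choices made to keep it quick, and the timings will vary between machines and Python versions.

```python
import timeit

def list_ints_comprehension(length):
    list_ints = [i for i in range(length)]
    list_ints.sort()
    return list_ints

def list_ints_generator(length):
    return sorted(i for i in range(length))

n_calls = 20  # arbitrary number of repetitions
t_list = timeit.timeit(lambda: list_ints_comprehension(10000), number=n_calls)
t_gen = timeit.timeit(lambda: list_ints_generator(10000), number=n_calls)

# Report the average time per call in milliseconds
print("comprehension: %.3f ms per call" % (1000 * t_list / n_calls))
print("generator:     %.3f ms per call" % (1000 * t_gen / n_calls))
```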

Memory usage

Now let’s investigate the impact on memory use. Memory use is tricky to measure in Python, as objects can have a deeply nested structure, making it difficult to fully trace the memory footprint of objects. The Python interpreter also performs garbage collection at certain intervals, meaning it can be difficult to reproduce tests of memory consumption.

First we’re going to use the built-in sys.getsizeof():

import sys

# Eight
listy = list_strings_comprehension(8)
print "Eight Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(8)
print "Eight Genny: ", sys.getsizeof(genny)

#>> Eight Listy:  136
#>> Eight Genny:  168

# Ten
listy = list_strings_comprehension(10)
print "Ten Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(10)
print "Ten Genny: ", sys.getsizeof(genny)

#>> Ten Listy:  200
#>> Ten Genny:  168

# 100
listy = list_strings_comprehension(100)
print "Small Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(100)
print "Small Genny: ", sys.getsizeof(genny)

#>> Small Listy:  920
#>> Small Genny:  992

# 1000
listy = list_strings_comprehension(1000)
print "Medium Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(1000)
print "Medium Genny: ", sys.getsizeof(genny)

#>> Medium Listy:  9032
#>> Medium Genny:  8552

# One million
listy = list_strings_comprehension(1000000)
print "Big Listy: ", sys.getsizeof(listy)

genny = list_strings_generator(1000000)
print "Big Genny: ", sys.getsizeof(genny)

#>> Big Listy:  8697472
#>> Big Genny:  8250176

Interestingly, the list built with the generator approach is smaller in three of the five cases (10, 1,000 and 1,000,000 strings), but larger for 8 and 100 strings. So, when building lists of strings and measuring with the sys.getsizeof() function, the generator approach often has the smaller memory footprint, but not consistently.
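
One thing to keep in mind when reading these numbers is that sys.getsizeof() reports only the size of the list object itself (its array of references, including any spare capacity), not the strings it refers to. The following Python 3 sketch makes the distinction visible; the string contents are arbitrary and the "deep" figure is a naive sum, not what any profiling package computes.

```python
import sys

# 1000 separately built five-letter strings (contents are arbitrary)
words = ["".join(chr(97 + (i + j) % 26) for j in range(5)) for i in range(1000)]

# Shallow size: the list object only
shallow_size = sys.getsizeof(words)

# Naive deep size: the list plus each string it references.
# (A simple sum like this would double-count any shared objects.)
deep_size = shallow_size + sum(sys.getsizeof(w) for w in words)

print("shallow:", shallow_size)
print("deep:   ", deep_size)
```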

pympler asizeof()

The pympler package is reportedly more accurate at determining the true memory footprint of a Python object, because it also counts the objects that the object refers to. Using the asizeof() function with the same tests as above, we get:

from pympler import asizeof

listy = list_strings_comprehension(1000)
print "Medium Listy: ", asizeof.asizeof(listy)

genny = list_strings_generator(1000)
print "Medium Genny: ", asizeof.asizeof(genny)

#>> Medium Listy:  57032
#>> Medium Genny:  56552

listy = list_strings_comprehension(100000)
print "Big Listy: ", asizeof.asizeof(listy)

genny = list_strings_generator(100000)
print "Big Genny: ", asizeof.asizeof(genny)

#>> Big Listy:  5624472
#>> Big Genny:  5679848

listy = list_strings_comprehension(1000000)
print "Million Listy: ", asizeof.asizeof(listy)

genny = list_strings_generator(1000000)
print "Million Genny: ", asizeof.asizeof(genny)

#>> Million Listy:  56697472
#>> Million Genny:  56250176

memory_profiler

Another option is the memory_profiler package. This provides another IPython magic command: %memit, which can be used like so:

import gc
gc.collect() # Run the garbage collector first.

%memit -i 0.000001 list_strings_comprehension(1000000)
#>> peak memory: 230.33 MiB, increment: 48.00 MiB

gc.collect()
%memit -i 0.000001 list_strings_generator(1000000)
#>> peak memory: 233.61 MiB, increment: 51.27 MiB

Pympler’s asizeof() says the list comprehension result is slightly bigger for 1,000 and 1,000,000 strings (though not for 100,000), while memory_profiler says the generator version has the bigger memory footprint… TBC
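
As a further cross-check that needs no third-party packages, Python 3's standard-library tracemalloc module can report the peak memory allocated while a function runs. This is a different technique from memory_profiler, so the numbers are not directly comparable with %memit. peak_bytes() is a hypothetical helper written for this sketch, and the list length is kept small so that it runs quickly.

```python
import random
import string
import tracemalloc

def randomword(length):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def list_strings_comprehension(length):
    list_strings = [randomword(5) for _ in range(length)]
    list_strings.sort()
    return list_strings

def list_strings_generator(length):
    return sorted(randomword(5) for _ in range(length))

def peak_bytes(func, *args):
    """Hypothetical helper: peak bytes allocated by Python while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

peak_list = peak_bytes(list_strings_comprehension, 10000)
peak_gen = peak_bytes(list_strings_generator, 10000)
print("comprehension peak:", peak_list, "bytes")
print("generator peak:    ", peak_gen, "bytes")
```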


The Weather Research and Forecasting model (WRF) can be initialised with a range of input data sources for simulations. The initialisation step sets the grid parameters within the model domain (pressure, surface variables, etc.) and defines the boundary conditions for the model. If you have followed the excellent WRF tutorial and run a few of the case studies with real data, the input data is provided for you and has already been tested to ensure it can be pre-processed relatively painlessly by WPS (the WRF Pre-processing System). Datasets from North American providers, e.g. GFS data (global) and AWIP data (North American continent area), are extensively tested with WRF.

I’ve recently begun using ECMWF data to initialise WRF simulations, and a few extra steps not shown in the standard tutorials were required to get the data pre-processed correctly by WPS. Certain ECMWF datasets, such as reanalysis data, can be accessed and downloaded freely from their public data portal. Global reanalysis data is available for the 20th century, and interim data for the most recent years. In this example I’m using the ERA-20C global reanalysis dataset to set up a case study of some severe storm events over Great Britain in 2005. (ERA-20C actually extends into the 21st century as well now.)

The data come in several different sets; as a minimum for initialising the WRF model you will need some surface variable data, and then pressure level data (pressures at different heights in the atmosphere). You could also use model level data directly, but WRF can interpolate this for you from the pressure level data. There is one other data set you need, which is a land-sea surface mask. This is called invariant data, as it does not change over time, and is found under the ‘invariant’ tab on the ECMWF page. Technically, WRF already has invariant data bundled with it, but I found I had to download the ECMWF land-sea mask separately for WPS to work correctly.

To summarise, you need three separate data files from the reanalysis data:

  1. Surface variable data
  2. Pressure level data (or model levels)
  3. Land-sea mask.

Downloading the data

You’ll be required to select which surface variables you want to download, as well as which pressure levels. The land-sea mask is just a single file. Although it is probably not the most efficient method, I tend to select all the surface variables for the surface data (you don’t actually need all of them to initialise the model, but I find it easier to download everything in case it’s required later). Once you’ve selected the date range and variables of interest, you can proceed directly to the download by clicking the GRIB or netCDF download buttons.

Downloading via Python script

ECMWF have provided a very handy Python API for downloading data without having to use the web interface, which can save time if you already know exactly which data fields you want. The Python API is well explained in the ECMWF documentation, so I will only summarise here what you need to do:

  1. Register on the ECMWF site
  2. Download an access key to be placed on the computer or system you are downloading to.
  3. Install the ecmwfapi Python module.
  4. Either write your own Python script using the API documentation above, or have the ECMWF web interface generate one for you. (Click ‘view MARS request’ after making your selections, and then copy the Python script that is displayed.)

An example Python script looks like this:

#!/usr/bin/env python
from ecmwfapi import ECMWFDataServer
server = ECMWFDataServer()
server.retrieve({
    "class": "e2",
    "dataset": "era20c",
    "date": "2009-06-01/to/2009-06-30",
    "expver": "1",
    "levtype": "sfc",
    "param": "15.128/16.128/17.128/18.128/31.128/32.128/33.128/34.128/35.128/36.128/37.128/38.128/39.128/40.128/41.128/42.128/53.162/54.162/55.162/56.162/57.162/58.162/59.128/59.162/60.162/61.162/62.162/63.162/64.162/65.162/66.128/66.162/67.128/67.162/68.162/69.162/70.162/71.162/72.162/73.162/74.162/75.162/76.162/77.162/78.128/78.162/79.128/79.162/80.162/81.162/82.162/83.162/84.162/85.162/86.162/87.162/88.162/89.162/89.228/90.162/90.228/91.162/92.162/131.228/132.228/134.128/136.128/137.128/139.128/141.128/148.128/151.128/159.128/164.128/165.128/166.128/167.128/168.128/170.128/174.128/183.128/186.128/187.128/188.128/198.128/206.128/229.128/230.128/231.128/232.128/235.128/236.128/238.128/243.128/244.128/245.128/246.228/247.228",
    "stream": "oper",
    "time": "00:00:00",
    "type": "an",
    "target": "CHANGEME",
})

ECMWF provide the data in two different file formats, GRIB (gridded-binary) and netCDF (.nc files). WPS comes with the ungribber tool (ungrib.exe) so I’ve gone for the grib data format here. (Selected by default in the Python download script).

Important: Retrieving the data in Gaussian gridded format

By default, the ECMWF site will download the data on a spherical harmonic grid, and the version of ungrib supplied in WPS v3.8.1 will not be able to decode this properly. (You may get the error Unknown ksec2(4): 50 in the ungrib error log, if you try this). Ungrib expects a regular Gaussian grid, and I couldn’t find a way to easily access this from the web portal interface. However, using the Python script above, you can easily request Gaussian gridded data by adding the parameter:

"grid": "160",

to the python download script, and your grib data will be supplied in Gaussian grid format. The CHANGEME value is the name of the downloaded file and you should probably change it to something meaningful. The python script will download the grib file to the same directory it is run in. For the example in this post, I ended up with three python scripts, one for the surface data, one for pressure levels, and one for the land-sea mask. (You could of course bundle them all in to one script).
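
One way to keep the separate download scripts consistent is to derive each request from a shared base dictionary. The sketch below only builds the surface request; build_request() and the target file name are hypothetical, the parameter list is shortened for readability, and the remaining keys are copied from the example script above. The pressure level and land-sea mask requests need their own levtype, param and date values, which are not shown in this post, so they are left out.

```python
def build_request(base, target, grid="160"):
    """Return a copy of base with the Gaussian grid and output file name set."""
    request = dict(base)        # copy, so the shared base is not modified
    request["grid"] = grid      # the extra "grid" parameter described above
    request["target"] = target  # name of the downloaded file
    return request

surface_base = {
    "class": "e2",
    "dataset": "era20c",
    "date": "2009-06-01/to/2009-06-30",
    "expver": "1",
    "levtype": "sfc",
    "param": "165.128/166.128/167.128",  # shortened list, for illustration only
    "stream": "oper",
    "time": "00:00:00",
    "type": "an",
}

# Hypothetical file name; choose something meaningful for your case study
surface_request = build_request(surface_base, "era20c_surface.grib")
print(surface_request["grid"], surface_request["target"])

# With the ecmwfapi module installed and an access key in place:
# from ecmwfapi import ECMWFDataServer
# ECMWFDataServer().retrieve(surface_request)
```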

Ungribbing the data

Now we need to ‘ungrib’ the data to convert it into the WPS intermediate file format, before running metgrid. This has to be done in two stages: one for the surface and pressure level data, and one for the land-sea mask. This is because the land-sea mask has a ‘start date’ of 1900-01-01; if you try to run ungrib on it with the date of your case study, ungrib will fail, complaining that the dates specified could not be found. For the land-sea mask, I set the &ungrib section of the namelist.wps file to the following:

&ungrib
  out_format = 'WPS',
  prefix = 'FIX',
/

Link your land-sea mask to the WPS directory with the link_grib.csh script and run ungrib. Repeat for the surface and pressure data, but with the following section in the namelist.wps file:

&ungrib
  out_format = 'WPS',
  prefix = 'FILE',
/

To be honest, you can use whatever file prefixes you like, but I like to use this naming convention.

Make sure you have linked the correct Vtable before running ungrib. The ECMWF Vtable supplied with WRF v3.8.1 worked fine without any modifications in this case. The Vtable can be found in WPS/ungrib/Variable_Tables/Vtable.ECMWF

Metgrid

Metgrid interpolates your ungribbed data files over the model domain. (I haven’t gone through the model domain generation stage with geogrid.exe as this blog post is only about prepping the input data.)

If you are using the sea surface temperature field from the ECMWF data, you’ll need to make a few changes to the METGRID.TBL file for it to correctly interpolate and mask the sea surface temperatures around land. In the METGRID.TBL file (located in WPS/metgrid/), change the entry for the SST field to the following:

SST
  interp_option=sixteen_pt+four_pt+wt_average_4pt+wt_average_16pt+search
  missing_value=-1.E30
  masked=land
  interp_mask=LANDMASK(1)
  fill_missing=0.
  flag_in_output=FLAG_SST 

The changes are to make the interp_mask use the LANDMASK mask instead of LANDSEA (the default), and to change the interpolation option slightly. Without the changes, I found that for my inner domain the sea surface temperatures were incorrectly masked, and had been interpolated over land as well. Although metgrid.exe did not complain when run, the met files generated caused an error when real.exe was run - generating an error message saying:

-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    2970
mismatch_landmask_ivgtyp
-------------------------------------------

The changes to METGRID.TBL should remedy this error message.

Before running metgrid, there is one last change to make to the namelist.wps file:

&metgrid
  fg_name = 'FILE',
  constants_name = 'FIX:1900-01-01_00',
/

This tells metgrid to use the invariant data we downloaded earlier as a constant field (i.e. the land-sea mask doesn’t change over time, so it doesn’t need to be interpolated for each period). Use whichever name you have used for this invariant file.
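
The constants_name value follows the naming pattern of the WPS intermediate files written by ungrib, PREFIX:YYYY-MM-DD_HH. As a small sketch, the name can be built from the prefix and the date of the invariant data; intermediate_name() is a hypothetical helper written for illustration, not part of WPS.

```python
from datetime import datetime

def intermediate_name(prefix, when):
    """Build a file name of the form PREFIX:YYYY-MM-DD_HH."""
    return "{}:{}".format(prefix, when.strftime("%Y-%m-%d_%H"))

# The land-sea mask used in this post has a start date of 1900-01-01
constants_name = intermediate_name("FIX", datetime(1900, 1, 1, 0))
print(constants_name)  # FIX:1900-01-01_00
```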

Now you can run metgrid. Hopefully it will produce all the input met files needed, and correctly interpolated. It’s a good idea before running real.exe and wrf.exe to check that the fields look reasonable. In particular, check the SST field if you are using sea-surface temperatures. Check the nested domains as well - I found that my SST field had been incorrectly interpolated in the inner domain, which required the change to the METGRID.TBL file above. ncview is a useful utility for checking the met files generated by metgrid.

Real and WRF

You are now set to run real.exe to generate the lateral boundary conditions and initialise the model, followed (finally) by wrf.exe to run the simulation.

Sources

The following discussion boards and mailing list answers were helpful in preparing this post:

Also two useful blog posts from NCAR: “Analysis, Forecast, Reanalysis, What’s the difference?” and “WRF-able datasets”.


OpenMP 4.5 has some interesting new features, including the potentially useful reduction on arrays for C and C++ (this was previously supported for Fortran only).

You can now perform summations of arrays using the reduction clause with OpenMP 4.5.

Reductions can be done on variables using syntax such as:

double total_sum = 0.0;
int imax = 100;

#pragma omp parallel for reduction(+:total_sum)
for (int i=0; i<imax; i++)
{
  total_sum += 42;
}

So each thread gets its own copy of total_sum, and at the end of the parallel for region, all the local copies of total_sum are summed up to get the grand total.

Suppose you have an array and you want a separate summation for each array element (not a single sum over the entire array). Previously, you would have to implement this manually.

#include <iostream>

int main()
{
      
  int myArray[6] = {};

#pragma omp parallel
{
  int private_myArray[6] = {};
  #pragma omp for 
  for (int i=0; i<50; ++i)
  {
    double a = 2.0; // Or something non-trivial justifying the parallelism...
    for (int n = 0; n<6; ++n)
    {
      private_myArray[n] += a;
    }
  }
  
  #pragma omp critical
  for (int n = 0; n<6; ++n)
  {
    myArray[n] += private_myArray[n];
  }
}

  // Print the array elements to see them summed   
  for (int n = 0; n<6; ++n)
  {
    std::cout << myArray[n] << " " << std::endl;
  } 
}

Whereas now in OpenMP 4.5 you can do

#include <iostream>

int main()
{

  int myArray[6] = {};

  #pragma omp parallel for reduction(+:myArray[:6])
  for (int i=0; i<50; ++i)
  {
    double a = 2.0; // Or something non-trivial justifying the parallelism...
    for (int n = 0; n<6; ++n)
    {
      myArray[n] += a;
    }
  }
  // Print the array elements to see them summed   
  for (int n = 0; n<6; ++n)
  {
    std::cout << myArray[n] << " " << std::endl;
  } 
}

Outputs:

    100
    100
    100
    100
    100
    100

I compiled this with GCC 6.2. You can see which common compiler versions support the OpenMP 4.5 features here: http://www.openmp.org/resources/openmp-compilers/
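
For readers less familiar with OpenMP, the semantics of the array reduction can be illustrated outside C++. The following is a sketch in Python, not OpenMP: each worker fills its own private array, and the private arrays are then combined element by element, which mirrors the manual version shown earlier. The number of workers and the way iterations are divided between them are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

N_ELEMENTS = 6
N_ITERATIONS = 50
N_WORKERS = 4  # arbitrary choice for this sketch

def worker(iterations):
    """Accumulate into a private array, like one OpenMP thread would."""
    private = [0] * N_ELEMENTS
    for _ in iterations:
        a = 2  # or something non-trivial justifying the parallelism
        for n in range(N_ELEMENTS):
            private[n] += a
    return private

# Share the 50 iterations out between the workers
chunks = [range(w, N_ITERATIONS, N_WORKERS) for w in range(N_WORKERS)]

with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    private_arrays = list(pool.map(worker, chunks))

# The reduction step: sum the private copies element by element
my_array = [sum(column) for column in zip(*private_arrays)]
print(my_array)  # [100, 100, 100, 100, 100, 100]
```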


This post documents how to compile the latest release of WRF (version 3.8.1) on the ARCHER HPC service. There is already a compiled version available on ARCHER that can be accessed using the modules function (see this guide), but you will need to compile from the source code if you are making any modifications to the code, if you need to compile the idealised cases, or if you need a different nesting set-up. (The pre-compiled code is set up for basic nesting only.)

Compilation with the default ARCHER compiler (Cray CCE)

This is relatively straightforward as the configure script already works as expected. However, compiling with the Cray compiler (usually the default loaded compiler when you login to ARCHER) can take upwards of 6-8 hours, depending on the options selected in configure, and the load on the login or serial nodes.

Setting up your ARCHER environment

First, check that you do actually have the Cray compiler environment loaded. You can look for PrgEnv-cray in the output from running the module list command, or you can do echo $PE_ENV from a login session.

Because it takes so long to compile using the Cray compiler, you need to run the compilation task as a serial job on the serial nodes. If you attempt to run on the login nodes, the compilation process will time out well before completion.

So we are going to prepare three things:

  1. A ‘pre build’ script that will load the correct modules on ARCHER and set up the environment variables
  2. The configure.wrf file. This is prepared using the configure script, so you should not need to do anything different to the normal WRF compilation instructions for this part.
  3. A compilation job submission (.pbs) script. This will be submitted as a serial node job to do the actual compilation bit.

The pre-build script

This is a shell script (of the bash flavour) that loads the relevant modules for WRF to compile.

pre-build.bash

# Set the cray compiler wrappers. Cray uses the same wrappers,
# regardless of which compiler you actually have loaded, be it
# gfortran, intel, or Cray's own compiler
export CC=cc FC=ftn F77=ftn F90=ftn CXX=CC

module load cray-libsci

# If you are compiling wrf with the netcdf 4 parallel support
module load cray-netcdf-hdf5parallel
# If you are just using standard netcdf 4:
# module load cray-netcdf

module load libpng
module load jasper
module load ncl

# This is needed for some other tools that are built, note,
# it does not overwrite the Cray compiler option for the 
# main bulk of the code's compilation
module load gcc/6.1.0

# continues...

NetCDF-4 is the default netcdf module on the ARCHER environment, so I am assuming you want to compile it with netcdf-4 (it has extra features like supporting file compression etc…)

I attempted this with the gcc/5.x.x module, but ran into compilation errors. Using GCC v6 seemed to fix them.

Note that you may need to switch a load statement for a swap statement in some places, depending on what default modules are loaded in your ARCHER environment. See your .profile and .bashrc scripts.

NOTE: .profile (in the $HOME directory) is only loaded by login shells. If you launch any other interactive shell (like an interactive job mode), then .bashrc will get loaded.

pre-build.bash (continued)

# ...continued

# This next bit sets the environment variables used in the WRF configure script.
# These must be set up to match the modules you have loaded above. If you haven't
# loaded the correct module, the environment variable will not work.

export FORTRAN_COMPILER_TIMER='time -p'
# This sets compilation to take place in serial, using only one thread.
export J='-j 1'

export NETCDF=$NETCDF_DIR
export HDF5=$HDF5_DIR

# large file support seems to be set by default in the configure.wrf
# file that is generated but explicitly set it anyway
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
# Parallel netCDF is experimental.  Using Parallel netCDF with the
# correct Lustre striping make an enormous difference to I/O time.
# Try a striping factor like SQRT(numprocs) for best results.
# PARALLEL_NETCDF_DIR is set in cray-parallel-netcdf.
export PNETCDF=$PARALLEL_NETCDF_DIR
# use netCDF-4 compression (note that Parallel netCDF is compatible
# with netCDF-4 since November 2013)
export NETCDF4=1

# WPS
# JASPER_DIR is set in jasper.  Note that Jasper isn't required for
# WRF, only for WPS.
export JASPERINC=$JASPER_DIR/include
export JASPERLIB=$JASPER_DIR/lib

# WRF
# Choose the core in the build script
export WRF_EM_CORE=1
export WRF_NMM_CORE=0
export WRF_DA_CORE=0

# WRF-Chem (this does NOT work with shared memory parallelism, see the
# WRF-Chem User's Guide Section 2.2.3
# http://ruc.noaa.gov/wrf/WG11/Users_guide.pdf)
# Choose Chem in the build script
export WRF_CHEM=0
# start of KPP stuff
export WRF_KPP=0
# byacc is needed

Some things to note:

  1. You cannot compile WRF in parallel (it’s a bug in the Makefile, apparently).
  2. There are some environment variables that are just set equal to some other environment variables, e.g. NETCDF=$NETCDF_DIR. This works because when you use the module system to load, say, netCDF, ARCHER will automatically set its own environment variables that we can use to initialise the WRF configure variables, e.g. $NETCDF.
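
Because a variable such as NETCDF ends up empty if the corresponding module was not loaded, it can be worth checking the environment before running configure. The following is a minimal sketch in Python: missing_variables() is a hypothetical helper, the list of names is taken from the pre-build script above, and it assumes each variable should point at an existing directory.

```python
import os

# Variables set in the pre-build script that should point at directories
REQUIRED = ["NETCDF", "HDF5", "PNETCDF", "JASPERINC", "JASPERLIB"]

def missing_variables(names, environ=None):
    """Return the names whose value is unset, empty or not a directory."""
    if environ is None:
        environ = os.environ
    return [name for name in names
            if not os.path.isdir(environ.get(name, ""))]

missing = missing_variables(REQUIRED)
if missing:
    print("Check these variables:", ", ".join(missing))
else:
    print("All WRF build variables point at existing directories.")
```

Checking for a directory rather than just a non-empty value also catches cases like JASPERINC=$JASPER_DIR/include, which becomes /include when JASPER_DIR is not set.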

Run configure

For the CRAY compiler, this can be run as normal e.g. ./configure from the WRFV3 directory.

Compilation job script

At this stage, you can either request an interactive mode job on the serial nodes, and then run compile in the usual way (after running the prebuild script and the configure commands), or you can submit a serial job with the PBS job scheduling system to run when a node becomes available. If you are going down the interactive job mode, be sure to request enough walltime as the Cray compiler takes a long time to compile everything. I would ask for 12 hours to be on the safe side.

If you want to submit a job to run without having to wait for an interactive-mode job, prepare the following job submission script:

#!/bin/bash --login

# This script needs to be qsubbed in the build directory.

#PBS -q standard
#PBS -N CrayWRF_build
#PBS -l select=serial=true:ncpus=1
#PBS -l walltime=12:00:00
#PBS -A [YOUR ACCOUNT BUDGET CODE]
#PBS -V

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR/WRF_build381

# The configuration has to be done already, on the login nodes, and
# the PBS -V directive (see above) gets the environment that has been
# set up.  The compilation takes about 1 hour ('make -j 1') on a
# serial node.

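# However, I don't trust the -V mode, so I run the environment set
# up script again. It must be sourced (not executed) so that the
# loaded modules and exported variables apply to this shell:

source ./pre-build.bash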

# The compile step is run on the serial nodes because the compile
# takes so long, and some optimisation steps take so long that the
# /tmp directory is emptied by the system during the optimisations,
# giving 'file not found' errors.  This seems to be occurring on the
# serial nodes as well now so $TMPDIR is set; this may affect Cray
# Fortran OPEN statements for scratch files but there are none of
# these in WRF, WPS or Chem.

mkdir -p tmp
export TMPDIR=$PWD/tmp

# WRF; ARW core
(
    cd WRFV3
    ./compile em_real &> compileCray.log
)

# WPS
(
    cd WPS
    ./compile &> compileCray.log
)

unset TMPDIR
rm -rf tmp

Putting it all together.

  1. Make sure the pre-build.bash script and the compile.pbs script are in a directory above /WRFV3. I called it WRF_build381

  2. Use qsub as normal to submit the compile.pbs script. E.g. qsub compile.pbs

Your job should run and the compile logs will be written to compileCray.log (or whatever you named them in the compile.pbs script above).
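
Once the job has finished, a quick way to see whether the build succeeded is to check that the executables exist. The sketch below assumes that the em_real build places wrf.exe and real.exe in WRFV3/main; missing_executables() is a hypothetical helper, and the path is relative to the build directory, so adjust it to your own layout.

```python
import os

def missing_executables(main_dir, names=("wrf.exe", "real.exe")):
    """Return the executables from names that are not present in main_dir."""
    return [name for name in names
            if not os.path.isfile(os.path.join(main_dir, name))]

# Assumed location of the executables, relative to the build directory
missing = missing_executables(os.path.join("WRFV3", "main"))
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("wrf.exe and real.exe were built.")
```

If anything is missing, the compile log is the place to look for the cause.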

Compiling WRF using the GNU compilers on ARCHER

You may have reason to want to compile WRF with the GNU compilers on ARCHER or another Cray XC30 system. Unfortunately, I found that the configure script supplied with version 3.8.1 did not generate a correct configure.wrf file for the GNU compilers in a Cray environment. Namely, it used compilation flags specific to the Cray compiler rather than the gfortran compilation flags (the two are incompatible). To rectify this, you can either run the configure script as normal and then correct the compiler flags in the configure.wrf file that is generated, or, if you want a more reusable solution, edit the WRFV3/arch/configure_new.defaults file.

I did this by opening the configure_new.defaults file and adding a new entry. The purpose of the file is to generate the menu entries that you see when running the configure script, and then populate the Makefile with the correct compilation options.

Find the CRAY CCE entry in the configure_new.defaults file and insert a new entry below it called GNU on CRAY XC30 system or similar. The entry should contain the following:

###########################################################
#ARCH    Cray XE and XC CLE/Linux x86_64, GNU Compiler on Cray System # serial dmpar smpar dm+sm
# Use this when you are using the GNU programming environment on ARCHER (a Cray system)

DESCRIPTION     =       GNU on Cray system ($SFC/$SCC): Cray XE and XC
# OpenMP is enabled by default for Cray CCE compiler
# This turns it off
DMPARALLEL      =       # 1
OMPCPP          =       # -D_OPENMP
OMP             =       # -fopenmp
OMPCC           =       # -fopenmp
SFC             =       ftn
SCC             =       cc
CCOMP           =       gcc
DM_FC           =       ftn
DM_CC           =       cc
FC              =       CONFIGURE_FC
CC              =       CONFIGURE_CC
LD              =       $(FC)
RWORDSIZE       =       CONFIGURE_RWORDSIZE
PROMOTION       =       #-fdefault-real-8
ARCH_LOCAL      =       -DNONSTANDARD_SYSTEM_SUBR  -DWRF_USE_CLM
CFLAGS_LOCAL    =       -O3
LDFLAGS_LOCAL   =
CPLUSPLUSLIB    =
ESMF_LDFLAG     =       $(CPLUSPLUSLIB)

FCOPTIM         =       -O2 -ftree-vectorize -funroll-loops
FCREDUCEDOPT    =       $(FCOPTIM)
FCNOOPT         =       -O0
FCDEBUG         =       # -g $(FCNOOPT) # -ggdb -fbacktrace -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow
FORMAT_FIXED    =       -ffixed-form
FORMAT_FREE     =       -ffree-form -ffree-line-length-none
FCSUFFIX        =
BYTESWAPIO      =       -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G =       -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS      =       $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =
TRADFLAG        =      -traditional
CPP             =      /lib/cpp -P
AR              =      ar
ARFLAGS         =      ru
M4              =      m4 -G
RANLIB          =      ranlib
RLFLAGS         =
CC_TOOLS        =      gcc


###########################################################

You can run the configure script as normal once these changes have been made and you will get a configure.wrf suitable for using the GNU compilers on ARCHER to build WRF v3.8.1.


This post documents how I set up an NVIDIA CUDA GPU card on linux, specifically for CUDA computing (i.e. using it solely for GPGPU purposes). I already had a separate (AMD) graphics card I used for video output, and I wanted the NVIDIA card to be used only for computation, with no video use.

Time for some PyCUDA...after hours trying to set the thing up

I found the whole process of setting up this card under linux to be problematic. NVIDIA’s own guidance on their website did not seem to work (OK, in fairness, you can sort of figure it out from the expanded installation guide pdf), and in the end it required patching together bits of information from different sources. Here it is for future record:

  • Card: NVIDIA Quadro K1200 (PNY Low profile version)
  • Computer: HP desktop, integrated graphics on motherboard (Disabled in BIOS - though in hindsight I don’t know if this was really necessary.)
  • Linux version(s): Attempted on Fedora 23 (FAIL), Scientific Linux 6.7 (OK), CentOS 7.2 (OK). Officially, the installation scripts/packages only support Fedora 21, but I thought I would give it a try at least.

1st Attempt: Using the NVIDIA rpm package (FAIL)

This is the recommended installation route from NVIDIA. Basically you download the relevant package manager install package. I was using CentOS so downloaded the RHEL/CentOS 7 .rpm file. You then add this to your package manager (e.g. yum). For RHEL/CentOS, you must have the epel-release repository enabled in yum:

yum install epel-release

Then you add the rpm package downloaded from nvidia:

rpm --install cuda-repo<...>.rpm

Followed by:

yum clean expire-cache
yum install cuda

It will install a load of package dependencies, the CUDA package, as well as the proprietary NVIDIA drivers for the card. I rebooted, only to find I could no longer launch CentOS in graphical mode. It would hang when trying to load the X-server driver files on boot. Only a text interface login was possible. Further playing around with the linux system logs showed there was a conflict with some of the OpenGL X11 libraries being loaded.

I reverted to the earlier working state by launching in text mode and using yum history undo to revert all the installed packages in the previous step.

2nd Attempt: Using the NVIDIA runfile shell script (SUCCESS)

A second alternative is provided by NVIDIA, involving a shell script that installs the complete package as a platform-independent version. It bypasses the package manager completely and installs the relevant headers and drivers “manually”. NVIDIA don’t recommend this unless they don’t supply a ready-made package for your OS, but I had already tried packages for Scientific Linux/RedHat, CentOS, and Fedora without success.

Before you go anywhere near the NVIDIA runfile.sh script, you have to blacklist the nouveau drivers that may be installed. These are open source drivers for NVIDIA cards, but will create conflicts if you try to use them alongside the proprietary NVIDIA ones.

You blacklist them by adding a blacklist file to the modprobe folder, which controls which drivers load at the linux boot-up.

vim /etc/modprobe.d/blacklist-nouveau.conf

Add the following lines:

blacklist nouveau
options nouveau modeset=0

Now rebuild the initramfs (the image loaded at startup) with:

dracut --force

Now the computer has to be restarted in text mode. The install script cannot be run if the desktop or X server is running. To do this I temporarily disabled the graphical/desktop service from starting up using systemctl, like this:

systemctl set-default multi-user.target

Then reboot. You’ll be presented with a text-only login interface. First check that the nouveau drivers haven’t been loaded:

lsmod | grep nouveau

This should return nothing. If you get any reference to nouveau in the output, something has gone wrong when you tried to blacklist the drivers. Onwards…
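
The same check can be scripted. lsmod formats the contents of /proc/modules, where each line starts with the module name, so a minimal Python sketch only needs to look at the first field of each line. module_loaded() and read_proc_modules() are hypothetical helpers, and the sample text is made up for illustration. Unlike grep, this matches the exact module name only.

```python
def module_loaded(name, proc_modules_text):
    """Return True if a kernel module called name is listed in the text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False

def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of loaded modules (Linux only)."""
    with open(path) as handle:
        return handle.read()

# Made-up sample in the style of /proc/modules, for illustration only
sample = "nvidia 100000 0 - Live 0x0\nsnd 5000 1 - Live 0x0\n"
print(module_loaded("nouveau", sample))  # False

# On the machine itself:
# print(module_loaded("nouveau", read_proc_modules()))
```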

Navigate to your NVIDIA runfile script after logging in. Stop there.

Buried in the NVIDIA documentation is an important bit of information if you are planning on running the GPU for CUDA processing only, i.e. a separate, standalone card for GPGPU use, with another card for your video output. They note that installing the OpenGL library files can cause conflicts with the X-window server (now they tell us!), but an option flag will disable their installation. Run the install script like so:

sh cuda_<VERSION>.run --no-opengl-libs

The option at the end is critical for it to work. I missed it off during one previous failed attempt and couldn’t properly uninstall what I had done. The runfile does have an --uninstall option, but it’s not guaranteed to undo everything.

You’ll be presented with a series of text prompts. Read them, but I ended up selecting ‘yes’ to most questions and accepting the default paths. Obviously you should make amendments for your own system. I would recommend installing the sample programs when it asks, so you can check that the installation has worked and the card works as expected.

After that has all finished, you need to set some environment paths in your .bash_profile file. Add the following:

export PATH=$PATH:/usr/local/cuda-7.5/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.5/lib64

If you have changed the default paths during the installation process, amend the above lines to the paths you entered in place of the defaults.
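
After opening a new shell, a quick way to confirm that the PATH change has taken effect is to look for the CUDA compiler, nvcc, which lives in the CUDA bin directory. A minimal Python 3 sketch, where find_tool() is a hypothetical helper:

```python
import shutil

def find_tool(name="nvcc"):
    """Return the full path of name if it is on PATH, otherwise None."""
    return shutil.which(name)

location = find_tool()
if location:
    print("nvcc found at", location)
else:
    print("nvcc not found on PATH; check your .bash_profile")
```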

Now, you have to remember to restore the graphical/desktop service during boot up. (Assuming you used the systemctl method above). Restore with:

systemctl set-default graphical.target

Then reboot. It should work, hopefully!

Assuming you can log in to your desktop without problems, you can double check that the card is running fine and can execute CUDA applications by compiling one of the handy sample CUDA applications, called deviceQuery. Navigate to the path where you installed the sample CUDA programs, go into the utilities folder, then into deviceQuery, and run make. You will get an application called deviceQuery that prints out lots of information about your CUDA graphics card. There are loads of other sample applications (less trivial than this one) that you can also compile and test in these folders.

Remember, if you have followed the above steps, you can only use your CUDA card for computation, not graphical output.