OpenMP 4.5 introduces some interesting new features, including the potentially useful reduction over arrays for C and C++ (this was previously supported for Fortran only).
You can now perform summations of arrays using the reduction clause with OpenMP 4.5.
Reductions can be done on variables using syntax such as:
So each thread gets its own copy of total_sum, and at the end of the parallel for region, all the local copies of total_sum are summed up to get the grand total.
Suppose instead you want a reduction over each element of an array (not a single scalar). Previously, you would have to implement this manually.
Whereas now, in OpenMP 4.5, you can do:
I compiled this with GCC 6.2. You can see which common compiler versions support the OpenMP 4.5 features here: http://www.openmp.org/resources/openmp-compilers/
This post documents how to compile the latest release of WRF (version 3.8.1) on the ARCHER HPC service. There is already a compiled version available on ARCHER that can be accessed via the module system (see this guide), but you will need to compile from source if you are making any modifications to the code, if you need to compile the idealised cases, or if you need a different nesting setup. (The pre-compiled code is set up for basic nesting only.)
Compilation with the default ARCHER compiler (Cray CCE)
This is relatively straightforward as the configure script already works as expected. However, compiling with the Cray compiler (usually the default loaded compiler when you login to ARCHER) can take upwards of 6-8 hours, depending on the options selected in configure, and the load on the login or serial nodes.
Setting up your ARCHER environment
First, check that you do actually have the Cray compiler environment loaded. You can look for PrgEnv-cray in the output of the module list command, or run echo $PE_ENV from a login session.
Because it takes so long to compile using the Cray compiler, you need to run the compilation task as a serial job on the serial nodes. If you attempt to run on the login nodes, the compilation process will time out well before completion.
So we are going to prepare three things:
- A ‘pre build’ script that will load the correct modules on ARCHER and set up the environment variables
- The configure.wrf file. This is prepared using the configure script, so you should not need to do anything different from the normal WRF compilation instructions for this part.
- A compilation job submission (.pbs) script. This will be submitted as a serial-node job to do the actual compilation.
The pre-build script
This is a shell script (of the bash flavour) that loads the relevant modules for WRF to compile.
NetCDF-4 is the default netCDF module in the ARCHER environment, so I am assuming you want to compile with netCDF-4 (it has extra features like support for file compression, etc.).
I attempted this with the gcc/5.x.x module, but ran into compilation errors. Using GCC v6 seemed to fix them.
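The pre-build script itself has not survived in this copy of the post, but based on the description it would look something like the sketch below (the exact module names and versions are assumptions; check module avail on your system):

```shell
#!/bin/bash
# pre-build.bash -- load modules and set environment for the WRF build.
# Module names/versions below are examples and may not match your ARCHER setup.
module load cray-netcdf    # NetCDF-4 is the ARCHER default
module load cray-hdf5      # needed for the netCDF-4 features
module swap gcc gcc/6.2.0  # use swap if an older gcc module is already loaded

# WRF's configure reads NETCDF; ARCHER's module system sets NETCDF_DIR for us
export NETCDF=$NETCDF_DIR
```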
Note that you may need to switch a load statement for a swap statement in some places, depending on which default modules are loaded in your ARCHER environment. Also note that .profile (in your $HOME directory) is only sourced by login shells; if you launch any other interactive shell (such as an interactive-mode job), .bashrc will be sourced instead.
Some things to note:
- You cannot compile WRF in parallel (it's a bug in the Makefile, apparently).
- Some environment variables are just set equal to other environment variables, e.g. NETCDF=$NETCDF_DIR. This works because when you use the module system to load, say, netCDF, ARCHER automatically sets its own environment variables, which we can use to initialise the WRF configure variables.
For the Cray compiler, configure can be run as normal, e.g. ./configure from the WRFV3 directory.
Compilation job script
At this stage, you can either request an interactive-mode job on the serial nodes and then compile in the usual way (after running the pre-build script and the configure commands), or you can submit a serial job via the PBS job scheduler to run when a node becomes available. If you go the interactive route, be sure to request enough walltime, as the Cray compiler takes a long time to compile everything; I would ask for 12 hours to be on the safe side.
If you want to submit a job to run without having to wait for an interactive-mode job, prepare the following job submission script:
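The job script itself is missing from this copy of the post; a sketch of what a serial-node compile.pbs might contain is below (the job name, budget code, and compile target are placeholders -- adapt them to your own project):

```shell
#!/bin/bash --login
#PBS -N wrf_compile                     # job name (placeholder)
#PBS -l select=serial=true:ncpus=1      # run on the ARCHER serial nodes
#PBS -l walltime=12:00:00               # the Cray compile can take 6-8+ hours
#PBS -A budget-code                     # your ARCHER budget code (placeholder)

cd $PBS_O_WORKDIR
source ./pre-build.bash                 # load modules / set environment variables
cd WRFV3
./compile em_real &> compileCray.log    # 'em_real' is an example compile target
```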
Putting it all together.
Make sure the pre-build.bash script and the compile.pbs script are in the directory above /WRFV3.
Use qsub as normal to submit the compile.pbs script, e.g. qsub compile.pbs
Your job should run and the compile logs will be written to compileCray.log (or whatever you named them in the compile.pbs script above).
Compiling WRF using the GNU compilers on ARCHER
You may have reason to want to compile WRF with the GNU compilers on ARCHER or another Cray XC30 system. Unfortunately, I found that the configure script supplied with version 3.8.1 did not generate a correct configure.wrf file for the GNU compilers in a Cray environment: it used compilation flags specific to the Cray compiler rather than the gfortran flags (which are incompatible). To rectify this, you can either run the configure script as normal and then correct the compiler flags in the generated configure.wrf file, or, for a more re-usable solution, you can edit the defaults file in the arch/ directory.
I did this by opening the configure_new.defaults file and adding a new entry. The purpose of this file is to generate the menu entries you see when running the configure script, and then populate the Makefile with the correct compilation options.
Find the Cray CCE entry in the configure_new.defaults file and insert a new entry below it called "GNU on Cray XC30 system" or similar. The entry should contain the following:
```
###########################################################
#ARCH Cray XE and XC CLE/Linux x86_64, GNU Compiler on Cray System # serial dmpar smpar dm+sm
# Use this when you are using the GNU programming environment on ARCHER (a Cray system)
DESCRIPTION     = GNU on Cray system ($SFC/$SCC): Cray XE and XC
# OpenMP is enabled by default for Cray CCE compiler
# This turns it off
DMPARALLEL      = # 1
OMPCPP          = # -D_OPENMP
OMP             = # -fopenmp
OMPCC           = # -fopenmp
SFC             = ftn
SCC             = cc
CCOMP           = gcc
DM_FC           = ftn
DM_CC           = cc
FC              = CONFIGURE_FC
CC              = CONFIGURE_CC
LD              = $(FC)
RWORDSIZE       = CONFIGURE_RWORDSIZE
PROMOTION       = #-fdefault-real-8
ARCH_LOCAL      = -DNONSTANDARD_SYSTEM_SUBR -DWRF_USE_CLM
CFLAGS_LOCAL    = -O3
LDFLAGS_LOCAL   =
CPLUSPLUSLIB    =
ESMF_LDFLAG     = $(CPLUSPLUSLIB)
FCOPTIM         = -O2 -ftree-vectorize -funroll-loops
FCREDUCEDOPT    = $(FCOPTIM)
FCNOOPT         = -O0
FCDEBUG         = # -g $(FCNOOPT) # -ggdb -fbacktrace -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow
FORMAT_FIXED    = -ffixed-form
FORMAT_FREE     = -ffree-form -ffree-line-length-none
FCSUFFIX        =
BYTESWAPIO      = -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G = -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS      = $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =
TRADFLAG        = -traditional
CPP             = /lib/cpp -P
AR              = ar
ARFLAGS         = ru
M4              = m4 -G
RANLIB          = ranlib
RLFLAGS         =
CC_TOOLS        = gcc
###########################################################
```

(Note: the original flattened listing also contained a second, Cray-specific FCBASEOPTS_NO_G line using the -N1023 flag; since that would override the gfortran options above with incompatible Cray flags, it is omitted here.)
Once these changes have been made, you can run the configure script as normal and you will get a configure.wrf suitable for building WRF v3.8.1 with the GNU compilers on ARCHER.
This post documents how I set up an NVIDIA CUDA GPU card on linux, specifically for CUDA computing (i.e. using it solely for GPGPU purposes). I already had a separate (AMD) graphics card I used for video output, and I wanted the NVIDIA card to be used only for computation, with no video use.
I found the whole process of setting up this card under linux to be problematic. NVIDIA’s own guidance on their website did not seem to work (OK, in fairness, you can sort of figure it out from the expanded installation guide pdf), and in the end it required patching together bits of information from different sources. Here it is for future record:
- Card: NVIDIA Quadro K1200 (PNY Low profile version)
- Computer: HP desktop, integrated graphics on motherboard (Disabled in BIOS - though in hindsight I don’t know if this was really necessary.)
- Linux version(s): Attempted on Fedora 23 (FAIL), Scientific Linux 6.7 (OK), CentOS 7.2 (OK). Officially, the installation scripts/packages only support Fedora 21, but I thought I would give it a try at least.
1st Attempt: Using the NVIDIA rpm package (FAIL)
This is the recommended installation route from NVIDIA. Basically, you download the relevant package-manager install package; I was using CentOS, so I downloaded the RHEL/CentOS 7 .rpm file. You then add this to your package manager (e.g. yum). For RHEL/CentOS, you must have the epel-release repository enabled in yum:
yum install epel-release
Then you add the rpm package downloaded from nvidia:
rpm --install cuda-repo<...>.rpm
It will install a load of package dependencies, the CUDA package, as well as the proprietary NVIDIA drivers for the card. I rebooted, only to find I could no longer launch CentOS in graphical mode. It would hang when trying to load the X-server driver files on boot. Only a text interface login was possible. Further playing around with the linux system logs showed there was a conflict with some of the OpenGL X11 libraries being loaded.
I reverted to the earlier working state by launching in text mode and using yum history undo to revert all the packages installed in the previous step.
2nd Attempt: Using the NVIDIA runfile shell script (SUCCESS)
A second alternative is provided by NVIDIA, involving a shell script that installs the complete package as a platform-independent version. It bypasses the package manager completely and installs the relevant headers and drivers “manually”. NVIDIA don’t recommend this unless they don’t supply a ready-made package for your OS, but I had already tried packages for Scientific Linux/RedHat, CentOS, and Fedora without success.
Before you go anywhere near the NVIDIA runfile script, you have to blacklist the nouveau drivers that may be installed. These are open-source drivers for NVIDIA cards, but they will create conflicts if you try to use them alongside the proprietary NVIDIA ones.
You blacklist them by adding a blacklist file to the modprobe folder, which controls which drivers are loaded at boot.
Add the following lines:
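The lines themselves are missing from this copy of the post; the standard nouveau blacklist (as given in NVIDIA's Linux installation guide) is:

```shell
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```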
Now rebuild the startup script with:
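The command is missing here; on RHEL-family distributions (CentOS, Scientific Linux, Fedora) the initial ramdisk is rebuilt with dracut, along these lines:

```shell
# Regenerate the initramfs so the nouveau blacklist takes effect at boot
sudo dracut --force
```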
Now the computer has to be restarted in text mode; the install script cannot be run if the desktop or X server is running. To do this I temporarily disabled the graphical/desktop target from starting up using systemctl, like this:
systemctl set-default multi-user.target
Then reboot. You’ll be presented with a text-only login interface. First check that the nouveau drivers haven’t been loaded:
lsmod | grep nouveau should return nothing. If you get any reference to nouveau in the output, something has gone wrong when you tried to blacklist the drivers. Onwards…
Navigate to your NVIDIA runfile script after logging in. Stop there.
Buried in the NVIDIA documentation is an important bit of information if you are planning on running the GPU for CUDA processing only, i.e. a separate, standalone card for GPGPU use, with another card for your video output. They note that installing the OpenGL library files can cause conflicts with the X window server (now they tell us!), but an option flag will disable their installation. Run the install script like so:
sh cuda_<VERSION>.run --no-opengl-libs
The option at the end is critical for this to work. I left it out during one previous failed attempt and couldn't properly uninstall what I had done. The runfile does have an --uninstall option, but it's not guaranteed to undo everything.
You’ll be presented with a series of text prompts. Read them; I ended up selecting ‘yes’ to most questions and accepting the default paths. Obviously you should make amendments for your own system. I would recommend installing the sample programs when asked, so you can check the installation has worked and the card behaves as expected.
After that has all finished, you need to set some environment paths in your .bash_profile file. Add the following:
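The exact lines are missing from this copy; assuming the default install location (which also gets a /usr/local/cuda symlink if you accept it during installation), they would be along these lines:

```shell
# Appended to ~/.bash_profile; adjust if you chose non-default paths
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```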
If you changed the default paths during the installation process, amend the above lines to the paths you entered in place of the defaults.
Now, you have to remember to restore the graphical/desktop service during boot up. (Assuming you used the systemctl method above). Restore with:
systemctl set-default graphical.target
Then reboot. It should work, hopefully!
Assuming you can log in to your desktop without problems, you can double-check that the card is running fine and can execute CUDA applications by compiling one of the handy sample CUDA applications, deviceQuery. Navigate to the path where you installed the sample CUDA programs, go into the utilities folder, then into deviceQuery, and run make. You will get an application called deviceQuery that prints out lots of information about your CUDA graphics card. There are plenty of other (less trivial) sample applications that you can also compile and test in these folders.
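Assuming you accepted the default samples location, the check looks something like this (the samples directory name varies by CUDA version, so the path here is an example):

```shell
cd ~/NVIDIA_CUDA_Samples/1_Utilities/deviceQuery   # example path
make
./deviceQuery    # prints device name, CUDA capability, memory, etc.
```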
Remember, if you have followed the above steps, you can only use your CUDA card for computation, not graphical output.
I run LSDCatchmentModel (soon to be released as the HAIL-CAESAR package…) on the ARCHER supercomputing facility on single compute nodes, i.e. one instance of the program per node, using a shared-memory parallelisation model (OpenMP). Recently, I've been trying to find the optimum setup of CPUs/cores/threads etc. per node (while trying not to spend too much time on it!). Here are some of the notes:
I will write this up as a more detailed post later, but all these tests were done using the Cray compiler, a licence for which is available on the ARCHER HPC. In general I've found this offers better performance than the Intel and GNU compilers, but more investigation is warranted.
The executable LSDCatchmentModel.out was compiled with the -O2 level of optimisation and the -std=c++11 compiler flag.
ARCHER compute nodes
Each node on ARCHER consists of two Intel Xeon processors, each with 32GB of memory. (So 64GB in total for the whole node, which either processor can access). Each one of these CPUs, with its corresponding memory, forms what is called a NUMA node or NUMA-region. It is generally much faster for a CPU to access its local NUMA node memory, but the full 64GB is available. Accessing the “remote” NUMA region will have higher latency and may slow down performance.
Options for Optimisation
Programs on ARCHER are launched with the aprun command, which requests the resources you want and their configuration. There are a vast number of options/arguments you can specify with this command, but I'll just note the important ones here:
-n [NUMBER] - The number of “Processing Elements”, i.e. the number of instances of your executable. Just a single instance (1) in this case.
-d [NUMBER] - The number of “CPUs” to use for the program. Here a CPU refers to any core or virtual core. On the ARCHER system, each physical processor has 12 physical cores, so there are 24 “CPUs” per node (I use “CPU” hereafter). With Intel's hyperthreading technology turned on, you get double the number of logical CPUs, i.e. 48 CPUs in total.
-j 2 - Turns on hyperthreading as above. The default is off (-j 1, but there is no need to specify this if you want to leave it off).
-sn [1 or 2] - The number of NUMA regions to use per program. You can limit the CPUs that are allocated to a single NUMA node, which may (or may not) give you a performance boost. By default, processes are allocated on a single NUMA node until it is full, then allocation moves on to the next one.
-ss - “Strict segmentation”. Means that each CPU is limited to accessing the local 32GB of memory and cannot be allocated more than 32GB. If more than 32GB is needed, the program will crash.
-cc [NUMBER OR RANGE] - CPU affinity, i.e. which CPUs to allocate to. Each logical CPU on the compute node has a number in [0-23], or [0-47] with hyperthreading turned on. The numbering of CPUs is slightly counterintuitive: the first physical processor has CPUs [0-11], plus [24-35] if hyperthreading is turned on; the second physical processor has CPUs [12-23], plus [36-47] if hyperthreading is turned on.
A typical aprun command looks like this:
aprun -n 1 -d 24 -j 2 ./LSDCatchmentModel.out ./directory paramfile.params
ARCHER recommend using only a single NUMA node when running OpenMP programs (i.e. don't spread processes between physical processors), but I have actually found that in many cases LSDCatchmentModel gets the best performance from maxing out the number of CPUs, and in some cases turning on hyperthreading. There is no single rule, however, and different datasets can have different optimum compute-node settings. For the small Boscastle catchment, for example, running aprun -n 1 -d 48 -j 2 ... produced the fastest model run, which is contrary to what ARCHER recommend (they suggest not turning on hyperthreading, for example).
OpenMP threads are not the same thing as CPUs. If you have 24 CPUs, for example, your program will not automatically create 24 threads; in fact on ARCHER the default is just one thread! You can set this before you run aprun with export OMP_NUM_THREADS=24 (or however many you want). I haven't really experimented with having a different number of threads to available CPUs, so normally I just set the number of threads to the number of CPUs requested with the -d option.
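Putting the two together, a typical full-node launch (no hyperthreading) would look something like this (the directory and parameter file names are the placeholders used earlier in the post):

```shell
# 24 OpenMP threads to match the 24 CPUs requested with -d
export OMP_NUM_THREADS=24
aprun -n 1 -d 24 ./LSDCatchmentModel.out ./directory paramfile.params
```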
C++ can make use of native C libraries and header files, as long as there is nothing in the C implementation that will not compile as valid C++ (there are only a few such exceptions).
Simply writing #include "c_header.h" will not work, however. Instead, use the extern keyword like so:
Then compile as follows:
g++ -c my_c_source.c my_main.cpp -o myExec.out
(Remember to list the sources in the right order, and before the -o output name!)