Building and Using LLVM/Clang 7.0 with OpenMP Offloading to NVIDIA GPUs

Processors nowadays have multiple cores. One way to make efficient use of their computing power is so-called “Shared Memory” parallelization, which usually means employing multiple threads that have access to the same “shared” memory. In that area OpenMP, which is based on compiler directives, is the de facto standard. Clang 3.7 introduced support for OpenMP 3.1 on the host.

However, some of today’s most powerful systems have a heterogeneous architecture: their compute power largely comes from accelerators such as GPUs. This requires different approaches than parallelization on the host, one of which is referred to as “offloading”: execution starts on the host and designated parts are sent to the attached device. OpenMP has offered a programming model for this since version 4.0, released in 2013, which added the target directives.

Clang 7.0, released in September 2018, has support for offloading to NVIDIA GPUs. In this blog post I’m going to explain how to build the Clang compiler on Linux.

0. Determine GPU Architectures

Clang’s OpenMP implementation for NVIDIA GPUs currently doesn’t support multiple GPU architectures in a single binary. This means that you have to know the target GPU when compiling an OpenMP application. Additionally Clang needs compatible runtime libraries for every architecture that you’ll want to use in the future.

So first of all you need to gather a list of GPU models that you are going to run on and map them to a list of architectures. A clearly structured table can be found on Wikipedia or in NVIDIA’s developer documentation. As an example, the “Tesla P100” has compute capability 6.0 while the more recent Volta GPU “Tesla V100” is listed with 7.0.
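
As a small illustration, the lookup can be written down as a shell function. This is only a sketch: the name gpu_compute_capability is made up for this post, and the table holds just the two models mentioned above plus the Tesla K40 (compute capability 3.5). Extend it from the tables linked above for your own hardware.

```shell
# Hypothetical helper: map a GPU model name to its compute capability.
# Only a handful of models are listed; add yours from NVIDIA's table.
gpu_compute_capability() {
  case "$1" in
    "Tesla K40")  echo "3.5" ;;
    "Tesla P100") echo "6.0" ;;
    "Tesla V100") echo "7.0" ;;
    *) echo "unknown GPU model: $1" >&2; return 1 ;;
  esac
}

gpu_compute_capability "Tesla P100"
```

Keeping the mapping in one place makes it easier to derive the architecture values needed in step 3.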

1. Install Prerequisites

Building LLVM requires some software:

First you’ll need some standard tools like make, tar, and xz. If you don’t have them installed, please consult your distribution’s instructions on how to get them.
For the build process a compiler already needs to be installed. Most Linux systems default to the GNU Compiler Collection (gcc). Please ensure that you have at least version 4.8 or refer to some online tutorials on how to install one for your system. If you happen to have an older installation of Clang, version 3.1 or newer should be fine.
Additionally LLVM requires a (more or less) recent CMake, at least version 3.4.3. If your distribution doesn’t provide an adequate version, see https://cmake.org/ on how to get it.
For the runtime libraries the system needs both libelf and its development headers.
Last but not least, you’ll need the CUDA toolkit by NVIDIA. However the latest CUDA 10.0 is not yet compatible with Clang 7.0. I’d recommend using version 9.2. This release also has support for Volta GPUs which may already be found in some HPC systems.
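
A quick way to see which of these tools are missing is a small loop over command -v. The following is a sketch, not an official check: the function name missing_tools is made up, nvcc merely stands in for “the CUDA toolkit is in PATH”, and libelf with its headers is not covered because it isn’t a command.

```shell
# Hypothetical helper: print every tool from the argument list that is not
# found in PATH; the exit status is non-zero if anything is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Tools named in the list above; nvcc stands in for the CUDA toolkit.
missing_tools make tar xz gcc cmake nvcc || echo "install the missing tools first"

# Print the versions so they can be compared against the minimums by hand.
if command -v gcc >/dev/null 2>&1; then gcc --version | head -n 1; fi
if command -v cmake >/dev/null 2>&1; then cmake --version | head -n 1; fi
```

The versions are only printed, not compared; checking them against 4.8 and 3.4.3 is left to the reader.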

2. Download and Extract Sources

The LLVM project consists of multiple components. For the purpose of this post, you need at least the LLVM Core libraries, Clang and the OpenMP project. Download their tarballs from https://releases.llvm.org/:

 $ wget https://releases.llvm.org/7.0.0/llvm-7.0.0.src.tar.xz
 $ wget https://releases.llvm.org/7.0.0/cfe-7.0.0.src.tar.xz
 $ wget https://releases.llvm.org/7.0.0/openmp-7.0.0.src.tar.xz

You might also want to download and build compiler-rt:

 $ wget https://releases.llvm.org/7.0.0/compiler-rt-7.0.0.src.tar.xz

This will give you some runtime libraries that are required to use Clang’s sanitizers. A detailed explanation would go beyond the scope of this post, but I encourage everyone to take a look at the documentation of ASan, LSan, MSan, and TSan.

It’s highly recommended to verify the integrity of the downloaded archives. Each file has been signed by Hans Wennborg and you can find both his public key and .sig files next to the files you have just downloaded. As correctly verifying a gpg signature is a tricky business, I’m not going to explain it here (maybe in a follow-up post?).

The next step is to unpack the tarballs (the last command may be skipped if you don’t want to build compiler-rt):

 $ tar xf llvm-7.0.0.src.tar.xz
 $ tar xf cfe-7.0.0.src.tar.xz
 $ tar xf openmp-7.0.0.src.tar.xz
 $ tar xf compiler-rt-7.0.0.src.tar.xz

This should leave you with three or four directories named llvm-7.0.0.src, cfe-7.0.0.src, openmp-7.0.0.src, and (optionally) compiler-rt-7.0.0.src. All these components can be built together if the directories are correctly nested:

 $ mv cfe-7.0.0.src llvm-7.0.0.src/tools/clang
 $ mv openmp-7.0.0.src llvm-7.0.0.src/projects/openmp
 $ mv compiler-rt-7.0.0.src llvm-7.0.0.src/projects/compiler-rt

Again the last step is optional if you are skipping compiler-rt.

3. Build the Compiler

With the sources in place let’s proceed to configure and build the compiler. Projects using CMake are usually built in a separate directory:

 $ mkdir build
 $ cd build

The next steps will be pretty IO-intensive, so it might be a good idea to put the build directory on a locally attached disk (or even an SSD).

Next CMake needs to generate Makefiles¹ which will eventually be used for compilation:

 $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
	-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 \
	-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 ../llvm-7.0.0.src

The first two flags are standard for CMake projects: CMAKE_BUILD_TYPE=Release turns on optimizations and disables debug information. CMAKE_INSTALL_PREFIX specifies where the final binaries and libraries will be installed. Be sure to choose a permanent location if you are building in a temporary directory.

The other two options are related to the GPU architectures mentioned above. CLANG_OPENMP_NVPTX_DEFAULT_ARCH sets the architecture that Clang targets when none is given on the command line. You should adjust the default to match the environment you’ll be using most of the time. The architecture must be prefixed with sm_, so Clang configured with the above command will build for compute capability 6.0 (for example the Tesla P100) by default.
LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES applies to the runtime libraries: It specifies a list of architectures that the libraries will be built for. As you cannot run on GPUs without a compatible runtime, you should pass all architectures you care about. Also, please note that the values are passed without the dot, so compute capability 7.0 becomes 70.
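
Both spellings can be derived mechanically from the compute capabilities gathered in step 0. A minimal sketch, assuming the capabilities are written with a dot as in NVIDIA’s tables; the function names capability_to_sm and capability_list are made up for illustration.

```shell
# Hypothetical helpers: turn compute capabilities such as "6.0" into the
# spellings expected by the two CMake options.

# "6.0" -> "sm_60" (for CLANG_OPENMP_NVPTX_DEFAULT_ARCH)
capability_to_sm() {
  echo "sm_$(echo "$1" | tr -d '.')"
}

# "3.5 6.0 7.0" -> "35,60,70" (for LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES)
capability_list() {
  echo "$*" | tr -d '.' | tr ' ' ','
}

echo "-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=$(capability_to_sm 6.0)"
echo "-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=$(capability_list 3.5 6.0 7.0)"
```

The two echo lines only print the options; paste them into the cmake invocation or use the command substitutions there directly.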

If everything went right you should see something like the following towards the end of the output:

-- Found LIBOMPTARGET_DEP_LIBELF: /usr/lib64/libelf.so
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1") 
-- Found LIBOMPTARGET_DEP_LIBFFI: /usr/lib64/libffi.so
-- Found LIBOMPTARGET_DEP_CUDA_DRIVER: <<<REDACTED>>>/libcuda.so
-- LIBOMPTARGET: Building offloading runtime library libomptarget.
-- LIBOMPTARGET: Not building aarch64 offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Building CUDA offloading plugin.
-- LIBOMPTARGET: Not building PPC64 offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Not building PPC64le offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Building x86_64 offloading plugin.
-- LIBOMPTARGET: Building CUDA offloading device RTL.

In this case the system also has libffi installed which allows building a plugin that offloads to the host (here: x86_64). This is mostly used for testing and not required for offloading to GPUs.
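
If you would rather not scan the output by eye, the two CUDA-related lines can be checked mechanically. This is a sketch that assumes you saved the configure output to a file, for example by appending 2>&1 | tee cmake.log to the cmake command; the function name and the log file name are made up here.

```shell
# Hypothetical helper: succeed only if the saved CMake output contains both
# CUDA-related status lines shown above.
cuda_offloading_enabled() {
  grep -qF "LIBOMPTARGET: Building CUDA offloading plugin." "$1" &&
    grep -qF "LIBOMPTARGET: Building CUDA offloading device RTL." "$1"
}

# Only look at the log if it exists in the current directory.
if [ -f cmake.log ]; then
  if cuda_offloading_enabled cmake.log; then
    echo "CUDA offloading will be built"
  else
    echo "CUDA offloading is not enabled, re-check the prerequisites from step 1"
  fi
fi
```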

Now comes the time-consuming part:

 $ make -j8

Using the -j parameter (short for --jobs) you can allow make to run multiple commands concurrently. Usually the number of cores in your server is a reasonable choice which can speed up the compilation by a good deal.
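
Instead of hard-coding 8, the job count can be taken from the machine. A small sketch, assuming nproc from GNU coreutils is available (it falls back to getconf and finally to 1); the function name build_jobs is made up.

```shell
# Hypothetical helper: print the number of parallel jobs to hand to make.
# nproc is part of GNU coreutils; getconf is the fallback, then 1.
build_jobs() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Show the resulting command instead of starting the build right away.
echo "make -j$(build_jobs)"
```

On a shared login node you may want to pass a smaller number than the one printed here.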

Afterwards the built libraries and binaries need to be installed:

 $ make -j8 install

4. Rebuild the OpenMP Runtime Libraries with Clang

If you tried to compile an application with OpenMP offloading right now, Clang would print the following message:

clang-7: warning: No library 'libomptarget-nvptx-sm_60.bc' found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]

As you’d expect from a warning, applications run perfectly fine without these “bitcode libraries”. However, a GPU is meant to be an accelerator, so you want your application to run as fast as possible. To get the missing libraries you’ll need to recompile the OpenMP project using the Clang built in the previous step.

Instead of only rebuilding the OpenMP project, it’s also possible to repeat step 3 entirely. That’s usually referred to as “bootstrapping” because Clang is compiling its own source code. I usually prefer doing this when installing a released version of a compiler.
Anyway, I’ll explain building only the OpenMP runtime libraries which will get you the required files much faster.

To do so, first create a new build directory:

 $ cd ..
 $ mkdir build-openmp
 $ cd build-openmp

Now configure the project with CMake using the Clang compiler built in the previous step:

 $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
	-DCMAKE_C_COMPILER=$(pwd)/../install/bin/clang \
	-DCMAKE_CXX_COMPILER=$(pwd)/../install/bin/clang++ \
	-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
	../llvm-7.0.0.src/projects/openmp

The flags are the same as above except that we want to use a different compiler. With CMake this can be adjusted with CMAKE_C_COMPILER and CMAKE_CXX_COMPILER. If you installed the binaries to a different location, you need to adapt their values accordingly.

Build and install the OpenMP runtime libraries:

 $ make -j8
 $ make -j8 install

This should give you some libomptarget-nvptx-sm_??.bc libraries as mentioned in the warning message.
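
To confirm that a library exists for every architecture you configured, you can loop over the list. A sketch under two assumptions: the files are installed into the lib directory below the install prefix, and the function name missing_bitcode_libs is made up for this post.

```shell
# Hypothetical helper: list the bitcode libraries that are missing from a
# library directory, one per line; non-zero exit status if any is missing.
# Usage: missing_bitcode_libs <libdir> <capability>...
missing_bitcode_libs() {
  libdir=$1
  shift
  rc=0
  for cap in "$@"; do
    lib="$libdir/libomptarget-nvptx-sm_${cap}.bc"
    if [ ! -f "$lib" ]; then
      echo "$lib"
      rc=1
    fi
  done
  return "$rc"
}

# Run from the build-openmp directory; assumes the files land in install/lib.
missing_bitcode_libs ../install/lib 35 60 70 || echo "some bitcode libraries are missing"
```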

5. Use Compiler for OpenMP Applications

Following the instructions up to this point you should now have a fully working Clang compiler with support for OpenMP offloading! To use it, you’ll need to export some environment variables:

 $ cd ..
 $ export PATH=$(pwd)/install/bin:$PATH
 $ export LD_LIBRARY_PATH=$(pwd)/install/lib:$LD_LIBRARY_PATH

Afterwards you are good to compile an application that uses OpenMP offloading:

 $ clang -fopenmp -fopenmp-targets=nvptx64 -O2 application.c

This will use the default GPU architecture specified by CLANG_OPENMP_NVPTX_DEFAULT_ARCH in step 3. Alternatively you can override that choice by adding -Xopenmp-target -march=sm_70 to the invocation.
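
If you don’t have an application at hand, a tiny test program is enough to try the compiler. The following sketch writes a SAXPY-style loop with a target directive to application.c, which can then be compiled with the command above. The function name write_sample_app and the program itself are made up for illustration. Note that the built-in check also passes if the loop falls back to the host, so it does not prove that the code ran on the GPU.

```shell
# Hypothetical helper: write a minimal OpenMP offloading test program.
write_sample_app() {
  cat > "$1" <<'EOF'
#include <stdio.h>

int main(void) {
  enum { N = 1024 };
  float x[N], y[N];
  const float a = 2.0f;

  for (int i = 0; i < N; i++) {
    x[i] = (float)i;
    y[i] = 1.0f;
  }

  /* Offload the loop: x is copied to the device, y is copied there and back. */
  #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
  for (int i = 0; i < N; i++)
    y[i] = a * x[i] + y[i];

  /* Every y[i] should now equal 2*i + 1. */
  int errors = 0;
  for (int i = 0; i < N; i++)
    if (y[i] != 2.0f * (float)i + 1.0f)
      errors++;

  printf("errors: %d\n", errors);
  return errors != 0;
}
EOF
}

write_sample_app application.c
```

Running the resulting a.out should report zero errors. A profiler from the CUDA toolkit such as nvprof is one way to check whether a kernel was actually launched on the device.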

What’s next?

Depends on what you want to try! For a start you can read / watch / attend tutorials about how to use OpenMP offloading. The next step could be to start playing around and / or adding support for OpenMP offloading to an existing HPC application.

Some links if you are running into problems: The current release has some limitations as listed in the documentation. But if there is something broken that’s supposed to work, please file a bug in LLVM’s Bugzilla.

If you are feeling adventurous and a released version of the compiler is not enough, you can also try to compile the current trunk. This should basically work the same as explained above, except that you check out the sources from Subversion.

¹ CMake is also able to generate rules.ninja for the Ninja build system. It is “designed to run builds as fast as possible”, but this doesn’t really pay off when building a release only once.

You do not need to agree with my opinions expressed in this blog post, and I'm fine with different views on certain topics. However, if there is a technical fault please send me a message so that I can correct it!