Current processors have multiple cores. To make efficient use of their computing power, one option is so-called “Shared Memory” parallelization. This usually means employing multiple threads that have access to the same “shared” memory. In that area OpenMP is a de-facto standard that uses compiler directives. Clang 3.7 introduced support for OpenMP 3.1 on the host.
However, some of the most powerful systems in the world today have a heterogeneous architecture. Their compute power largely comes from accelerators, for example GPUs. This requires different approaches than parallelization on the host, one of which is referred to as “offloading”: Execution starts on the host and designated parts are sent to the attached device. As one programming model for this, OpenMP added target directives in its version 4.0 in 2013.
Clang 7.0, released in September 2018, has support for offloading to NVIDIA GPUs. In this blog post I’m going to explain how to build the Clang compiler on Linux.
0. Determine GPU Architectures
Clang’s OpenMP implementation for NVIDIA GPUs currently doesn’t support multiple GPU architectures in a single binary. This means that you have to know the target GPU when compiling an OpenMP application. Additionally Clang needs compatible runtime libraries for every architecture that you’ll want to use in the future.
So first of all you need to gather a list of GPU models that you are going to run on and map them to a list of architectures. A clearly structured table can be found on Wikipedia or in NVIDIA’s developer documentation. As an example, the “Tesla P100” has compute capability 6.0 while the more recent Volta GPU “Tesla V100” is listed with 7.0.
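The later steps need each compute capability in two spellings: without the dot (for example 60) and with an sm_ prefix (for example sm_60). As a small sketch, the conversion could be scripted like this; the function names cc_to_number and cc_to_sm are my own and not part of any tool:

```shell
#!/bin/sh
# Hypothetical helpers: convert a compute capability such as "6.0" into the
# spellings used when configuring and invoking Clang.

cc_to_number() {
    # Drop the dot: 6.0 -> 60
    printf '%s\n' "$1" | tr -d '.'
}

cc_to_sm() {
    # Add the sm_ prefix: 6.0 -> sm_60
    printf 'sm_%s\n' "$(cc_to_number "$1")"
}

# The two GPUs mentioned above: Tesla P100 (6.0) and Tesla V100 (7.0).
for cc in 6.0 7.0; do
    echo "compute capability $cc: $(cc_to_number "$cc") / $(cc_to_sm "$cc")"
done
```

Keeping the list of capabilities in one place like this makes it harder to forget an architecture later on.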
1. Install Prerequisites
Building LLVM requires some software:
- First you’ll need some standard tools like xz. If you don’t have them installed, please consult your distribution’s instructions on how to get them.
- For the build process a compiler already needs to be installed. Most Linux systems default to the GNU Compiler Collection (gcc). Please ensure that you have at least version 4.8 or refer to some online tutorials on how to install one for your system. If you happen to have an older installation of Clang, any version greater than version 3.1 should be fine.
- Additionally LLVM requires a (more or less) recent CMake, at least version 3.4.3. If your distribution doesn’t provide an adequate version, see https://cmake.org/ on how to get it.
- For the runtime libraries the system needs both libelf and its development headers.
- Last but not least, you’ll need the CUDA toolkit by NVIDIA. However the latest CUDA 10.0 is not yet compatible with Clang 7.0. I’d recommend using version 9.2. This release also has support for Volta GPUs which may already be found in some HPC systems.
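To compare an installed version against the minimums listed above, a version-aware comparison helps, because 3.10 is newer than 3.4.3 even though a plain string comparison says otherwise. The following is only a sketch: version_ge and check_tool are hypothetical helpers, the version numbers fed in are sample values, and it assumes a sort that supports -V (as GNU coreutils does):

```shell
#!/bin/sh
# version_ge A B: succeed if version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME FOUND MINIMUM: report whether FOUND satisfies MINIMUM.
check_tool() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok (need >= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
    fi
}

# Sample values only. On a real system you would take them from
# `cmake --version` and `gcc -dumpversion`.
check_tool cmake 3.10.2 3.4.3
check_tool gcc 4.4.7 4.8
```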
2. Download and Extract Sources
The LLVM project consists of multiple components. For the purpose of this post, you need at least the LLVM Core libraries, Clang and the OpenMP project. Download their tarballs from https://releases.llvm.org/:
You might also want to download and build compiler-rt. This will give you some runtime libraries that are required to use Clang’s sanitizers. A detailed explanation would go beyond the scope of this post, but I encourage everyone to take a look at the documentation of ASan, LSan, MSan, and TSan.
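As a sketch of the download step, the following prints one wget command per component instead of running it, so you can review the URLs first. It assumes the release files follow the pattern BASE/VERSION/COMPONENT-VERSION.src.tar.xz and that the Clang tarball is named cfe; please check both against the download page. The function names are my own:

```shell
#!/bin/sh
# Assumed release layout; verify against https://releases.llvm.org/
LLVM_VERSION=7.0.0
BASE_URL=https://releases.llvm.org

tarball_name() {
    printf '%s-%s.src.tar.xz\n' "$1" "$LLVM_VERSION"
}

tarball_url() {
    printf '%s/%s/%s\n' "$BASE_URL" "$LLVM_VERSION" "$(tarball_name "$1")"
}

# llvm, cfe (Clang) and openmp are required; compiler-rt is optional.
for component in llvm cfe openmp compiler-rt; do
    echo "wget $(tarball_url "$component")"
done
```

Once the printed URLs look right, the commands can be copied to a terminal or piped to sh.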
It’s highly recommended to verify the integrity of the downloaded archives. Each file has been signed by Hans Wennborg and you can find both his public key and the .sig files next to the files you have just downloaded. As correctly verifying a gpg signature is a tricky business, I’m not going to explain it here (maybe in a follow-up post?).
The next step is to unpack the tarballs (the last step may be skipped if you don’t want to build compiler-rt):
This should leave you with three or four directories named llvm-7.0.0.src, cfe-7.0.0.src, openmp-7.0.0.src, and (optionally) compiler-rt-7.0.0.src.
All these components can be built together if the directories are correctly nested:
Again the last step is optional if you are skipping compiler-rt.
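The nesting could look like the following sketch. It assumes the usual in-tree layout for this release, with Clang under tools/clang and the other projects under projects/; nest_sources is a hypothetical helper. To keep the example harmless, it runs on empty placeholder directories in a temporary location rather than on real sources:

```shell
#!/bin/sh
# nest_sources ROOT [VERSION]: move the unpacked component directories
# below ROOT into the LLVM source tree (assumed layout, see above).
nest_sources() {
    root=$1
    version=${2:-7.0.0}
    mv "$root/cfe-$version.src" "$root/llvm-$version.src/tools/clang"
    mv "$root/openmp-$version.src" "$root/llvm-$version.src/projects/openmp"
    # compiler-rt is optional, so only move it if it was unpacked.
    if [ -d "$root/compiler-rt-$version.src" ]; then
        mv "$root/compiler-rt-$version.src" \
           "$root/llvm-$version.src/projects/compiler-rt"
    fi
}

# Dry run on placeholder directories that mimic the unpacked tarballs.
demo=$(mktemp -d)
mkdir -p "$demo/llvm-7.0.0.src/tools" "$demo/llvm-7.0.0.src/projects" \
         "$demo/cfe-7.0.0.src" "$demo/openmp-7.0.0.src"
nest_sources "$demo"
find "$demo" -type d | sort
```

For the real sources you would call nest_sources with the directory that contains the unpacked tarballs.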
3. Build the Compiler
With the sources in place let’s proceed to configure and build the compiler. Projects using CMake are usually built in a separate directory:
The next steps will be pretty IO-intensive, so it might be a good idea to put the build directory on a locally attached disk (or even an SSD).
Next CMake needs to generate Makefiles which will eventually be used for compilation:
The first two flags are standard for CMake projects:
- CMAKE_BUILD_TYPE=Release turns on optimizations and disables debug information.
- CMAKE_INSTALL_PREFIX specifies where the final binaries and libraries will be installed. Be sure to choose a permanent location if you are building in a temporary directory.
The other two options are related to the GPU architectures as mentioned above:
- CLANG_OPENMP_NVPTX_DEFAULT_ARCH sets the default architecture that is used when no value is passed during compilation. You should adjust the default to match the environment you’ll be using most of the time. The architecture must be prefixed with sm_, so Clang configured with the above command will build for the Tesla P100 by default.
- LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES applies to the runtime libraries: It specifies a list of architectures that the libraries will be built for. As you cannot run on GPUs without a compatible runtime, you should pass all architectures you care about. Also, please note that the values are passed without the dot, so compute capability 7.0 becomes 70.
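Putting the four flags together, a configure command could be assembled as in the sketch below. The install prefix, the relative source path and the capability list are assumptions that you need to adapt; cmake_command is a hypothetical helper that only prints the command so it can be reviewed before running it from inside the build directory:

```shell
#!/bin/sh
# Values to adapt (all of them are assumptions for this sketch).
INSTALL_PREFIX=/opt/clang-7.0.0   # hypothetical permanent install location
DEFAULT_ARCH=sm_60                # Tesla P100 as the default target
CAPABILITIES=60,70                # P100 and V100, written without the dot
SOURCE_DIR=../llvm-7.0.0.src      # assumes the build directory sits next to it

cmake_command() {
    printf 'cmake'
    # printf repeats the format for every argument, one flag each.
    printf ' %s' \
        "-DCMAKE_BUILD_TYPE=Release" \
        "-DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX" \
        "-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=$DEFAULT_ARCH" \
        "-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=$CAPABILITIES" \
        "$SOURCE_DIR"
    printf '\n'
}

# Print the command for review.
cmake_command
```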
If everything went right you should see something like the following towards the end of the output:
In this case the system also has libffi installed which allows building a plugin that offloads to the host (here:
This is mostly used for testing and not required for offloading to GPUs.
Now comes the time-consuming part:
With the -j parameter (short for --jobs) you can allow make to run multiple commands concurrently. Usually the number of cores in your server is a reasonable choice which can speed up the compilation by a good deal.
Afterwards the built libraries and binaries need to be installed:
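As a sketch of the build and install steps, the snippet below derives the job count from the number of cores and prints the two make commands. It assumes nproc from GNU coreutils and falls back to a single job if it is missing; job_count is a hypothetical helper:

```shell
#!/bin/sh
# job_count: number of parallel make jobs, one per core if nproc exists.
job_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 1
    fi
}

# Commands to run inside the build directory, printed for review.
echo "make -j$(job_count)"
echo "make install"
```

On a shared machine you may want to pick a smaller number than the core count to leave room for other users.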
4. Rebuild the OpenMP Runtime Libraries with Clang
If you tried to compile an application with OpenMP offloading right now, Clang would print the following message:
clang-7: warning: No library ‘libomptarget-nvptx-sm_60.bc’ found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]
As you’d expect from a warning, you can run perfectly fine without these “bitcode libraries”. However, GPUs are meant to be accelerators, so you want your application to run as fast as possible. To get the missing libraries you’ll need to recompile the OpenMP project using the Clang built in the previous step.
Instead of only rebuilding the OpenMP project, it’s also possible to repeat step 3 entirely.
That’s usually referred to as “bootstrapping” because Clang is compiling its own source code.
I usually prefer doing this when installing a released version of a compiler.
Anyway, I’ll explain building only the OpenMP runtime libraries which will get you the required files much faster.
To do so, first create a new build directory:
Now configure the project with CMake using the Clang compiler built in the previous step:
The flags are the same as above except that we want to use a different compiler. With CMake this can be adjusted with the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER variables. If you installed the binaries to a different location, you need to adapt their values accordingly.
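A configure command for this second build might be put together as sketched below. It reuses the flags from step 3 and adds the two compiler variables. The install prefix and the path to the OpenMP sources are assumptions (the latter presumes the nested layout from step 2), and openmp_cmake_command is a hypothetical helper that only prints the command:

```shell
#!/bin/sh
# Values to adapt (assumptions for this sketch).
INSTALL_PREFIX=/opt/clang-7.0.0                    # where step 3 installed Clang
DEFAULT_ARCH=sm_60
CAPABILITIES=60,70
OPENMP_SOURCE=../llvm-7.0.0.src/projects/openmp    # assumed nested source path

openmp_cmake_command() {
    printf 'cmake'
    printf ' %s' \
        "-DCMAKE_BUILD_TYPE=Release" \
        "-DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX" \
        "-DCMAKE_C_COMPILER=$INSTALL_PREFIX/bin/clang" \
        "-DCMAKE_CXX_COMPILER=$INSTALL_PREFIX/bin/clang++" \
        "-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=$DEFAULT_ARCH" \
        "-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=$CAPABILITIES" \
        "$OPENMP_SOURCE"
    printf '\n'
}

# Print the command for review; run it from the new build directory.
openmp_cmake_command
```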
Build and install the OpenMP runtime libraries:
This should give you some libomptarget-nvptx-sm_??.bc libraries as mentioned in the warning message.
5. Use Compiler for OpenMP Applications
Following the instructions up to this point you should now have a fully working Clang compiler with support for OpenMP offloading! To use it, you’ll need to export some environment variables:
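As a sketch, I’m assuming the relevant variables are PATH (so that the shell finds the new clang) and LD_LIBRARY_PATH (so that the OpenMP runtime libraries are found at run time). The install prefix below is a hypothetical value that has to match your CMAKE_INSTALL_PREFIX:

```shell
#!/bin/sh
INSTALL_PREFIX=/opt/clang-7.0.0   # hypothetical; use your CMAKE_INSTALL_PREFIX

# Find the freshly installed clang before any system-wide one.
export PATH="$INSTALL_PREFIX/bin:$PATH"

# Let the dynamic linker find the runtime libraries. The ${VAR:+...} form
# avoids a trailing colon when LD_LIBRARY_PATH was empty before.
export LD_LIBRARY_PATH="$INSTALL_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Putting these lines into a small file that you source keeps them out of your login scripts until you are sure about the setup.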
Afterwards you are good to compile an application that uses OpenMP offloading:
This will use the default GPU architecture specified by CLANG_OPENMP_NVPTX_DEFAULT_ARCH in step 3. Alternatively you can override that choice by adding -Xopenmp-target -march=sm_70 to the invocation.
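The two variants of the invocation can be sketched as follows. The snippet assumes the usual offloading flags -fopenmp and -fopenmp-targets=nvptx64-nvidia-cuda; offload_command is a hypothetical helper and app.c a placeholder file name. It only prints the commands:

```shell
#!/bin/sh
# offload_command SOURCE [ARCH]: print a clang invocation for OpenMP
# offloading to NVIDIA GPUs, optionally overriding the GPU architecture.
offload_command() {
    src=$1
    arch=${2:-}
    cmd="clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda"
    if [ -n "$arch" ]; then
        cmd="$cmd -Xopenmp-target -march=$arch"
    fi
    printf '%s %s\n' "$cmd" "$src"
}

# Default architecture from step 3, then an explicit override for Volta.
offload_command app.c
offload_command app.c sm_70
```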
Depends on what you want to try! For a start you can read / watch / attend tutorials about how to use OpenMP offloading. The next step could be to start playing around and / or adding support for OpenMP offloading to an existing HPC application.
Some links if you are running into problems: The current release has some limitations as listed in the documentation. But if there is something broken that’s supposed to work, please file a bug in LLVM’s Bugzilla.
If you are feeling more adventurous than sticking to a released version of the compiler, you can also try to compile the current development version. This should basically work the same as explained above, except that you are checking out the sources from Subversion.
You do not need to agree with my opinions expressed in this blog post, and I'm fine with different views on certain topics. However, if there is a technical fault please send me a message so that I can correct it!