sched_yield() and the LLVM OpenMP runtime

End of last year, I worked on a bug report in LLVM’s bugzilla. The initial report said that the OpenMP test suite deadlocks for multiple people, quoting a downstream bug report filed against Gentoo’s sys-libs/libomp package. The problem was supposed to be reliably reproducible with a specific test (worksharing/for/kmp_sch_simd_guided.c). A provided log hinted to 32-bit builds which became my initial suspect: Most developers of the OpenMP runtime (Intel’s runtime team being a large part) are working on x86_64, so testing on i386 (or any i?86 if you prefer) is somewhat limited. In addition, there were multiple reports for 32-bit builds from the same developer that turned out be bugs in the code. In this case however, it was soon clarified that the problem affected 64-bit builds as well.

Ok, so based on my previous experience with said developer I never doubted that he was seeing the problem. But why did nobody else notice the problem if it affected x86_64? Neither continuous integration via LLVM’s buildbot nor the release testers reported problems. And developers run the test suite once in a while to make sure there are no regressions when working on new features, bug fixes, or performance improvements. So while the automatic testing could surely be improved (different compiler versions, all supported architectures), I’m pretty sure that the community would notice problems affecting all configurations.

The “Priority and Deadline based Skiplist multiple queue Scheduler” (PDS)

My next suspect was the operating system, more precisely the Linux distribution. After testing without success (well, without failures) on Arch Linux, CentOS, and Ubuntu I came to the conclusion that I needed a Gentoo system. Setting up a virtual machine with Gentoo was on my list for a long time anyway, so this was a good time to tackle this point. After one or two days (Gentoo is a source-based distribution, so you compile all packages locally on your machine), the system was up and running. But to my surprise (and disappointment?) the tests were passing on my Gentoo system in a reasonable time.

Fortunately, the Gentoo developer was more successful in finding the difference between the systems: After some time he noticed that the “deadlock” was only observed on systems with a small number of physical cores (dual-core Athlon64). However, some tests statically start more threads, up to four in the already mentioned test worksharing/for/kmp_sch_simd_guided.c. He was able to back this theory by limiting the available cores on a larger machine with taskset – which also resulted in “deadlocks”. We had already ruled out preemption before (it was activated on all systems we tested), but this fact took us into the right direction.

After a short sidestep into processor features1, the final suspect was the Linux kernel. In fact, the Gentoo developer wrote that he was using the PDS scheduler and switching to a different one “also reduced the CPU load considerably”. Before that mention I had never heard of that scheduler and always used the default scheduler that my distribution provided me. But the source code is on GitHub, so after a bit of research I soon got to blame a commit removing yield support. Later on this was confirmed to be the cause so that reverting the commit solves the problem. This was communicated to the developer of PDS who promised to look into the issue for future releases.

Barriers in the LLVM OpenMP runtime

So we found the difference between the systems and the component to “blame” (more on that below), good. We even figured out what has changed and know that reverting that change restores the old behavior, fine. But all of this doesn’t explain why removing support for sched_yield() affects the OpenMP runtime. Based on my previous knowledge of the code, I had an idea of what might be happening but some details were missing. And after all, I wanted to answer the most interesting question: Is it the scheduler’s “fault” or does libomp make wrong assumptions that PDS doesn’t meet (in its current form)?

To understand the answer, I need to explain some implementation details of the LLVM OpenMP runtime. First, the formal definition of a barrier from the latest OpenMP 4.5:

A point in the execution of a program encountered by a team of threads, beyond which no thread in the team may execute until all threads in the team have reached the barrier […].

In libomp all worker threads synchronize to the first thread (called master thread in OpenMP) which waits until all threads have arrived. Afterwards it wakes up all worker threads to resume execution behind the barrier.

In fact there are more sophisticated implementations that use “hierarchical” synchronization: Thread i synchronizes to thread j which synchronizes to the master thread (or even deeper chains). However, explaining these algorithms would probably be enough for a blog post on its own2. For only few threads we can assume that the worker threads synchronize directly to the master thread. And that’s where the story starts to get interesting – and sched_yield() comes into play.

sched_yield() is defined by POSIX and its description reads:

The sched_yield() function shall force the running thread to relinquish the processor until it again becomes the head of its thread list. It takes no arguments.

Section “Scheduling Policies” explains in its second paragraph:

There is, conceptually, one thread list for each priority. A runnable thread will be on the thread list for that thread’s priority.

The Linux man page of sched_yield() combines this into the following description:

sched_yield() causes the calling thread to relinquish the CPU. The thread is moved to the end of the queue for its static priority and a new thread gets to run.

The LLVM OpenMP runtime doesn’t change the priority of the threads it manages. This means that all threads are in the same list because they have the same priority. Consequently, libomp uses the following scheme when waiting for another thread:

while (!done) {
  sched_yield();
}

This tells the operating system that this thread has currently no work to do, hoping that another thread can start executing. In the best case, the CPU starts executing another thread of the same application which can possibly also arrive at the barrier. This will allow the whole applicatin to pass through the barrier and continue execution.

The result

So what happens if sched_yield() does nothing as currently implemented in PDS? A correct program won’t deadlock since preemption will suspend a thread after it has used its quantum and switch to a another one. But a quantum is 4ms by default on Linux (CONFIG_HZ = 250) which is a lot longer than calling sched_yield(). In the end, this means that barriers take significantly longer compared to schedulers that correctly implement sched_yield().

To showcase this statement, let’s look at an example and estimate a lower bound for the previously mentioned test case, worksharing/for/kmp_sch_simd_guided.c. Its main function runs a loop, varying the number of threads from 1 to 4. For simplicity, let’s only look at the maximum of 4 threads when running on 2 physical cores. In that case, you need to a wait at least 2 quanta for the necessary thread switches to happen by preemption for a single barrier3.

To get the number of barriers, I took the loop structure from the original test code: Only counting the loop iterations with 4 threads, there are a total of 295,488 barriers. Multiplied with 2 quanta per barrier and a default quantum of 4ms, this sums up to about 40 minutes! And this lower bound ignores the barriers with 3 threads that will also need preemption if sched_yield() isn’t supported. For comparison: When restricting this test to 2 CPUs on my Dell XPS 13, it passes with all checks in a bit more than one second…

Conclusion

Okay, time to wrap up the story and draw some conclusions: It all started with a bug report about deadlocks in the tests for the LLVM OpenMP runtime. This was traced down to the scheduler removing support for sched_yield() and doing nothing when called. In my opinion that’s not compliant with the functionality that the POSIX standard describes. In this case, the change might not be visible on a desktop system but the impact on the described implementation of a barrier is disastrous.

After all: Hey, there is a reason we have standards for all kind of things. They are supposed to provide functionality that applications and libraries can build on. For Unix-like operating systems, POSIX defines the basic syscalls – and they better work the way they are meant to!

  1. I was able to rule out “Intel vs. AMD” by testing on my old AMD Phenom II and setting up a VM emulating an older CPU. 

  2. If you are really interested, take a look at kmp_barrier.cpp. At the time of writing, there are linear, tree, hyper, and hierarchical

  3. Suppose two worker threads are active, encounter a barrier, and signal the master thread that they have arrived. After a quantum, the other two threads start executing, and the master thread sees that all threads have arrived and unblocks execution. The two worker threads that ran originally can only pass through the barrier after the next quantum expired, and there we go again. 

You do not need to agree with my opinions expressed in this blog post, and I'm fine with different views on certain topics. However, if there is a technical fault please send me a message so that I can correct it!