Futexes: a translation dictionary

Table of contents

Futexes are clearly the best low-level synchronisation mechanism. But although futexes are available on most popular platforms, the specifics vary greatly, both in terms of what operations are exposed and in how they are exposed. This page aims to establish equivalences, to make it easy to implement and port concurrency runtimes.

The baseline level of supported features includes:

Some interesting features are supported on many, but not all platforms:

macOS Linux FreeBSD Windows OpenBSD
64-bit futexes X X X
Wake n X X X
Inter-process futexes X X X X

Unix commonalities

A few things are common to all unices described here (but not to Windows):

macOS

The organisation here will largely be cheatsheet-style, with pointers to further documentation if you need to go deeper, but I will spend a bit longer on macOS, because the interface is undocumented, and I had to read through the XNU source and do my own experiments to find much of this information; hence, I will try to make this a comprehensive, authoritative source on macOS futexes, maintaining it until such time as Apple publishes their own official documentation. Despite the fact that these interfaces are not officially documented, I would not worry too much about their going away or changing incompatibly, considering that they are used by LLVM’s implementation of the C++ standard library.

The relevant routines are provided by libSystem, but you must declare the prototypes and constants yourself:

int __ulock_wait(uint32_t operation, void *addr, uint64_t value, uint32_t timeout_us);
int __ulock_wait2(uint32_t operation, void *addr, uint64_t value, uint64_t timeout_ns, uint64_t value2);
int __ulock_wake(uint32_t operation, void *addr, uint64_t wake_value);

// operation bits [7, 0] contain the operation code.
#define UL_COMPARE_AND_WAIT		1
#define UL_UNFAIR_LOCK			2
#define UL_COMPARE_AND_WAIT_SHARED	3
#define UL_UNFAIR_LOCK64_SHARED		4
#define UL_COMPARE_AND_WAIT64		5
#define UL_COMPARE_AND_WAIT64_SHARED	6

// operation bits [15, 8] contain the flags for __ulock_wake
#define ULF_WAKE_ALL			0x00000100
#define ULF_WAKE_THREAD			0x00000200
#define ULF_WAKE_ALLOW_NON_OWNER	0x00000400

// operation bits [23, 16] contain the flags for __ulock_wait
#define ULF_WAIT_WORKQ_DATA_CONTENTION	0x00010000
#define ULF_WAIT_CANCEL_POINT		0x00020000
#define ULF_WAIT_ADAPTIVE_SPIN		0x00040000

// operation bits [31, 24] contain the generic flags, which can be used with both __ulock_wait and __ulock_wake
#define ULF_NO_ERRNO			0x01000000

__ulock_wait2 is only available as of macOS 11 and later; coincidentally, this is also the first release that supported ARM CPUs, so for ARM targets, you may assume its existence. A timeout of 0 means the futex will never time out.

The operation codes containing 64 take addr to point to a 64-bit value; the remainder take it to point to a 32-bit value. The operation codes containing _SHARED must be used for a lock which is shared between multiple processes; their unshared counterparts can be used to improve performance when the lock is only used from one process. The operation code passed by a thread into __ulock_wait must match the code used by another thread to __ulock_wake it; for instance, you cannot put a thread to sleep with UL_COMPARE_AND_WAIT and wake it up with UL_COMPARE_AND_WAIT64 or UL_COMPARE_AND_WAIT_SHARED.

UL_COMPARE_AND_WAIT[64][_SHARED] straightforwardly implement the corresponding futex operations; UL_UNFAIR_LOCK is similar to UL_COMPARE_AND_WAIT, but is intended for implementing mutexes, telling the kernel the current holder of the mutex, which it can use to make more intelligent scheduling decisions. In particular, UL_UNFAIR_LOCK takes the high 30 bits of the value to be a thread identifier; all 32 are significant for comparison, so the low 2 can be used for userspace communication protocols. (UL_UNFAIR_LOCK64_SHARED does not seem to do anything, but presumably it was meant to be an interprocess alternative to UL_UNFAIR_LOCK.)

Apple’s pthreads implementation finds the thread id of the current thread thus; I don’t know if there is a better way or if this interface is stable:

#define __TSD_MACH_THREAD_SELF 3
#if defined(__x86_64__)
__attribute__((always_inline)) static inline void *_os_tsd_get_direct(unsigned long slot) {
	void *ret;
	__asm__("mov %%gs:%1, %0" : "=r" (ret) : "m" (*(void **)(slot * sizeof(void *))));
	return ret;
}
#else // __x86_64__
__attribute__((always_inline, const)) static inline void **_os_tsd_get_base(void) {
#if defined(__arm__)
	uintptr_t tsd;
	__asm__("mrc p15, 0, %0, c13, c0, 3\n"
	        "bic %0, %0, #0x3\n" : "=r" (tsd));
#elif defined(__arm64__)
	uint64_t tsd;
	__asm__ ("mrs %0, TPIDRRO_EL0" : "=r" (tsd));
#endif
	return (void**)(uintptr_t)tsd;
}
__attribute__((always_inline)) static __inline__ void *_os_tsd_get_direct(unsigned long slot) {
	return _os_tsd_get_base()[slot];
}
#endif
uint32_t get_threadid(void) {
        return (uint32_t)(uintptr_t)_os_tsd_get_direct(__TSD_MACH_THREAD_SELF); 
}

The low two bits of the result of get_threadid may be set, so you will have to mask them off if you want to use them as part of an UL_UNFAIR_LOCK-based protocol (indeed, I think perhaps they’re always set?).

Passing ULF_WAKE_ALL to __ulock_wake will wake all waiting threads (the default is just to wake one). Passing ULF_WAKE_THREAD will wake only the thread specified by its id in wake_value; if that thread is not currently waiting on the address, no one will be woken up. The thread id is the same as that returned by get_threadid above, but for ULF_WAKE_THREAD, the low two bits are significant and must be preserved. Passing ULF_WAKE_ALLOW_NON_OWNER must be done for UL_UNFAIR_LOCK if the waker was not the holder of the lock.

Passing ULF_WAIT_CANCEL_POINT to __ulock_wait marks the wait as a cancellation point; if this thread has been cancelled (as in pthread_cancel(3)), it may be terminated at that point. ULF_WAIT_ADAPTIVE_SPIN can be used together with UL_UNFAIR_LOCK as a hint to the kernel to spin for a bit on the lock (20µs, by default), if and only if the holder of lock is currently running. Regarding ULF_WAIT_WORKQ_DATA_CONTENTION I have nothing more intelligent to say than the kernel sources themselves:

The waiter is contending on this lock for synchronization around global data. This causes the workqueue subsystem to not create new threads to offset for waiters on this lock.

And:

The scheduler has a callback (sched_call) that some subsystems use to decide whether more threads should be thrown at a given problem by trying to maintain a good level of concurrency.

When the wait will not be helped by adding more threads (e.g. lock contention), using this flag will prevent the next wait/block to cause thread creation.

By default, __ulock_wait and __ulock_wake both return -1 on error and set errno; ULF_NO_ERRNO will make them return errors as negative error codes instead of setting errno. On success, __ulock_wait returns the number of other threads waiting on the same address, and __ulock_wake just returns 0.

Linux

Linux has a very expansive ‘futex’ interface, documented in futex(2), and treating it is out of scope; the most interesting feature is probably FUTEX_WAKE_BITSET.

#define FUTEX_WAIT 0
#define FUTEX_WAKE 1
#define FUTEX_PRIVATE_FLAG 128
syscall(SYS_futex, uint32_t *addr, FUTEX_WAIT, uint32_t value, struct timespec *timeout)
syscall(SYS_futex, uint32_t *addr, FUTEX_WAKE, int howmany)

The manual page says that howmany is an unsigned integer, but it is in fact signed; pass the greatest signed integer to wake all waiters.

The timeout is relative. Passing a timeout of NULL nominally means the futex will never time out; however, it has the secondary effect of preventing the futex from being interrupted by signals (this may or may not be a bug), so you should pass a large timeout instead of NULL if you want to be interruptable.

FUTEX_WAIT and FUTEX_WAKE can both be ored with FUTEX_PRIVATE_FLAG to get an intra-process futex.

FreeBSD

FreeBSD also has a relatively expansive interface, though much of it comprises trivial kernel-space implementations of pthreads synchronisation primitives.

#include <sys/umtx.h>
_umtx_op(uint64_t *addr, UMTX_OP_WAIT,      uint64_t value, (void*)sizeof(struct timespec), struct timespec *timeout)
_umtx_op(uint32_t *addr, UMTX_OP_WAIT_UINT, uint32_t value, (void*)sizeof(struct timespec), struct timespec *timeout)
_umtx_op(void *addr, UMTX_OP_WAKE, int howmany, void *dummy0, void *dummy1)

The timeout is relative. You can pass a size of 0 and a NULL timeout pointer to wait without timing out.

Use UMTX_OP_WAIT_UINT_PRIVATE and UMTX_OP_WAKE_PRIVATE for an intra-process futex; there are no 64-bit intra-process futexes.

On success, 0 is returned; on error, -1 is returned and errno is set.

Windows

Windows has a relatively spare interface, though it also has the dubious honour of being the only OS to implement 8- and 16-bit futexes (these can trivially be implemented in userspace).

#define INFINITE 0xffffffffu
WaitOnAddress(void *addr, const void *value, size_t size, uint32_t timeout_ms)
WakeByAddressSingle(void *addr)
WakeByAddressAll(void *addr)

size must be one of 1, 2, 4, 8, indicating the size in bytes of the value pointed to by addr; hence, value is a pointer to the value to be compared to.

Passing a timeout of INFINITE will wait without timing out.

Waiting returns true on success, and false on error; if the timeout elapsed, false will be returned, and GetLastError() will return ERROR_TIMEOUT.

Futexes are always intra-process on Windows.

OpenBSD

OpenBSD also has a fairly spare interface, though it at least includes inter-process futexes.

#include <sys/futex.h>
futex(uint32_t *addr, FUTEX_WAIT, uint32_t value, struct timespec *timeout, void *dummy)
futex(uint32_t *addr, FUTEX_WAKE, int howmany, void *dummy0, void *dummy1)

The timeout is relative. You can pass a NULL pointer to wait without timing out.

Use FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE for an intra-process futex.

FUTEX_WAKE returns the number of woken threads; FUTEX_WAIT returns 0 on success, and -1 on error, setting errno.