(At least, I thought it was pretty cute.)
2023-01-27Here, I’ll propose an interesting way to remove overhead from certain concurrent data structures on CPUs with hardware transactional memory (HTM). I doubt it’s original, but I haven’t seen it discussed anywhere (and would appreciate any references to prior art). I also don’t have any benchmarks (yet); I’ll discuss this more later, but I think it’s interesting regardless.
I will also assume x86, as that is the only mainstream architecture I know of that includes transactions.
Well—we have HTM; what do we need to bother with locks for? Just stuff everything in a transaction!
If only it were that easy. Transactions are not free; in particular, they impose ordering constraints that threaten (intra-thread, instruction-level) parallelism, and they also require extra bookkeeping to track dirtied lines. In a silly microbenchmark on my Skylake-X CPU, I found that a transaction on an uncontended cache line takes approximately 5x the time of an uncontended atomic, and even uncontended atomics are far from free. Not at all terrible—if it helps you scale!—but, at the same time, not something to be undertaken lightly.
The title gives it away; what I am after is a sort of biased lock. A structure is accessed frequently by one thread, and infrequently by other threads; the main thread should be able to access it as quickly as possible, but off-threads may eat additional overhead. As a case study, let’s consider a concurrent bitmap. We should be able to atomically perform simple updates to individual bits; we may or may not care about inter-bit consistency, causality, or ABA (etc.—although it so happens that we provide full transactional atomicity, just like a regular lock), but it’s vitally important that no operations be lost.
This could be implemented in an unbiased fashion, if all threads agreed to use transactions or atomic, strongly-ordered operations,0 but, again, we would like to avoid synchronisation overhead for the main thread.
What if the main thread performed no synchronisation, and off-threads acted atomically? That obviously breaks; consider the following example, in which our initial state is 00, the main thread would like to set the left bit, and an off-thread would like to set the right bit:
Main thread | Off-thread |
---|---|
Read 00 | |
Atomically RMW 00→01 | |
Update 00→10 | |
Store 01→10 |
The write from the off-thread was lost.
In fact, the only case when the above protocol breaks is when an off-thread performs an update in between the time when the main thread reads and when it writes its updated value (like in the example). In any other case, the operations of the main thread will appear atomic to off-threads. And interactions between off-threads will of course be fine, as off-threads always synchronise. Therefore, it suffices to ensure that the off-threads never attempt to modify the structure while the main thread is in this critical section; we can do this by making the main thread take out a lock. A cheap lock, that is!—it should be cheaper than a synchronised instruction.
It would seem that the lock can be implemented very straightforwardly, since it can only ever be held by one thread. We’ll associate a single lock word with the structure; when it’s 1, that means the structure is free to be modified by any thread (in other words: the lock is free); when it’s 0, that means the main thread is currently modifying the structure (in other words: the lock is held). To operate on the bitmap from the main thread, we’ll say:
write 0 to the lock word to acquire it load the appropriate byte of the bitmap twiddle the appropriate bit of the byte write the byte back to the bitmap write 1 to the lock word to release it
This should be nice and fast, as we impose no strong synchronisation requirements. To operate on the bitmap from an off-thread, we’ll say:
START transaction ensure that the lock word is 1 (on failure, do...whatever it is threads do when they encounter contention) load the appropriate byte twiddle the appropriate bit write the byte back END transaction
So does this scheme work? Unfortunately, no. The x86 memory model does enforce a total order on writes, but it allows reads to be reordered before non-aliasing writes that precede them in program order. This means that the CPU can reorder our main-thread sequence as follows:
load the appropriate byte of the bitmap write 0 to the lock word to acquire it ...
This should not be an unexpected problem to have. After all, the entire reason for using this alternate locking strategy is that it allows the CPU more freedom to reorder! So, we need some way to ensure that our bitmap-twiddling operations can be contained entirely by the lock and unlock, while still allowing other operations to be reordered.
Because writes have a total order, the final write to the bitmap will always happen before the write that releases the lock, and after the write that acquires it; so we don’t have to worry about that. We only have to ensure that our initial read is sequenced after the write that acquires the lock. The way to accomplish that is with a data dependency. (Look up ‘consume memory order’ for more information on why this is interesting and why it works.) We obtain the following sequence, for the main thread:
write 0 to the lock word to acquire it dummy ← load lock word (this is always 0) addr′ ← addr + dummy (this is always the same as addr) load a byte from addr′ twiddle the appropriate bit in the byte write back to addr′ write 1 to the lock word to release it
Here, the load has a data dependency on the computation of addr′ (and therefore happens after it); the computation of addr′ has a data dependency on the load of dummy (and therefore happens after it); and the load of dummy aliases the store to the lock word (and therefore happens after it); so the read must happen after the lock acquisition. Therefore, the revised sequence does satisfy the needed invariants.
Here are those sequences again as (untested) x86 function snippets:
; in: rdi: a pointer to a bitmap. rdi points to the start of the bitmap; rdi-64 points to the lock word ; in: rsi: an index of a bit to be tested and set ; out: carry flag: 1 iff the bit was previously 1 ; in other words, both subroutines are equivalent to bts [rdi], rsi bts_mainthread: mov dword [rdi - 64], 0 ; acquire lock mov edx, dword [rdi - 64] ; create data dependency on the lock bts [rdi + rdx], rsi ; test and set mov dword [rdi - 64], 1 ; release lock ret bts_offthread: xbegin .fail ; start transaction test dword [rdi - 64], 1 ; ensure lock not held jz .abort ; bail if it is bts [rdi], rsi ; test and set xend ; end transaction ret .abort: xabort .fail: ...
(Note: see erratum.)
I have intentionally avoided providing any hard performance numbers here. The easiest thing to do would be to perform some dumb microbenchmarks, but I think this would be intellectually dishonest. The performance characteristics of modern hardware are very squishy. It would be very easy to construct a benchmark in which the overhead required by this approach, small though it be, overhshadowed the advantages. And vice versa. But neither of these would necessarily be representative of any real-world workload. Furthermore, it’s not even clear what the best thing is to compare this to; myriad and diverse approaches are possible, but the details always depend on the specifics of the problem at hand. The advantage here might be not even performance, but simplicity, as another approach might have similar (or even slightly better) performance, but require absurd contortions or entail bizarre states (as is typical of hardcore concurrent algorithms).
It’s my belief that use of this method will be advantageous for some real-world problems and workloads. I plan to test this out on one application, and may report back numbers here once I have done. But developing real applications takes time. Until then, I think reporting microbenchmark results would be misleading and cause more harm than good.
As I alluded to (and handwaved somewhat), the off-threads are not wait-free and need a fallback strategy. Furthermore, transactions are not guaranteed to make progress! The fallback strategy cannot simply be to retry indefinitely.
The standard recommended fallback strategy for HTM is to use an ordinary mutex; that doesn’t work in this case, because the main thread uses neither a normal mutex nor a transaction. IPI tricks might be feasible, since there is only one main thread,1 but that is very complicated and invasive, for such a simple thing.
Really, there is no reason why such a small and simple transaction should not (globally) make progress eventually. For more complex transactions, it makes complete sense that you might not always be able to make progress, and require a locking fallback path. But I would like to see CPU vendors delineate a class of simple transactions which are required to always make (global) progress.2 There are undoubtedly complexities to be dealt with here. (The general problem is at least as hard as lock graph management, which has not been solved. To say nothing of the finicky cache management problems.) Nevertheless, I believe they can be dealt with, and would appreciate if anyone at least tried.
Hardware transactional memory? Seriously? What are you smoking??
HTM is absent on a lot of popular hardware today, but I believe that it is the future, and we should prepare for it. Decades ago, Cliff Click found that HTM was good for interesting performance gains on Azul’s 768-core boxes; only, taking advantage of it required rewriting previously lock-based code, something Azul’s clients were not wont to do. Today, core counts on mainstream machines are catching up, and single cores aren’t getting much faster; fine-grained synchronisation mechanisms are more important than ever.3 Today’s performance-sensitive users are willing to rewrite their code, if the gains are manifest, which they are.
Even just a few years ago, Intel’s TSX was good for an appreciable performance boost (on much smaller machines), where it was supported in hardware; interest has waned only because their implementation in Haswell was buggy, and they removed it entirely from Icelake. But their upcoming server microarchitecture, Sapphire Rapids, restores support for it. AMD and ARM have both specified (but not yet, to my knowledge, released public implementations of) transactional memory, demonstrating more interest in the feature. We should expect to see a lot more HTM in the next 10-20 years.
Concurrency is hard! As it turns out, the revised code doesn’t work either. Thanks to shachaf for pointing out the problem and helping me work through some potential solutions.
The root of the problem is store forwarding, which is not just an implementation detail, but an explicit part of the x86 memory model. While a load must happen ‘after’ a store which aliases it, it can in fact happen before the store is made visible to other cores (and, in particular, before the cache line containing the store target is claimed exclusively), which completely breaks our invariants.
How to fix it? It’s at this point that the whole issue becomes even more subtle, finicky, and cute (somehow). In practice, we can prevent store forwarding by making a load which partly overlaps the store, but also refers to other data. The other data must be fetched from cache; when this happens, the CPU stalls the load until the store is completed. Hence, the following sequence obtains:
bts_mainthread: mov dword [rdi - 64], 0 mov rdx, qword [rdi - 64] bts [rdi + rdx], rsi mov dword [rdi - 64], 1 ret
Note we perform a 32-bit store and a 64-bit load; the load will stall until the store is completed, maintaining our invariants.
...or will it? Although this behaviour can be observed on every CPU I know of, the Intel SDM is frustratingly vague, and it’s not clear to me what guarantees (if any) are provided. It devotes but a single sentence to store forwarding, saying:
The processor-ordering model described in this section is virtually identical to that used by the Pentium and Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
- Added support for speculative reads, while still adhering to the ordering principles above.
- Store-buffer forwarding, when a read passes a write to the same memory location.
There is also an example (§8.2.3.5), but it is unilluminating, and examples are not generally taken to be normative anyway.
One argument holds that a 64-bit load is notionally a pair of 32-bit loads, executed atomically. Each 32-bit load, independently, can be reordered before the store: one load matches the store exactly, and is therefore eligible for forwarding; the other does not alias the store at all, and so can be freely moved. Having moved each of these 32-bit loads to the other side of the store, the CPU is allowed to recombine them and execute them atomically. I’m not sure if this is sound reasoning, though; can you really take an atomic operation, break it up into its constituents, perform nonatomic reasoning on them separately, and then recombine the pieces into an atom as if nothing happened? I believe the manual is ambiguous, and hope it will be amended in the future.
For completeness’s sake, it is worth sketching at a couple of alternative solutions.
One direction involves something like a linux membarrier (I can’t find a very good reference for these, but this article by Paul Khuong explains them pretty well in passing) to account for the gap between the lock acquisition and read. Since there is only one main thread, this could probably be accomplished more cheaply (and portably!) with a signal than a membarrier.
While this would work, it significantly harms the performance (and scalability!) of off-thread operations, completely changing the cost model of the lock. At this point, an alternate approach would probably be simpler, easier, and faster.
There is another avenue which might be workable. Suppose that partially-overlapping accesses are not an a priori solution to store forwarding. They might still be useful; all hope might not be lost.
The case we are concerned about is the one when a 32-bit store is in the store buffer; a 64-bit load is performed which overlaps it; the low 32 bits are served from the store buffer, and the high 32 bits from cache. Is this valid? The SDM has some verbiage about causality which leads to me believe that it can only be done speculatively. Suppose it turns out that another thread made a store to the high 32 bits of our word, and a third thread observed that; then, the other thread’s store happened before ours, so the load cannot observe a state in which our store has happened but the other has not. Therefore, the best the CPU can do is bet that no one will write anything to the high 32 bits between the time when we forward the low 32 bits and the time when we finally write them out. If it loses the bet, then it has to rewind time locally.
What’s the significance of this? We can force the CPU to lose the bet, at will, by performing writes from the off-threads. The protocol for the main thread looks something like: perform a 32-bit store; perform a 64-bit load; do a quick branch to check to see that no one wrote anything untoward to the high 32 bits. The protocol for an off-thread looks something like: perform a full 64-bit CAS.
I haven’t thought through the details of either of these alternatives (especially the latter). I’m not sure if they’re worthwhile (although the second one might be). However, I’m fairly confident in the first solution (just use mismatched sizes), even though it is not technically clear whether its efficacy is guaranteed by the architecture semantics; I plan to use it, and feel comfortable recommending that you do, too (well—unless there is another bug lurking...).