Cute trick for fetch-and-add-based queues

2024-04-16

In a prototypical concurrent queue based on fetch-and-add (FAA), two concurrent enqueue operations will contend on two cache lines. First, they will contend on the queue head, and mediation will be required to determine who bumped it forward first. But, after claiming a spot in which to enqueue, you also have to actually install your message there. Two concurrent enqueues will claim adjacent spots, which will likely be on the same cache line, inducing spurious contention, except for the rare case where the two spots happen to straddle a cache line boundary.

(I’m going to assume a fairly typical configuration—64-byte cache lines, 8-byte queue elements—but the concepts are completely general; similarly, everything I say about producers applies equally well to consumers.)

As usual, we can avoid false sharing by padding, but that wastes quite a lot of space. Really, what we’d like is this: let the first enqueuer put its message in the first word of the first cache line, and the second enqueuer put its message in the first word of the second cache line; then, in a while, once the first enqueuer is probably no longer interested in the first cache line, we can come back to it and fill in its second word with something useful. When is ‘a while’? As late as possible: once we’ve run out of cache lines whose first words we haven’t yet filled in.

That is, where a normal FAA-based queue traverses its backing array in this order:
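(For concreteness, take 4 lines of 8 words each; each number is the position in the enqueue order at which that slot is used:)

    line 0:  0  1  2  3  4  5  6  7
    line 1:  8  9 10 11 12 13 14 15
    line 2: 16 17 18 19 20 21 22 23
    line 3: 24 25 26 27 28 29 30 31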

We want to go in this order instead:
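(Same 4×8 layout, but claimed column-first:)

    line 0:  0  4  8 12 16 20 24 28
    line 1:  1  5  9 13 17 21 25 29
    line 2:  2  6 10 14 18 22 26 30
    line 3:  3  7 11 15 19 23 27 31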

But this is just a transpose! We’re projecting a two-dimensional n×8 structure onto our backing array, and traversing it column by column rather than row by row. This virtualisation of storage is very easy to implement: if n is a power of two, which is not hard to arrange, it is just a matter of shuffling bits around before indexing. And the rest of the queue algorithm should be fairly oblivious to the physical order of the queue elements, so this should be easy to slot into any existing queue.
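Here is a minimal sketch, in C, of what that index shuffle might look like. It assumes 8-byte elements, 64-byte lines (so 8 elements per line), and a backing array of n lines with n a power of two; the function and parameter names are made up for illustration, and the logical index is assumed to have already been reduced modulo the queue capacity.

    #include <stddef.h>

    /* Map a logical slot number (as handed out by the fetch-and-add
     * counter) to a physical position in the backing array.  Assumes
     * 8 elements per cache line and n_lines a power of two, so the
     * divide and modulo below are just a shift and a mask. */
    static inline size_t physical_index(size_t logical, size_t n_lines)
    {
        size_t line = logical & (n_lines - 1); /* walk the lines round-robin   */
        size_t word = logical / n_lines;       /* ...then move to the next word */
        return line * 8 + word;
    }

Consecutive logical indices then land on consecutive cache lines, and we only return to a line after visiting all of the others.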

A few more notes.

I’ve described this in terms of specific offsets within specific cache lines, assuming that the backing storage is aligned, but the technique is actually completely alignment-oblivious, and so works without modification even in cases where, for example, a garbage collector might change the alignment of the storage at runtime.

On some CPUs with an adjacent line prefetcher, it might be appropriate to think of the queue storage as having a three-dimensional structure: n×2×8. In particular, suppose concurrent accesses to different lines in the same line-pair perform somewhat worse than concurrent accesses to different line-pairs, but better than concurrent accesses to the same line. Then, by choosing the right order of traversal with respect to this three-dimensional structure, we can keep the same distance between accesses to the same line, but also space out accesses to the same line-pair somewhat. Whereas, if we just increased the logical line size to 128 bytes, we would separate accesses to the same line-pair, but in return we would halve the time between accesses to the same line. (And, of course, if we kept the logical line size at 64 bytes, then every other access would touch the same line-pair as the previous.) This, too, works completely independently of the alignment of the queue storage.
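Continuing the sketch above (with the same caveats about assumed names), the three-dimensional version just adds one more field to the shuffle: pick a line pair, then a line within the pair, then a word within the line. Accesses to the same line stay 2n apart, and accesses to the same line pair become n apart rather than landing on consecutive operations.

    /* Three-dimensional variant: n_pairs line pairs × 2 lines × 8 words.
     * Again a sketch with assumed names; n_pairs must be a power of two.
     * Accesses to the same 64-byte line are 2*n_pairs slots apart, and
     * accesses to the same 128-byte pair are n_pairs slots apart. */
    static inline size_t physical_index_pairs(size_t logical, size_t n_pairs)
    {
        size_t pair = logical & (n_pairs - 1);   /* which 128-byte line pair   */
        size_t half = (logical / n_pairs) & 1;   /* which line within the pair */
        size_t word = logical / (2 * n_pairs);   /* which word within the line */
        return pair * 16 + half * 8 + word;
    }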

The way I’ve presented this technique, we have to know the cache line size, but it’s worth noting that even if we slightly over- or under-estimate it, it is still helpful. We can take this a step further and devise a cache-oblivious traversal order, which is optimal for every cache line size (including nested structures like the above case of the adjacent line prefetcher). What is that order? Just reverse the order of the bits in the index! This isn’t particularly useful, but it is fun.
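For completeness, the bit-reversal order might look like the following (again a sketch; the loop is the obvious way to write it, not the fastest). It assumes a capacity of 2^bits elements.

    /* Cache-oblivious variant: the physical index is the logical index
     * with its low `bits` bits reversed.  Assumes a capacity of 1<<bits. */
    static inline size_t physical_index_reversed(size_t logical, unsigned bits)
    {
        size_t phys = 0;
        for (unsigned i = 0; i < bits; i++)
            phys = (phys << 1) | ((logical >> i) & 1);
        return phys;
    }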

Obviously, this doesn’t solve the fundamental scaling problems of FAA-based queues. But it is a very cheap improvement, complexity-wise. I might write about how to scale in-order queues in the future, and there are well-known techniques for scaling out-of-order queues (whose guarantees suffice for many applications!).