Shaving cycles: faster bucketing

2023-01-22

Paul Khuong describes a method for performing what he calls ‘linear-log bucketing’. I’ll give a brief treatment in context of memory allocation, since that’s the context where I encountered it, and then present a marginally faster procedure for performing it.

A common strategy for malloc(3)-style allocators is to distinguish a number of sizes and devote a region of memory to allocations of each size. When you ask to allocate n bytes of memory, the allocator rounds it up to n′, the next supported size, and further computes i, a serial number associated with n′.

Which sizes should be supported? One easy answer is to include one for every power of two; the following therefore straightforwardly obtain:

i := ⌈log2n
n′ := 2i

But this wastes space; 25% of it, on average.

Linear-log bucketing is an alternative that begins by splitting up according to powers of two, but then further divides up every power of two into k (linearly-spaced) buckets. (In this case, I’ll consider k=4, but any power of two will work.)

Another way to think about this is that our i becomes a floating-point number with log2k significant bits of significand. This satisfies our original requirement that i be a serial number (in other words, larger values of i correspond to larger values of n′; or, i is a monotonic function of n).

But do we really need this property? In an allocator, i will probably simply be used to index an array of memory regions. We would like to avoid having holes in this array, so i should be dense over the range of n for which allocations are served in this way. (It might also be nice to efficiently map from i to n′, since the former can generally be stored in fewer bits.) Aside from that, it’s not helpful to require that i have any particular properties. (It needn’t even be near zero, as any sensible ISA will let us offset cheaply when we dereference.)

This is significant because a non-monotonic procedure turns out to be slightly cheaper to implement on many CPUs. Many CPUs have a ‘count leading zeroes’ instruction, which is monotonically decreasing in the magnitude of its argument. We therefore obtain the following procedure, which produces a somewhat zig-zaggy i:

; input:  edi as n, must be <232 and >0
; output: eax as i
lzcnt	ecx, edi		; compute the ‘exponent’ part of the result
shlx	esi, edi, ecx		; left-align n in its word; now the MSB is always 1
add	rsi, 0x1fffffff		; round it up; now the low 29 bits are meaningless, and the next two are the ones we care about
				; 64-bit operand is required to account for carry-out
shr	rsi, 29			; shift off the extraneous bits
				; (on a different ISA, this might be a 32-bit add followed by a 32-bit shift with carry)
lea	rax, [4*rcx + rsi - 4]	; compute result.  -4 is there purely for cosmetic reasons; notionally, it wipes out the (usually set) third bit of rsi, but really that just changes how much you have to offset by when indexing later

This shaves a single instruction (neg ecx) off the critical path. Worth it? You decide.

The subtlety with the carry is worth pointing out here. If our input is 1.111, that should round to 10.0; or, rather, should add one to the exponent, and round to 1.00. You can verify that this is in fact what happens.

The approach to optimisation used here is quite general, reminiscent of my faster hash table probing or Lemire’s fast alternative to the modulo reduction: instead of making a faster implementation of the same function, make a faster implementation of a different function, which satisfies the same requirements at a high level. This is an argument in favour of abstraction: when I provide you with a very high-level, abstract interface, I have a lot of freedom in implementing it. If I spill my guts to you, those become part of my interface, and tie me down forever to implementation of a potentially pessimal algorithm. We can see this in languages like C: while it is possible, given care, to produce performant software in C, the language does not encourage it, and performance is fragile because compilers have limited capacity to optimise.