Faster, denser, stronger, hopefully not bigger: uniform floats

2023-06-24

(As before, even though this desperately wants diagrams, I have not drawn them. Please accept my sincere apologies.)

Floating-point numbers are, by design, not evenly spaced. This makes the problem of generating a uniform floating-point number in some range somewhat non-obvious. The most common case, and the one I’ll treat here, is that of generating a number in the range [0 1). There are two classic approaches: generate a fixed number of random bits (53, say, for a double) and scale them into the range, giving evenly spaced results; or generate an arbitrarily precise uniform real number and round it to a representable float.
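
The first of these is usually written as a one-liner along the lines of the following (a sketch of the standard idiom rather than code from this post; the names are mine, and rng_next() stands in for any uniform 64-bit generator):

#include <stdint.h>

extern uint64_t rng_next(void);

double uniform_sparse(void) {
	// take the top 53 bits and scale them down; exact, evenly spaced, never 1
	return (rng_next() >> 11) * 0x1p-53;
}

This only ever produces multiples of 2^-53, so most of the representable floats near zero can never come out of it; the second approach is the one that needs defining carefully.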

How is the second strategy defined? Usually with a bit of hand-waving (and this obscures the issue!). I present the following definition: we generate a 0.inf-bit fixedpoint number (no integer bits, infinitely many fraction bits) and round it to floating point. The rounding should be floor/chop; ceiling/away corresponds to a (0 1] range (which might be what you want!), and nearest is, uhhhhhh, just don’t go there.
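
To pin that definition down, here is a deliberately slow reference version of it (my sketch, not code from the post: rng_next() again stands in for any 64-bit generator, __builtin_clzll is the GCC/Clang leading-zero count, and results tiny enough to be subnormal, which happens with probability on the order of 2^-970, get rounded rather than floored):

#include <math.h>
#include <stdint.h>

extern uint64_t rng_next(void);

double dense_uniform_reference(void) {
	int zeros = 0;				// leading zero bits consumed so far
	uint64_t block = rng_next();
	while (block == 0) {			// skip all-zero 64-bit blocks
		zeros += 64;
		if (zeros > 1074)		// already below the smallest subnormal
			return 0.0;
		block = rng_next();
	}
	int lz = __builtin_clzll(block);	// leading zeros within this block
	zeros += lz;
	uint64_t sig = block << lz;		// leading 1 now at bit 63
	if (lz > 11)				// fewer than 53 significant bits known,
		sig |= rng_next() >> (64 - lz);	// so borrow the missing low bits
	sig >>= 11;				// keep exactly 53 bits; floor the rest away
	return ldexp((double)sig, -53 - zeros);	// exact whenever the result is normal
}

Generating 64 bits at a time, and stopping as soon as the leading 1 and 52 further bits are in hand, is exactly the observation the fast path below exploits.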

I recommend reading the following for background:

Marc and Allen provide reasonable slow paths for this sort of generator, so I will punt to them; what I want to provide is a very fast fast path (with 1/4096 probability of punting to the slow path). Marc’s final implementation comes close to mine, but falls slightly short. We can do better!

Observe that, in generating our infinity-bit fixedpoint number, we can stop as soon as we have 53 significant bits, since truncating rounding will ignore all less significant bits. Most RNGs produce 64 bits at a time, so we can start with a 64-bit uniform value, check that it has at least 53 significant bits; if it does, convert to a float, and finally rescale its range appropriately. The following snippet obtains:

;; out: xmm0, a uniform float in [0 1)
;; scratch: rax
;; pseudo-op: RAND, places a uniform 64-bit integer in its destination
RAND		rax			; get entropy
vcvtusi2sd	xmm0, xmm0, rax, {rd-sae}	; convert.  AVX-512 makes everything better!
vmulsd		xmm0, xmm0, 0x1p-64	; scale.  Always exact; could also subtract from the exponent field1
lzcnt		rax, rax		; range check
cmp		eax, 11			; if we can hoist a constant load, then this could just be a compare-and-branch2
ja		.slow
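
If you would rather see that fast path spelled with intrinsics, it comes out roughly as follows (my sketch, not the post’s: it assumes AVX-512F, rng_next() is the usual stand-in, and uniform_slow_path() is a placeholder for the punt):

#include <immintrin.h>
#include <stdint.h>

extern uint64_t rng_next(void);
extern double uniform_slow_path(uint64_t seen);	// hypothetical

double uniform_fast(void) {
	uint64_t r = rng_next();
	// convert with an explicit round-toward-negative-infinity override
	__m128d v = _mm_cvt_roundu64_sd(_mm_setzero_pd(), r,
	                                _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
	double ret = _mm_cvtsd_f64(v) * 0x1p-64;	// exact rescale
	if (_lzcnt_u64(r) <= 11)			// at least 53 significant bits?
		return ret;
	return uniform_slow_path(r);
}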

Not too shabby. Now, about that AVX-512 requirement...

Most sensible architectures do include functionality to convert an unsigned 64-bit integer to a float, so that part poses no problem; pre-AVX512 x86 does not, but a fairly straightforward 63-bit adaptation is possible, so long as you’re willing to resign yourself to a massive 1/2048 probability of taking the slow path. More insidious is the lack of static rounding control. The default rounding mode is generally nearest, ties to even. With that rounding mode, a naive adaptation of the above routine would result in a completely messed up distribution. There would be a small chance of producing a result of 1, and even results would be slightly likelier than odd ones (since, for example, 1.5 and 2.5 both round to 2).3 We can avoid the ties easily by unconditionally setting the low bit (this loses another bit; we’re down to 1/1024 slow-path on classic x86, for those keeping count, and 1/2048 on other architectures). The ‘nearest’ part is harder to cope with.
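
Concretely, the 63-bit adaptation with the forced low bit boils down to a fragment like this (mine, not the post’s; the range checks and the slow path are omitted, since the point is only how the signed convert gets pressed into service):

#include <stdint.h>

extern uint64_t rng_next(void);

double convert_without_unsigned(void) {
	// keep 63 bits so the value is non-negative as an int64_t,
	// and force the low bit so nearest-even never sees a tie
	uint64_t r = (rng_next() >> 1) | 1;
	return (double)(int64_t)r * 0x1p-63;	// range checks omitted
}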

Notionally, we can convert a nearest rounding into a flooring one by composing it with a subtraction of 0.5ulp. There are two problems with this. The first is that we don’t a priori know what an ulp is; our goal was to make the hardware figure that out for us. The second is that 0.5ulp is sometimes the very bit that we hardwired to 1 to avoid ties, and subtracting that away would be bad, so for this strategy to work, we would have to steal another bit, and 1/512 on x86 is looking like a very thin margin (especially if vectorised).

Let’s take a step back. The original snippet worked because a single uniform 64-bit integer provides enough information to give a result 4095/4096 of the time, and simply rounding is enough to get a result, with the correct distribution. We rejected the 1/4096 of cases where we don’t have enough information to give a result right away, punting to a slow path. When we replace flooring with nearest rounding, the most obvious thing that goes wrong is that we can sometimes end up with a result of 1. What if we rejected those cases too, punting to a slow path for them as well? Would the distribution for the remaining numbers be correct?

Intuitively, with flooring rounding, every floating-point number we generated had a 1ulp interval above it, which rounds to it. After switching to nearest rounding, nearly every floating-point number has an interval extending 0.5ulp above and 0.5ulp below, so the distribution should stay very close.4 But there is one exception: boundary conditions. For integer powers of two, the nearest rounding interval extends 0.5ulp above, but only 0.25ulp below (0.5, for example, only attracts [0.5 - 2^-55, 0.5 + 2^-54)), so integer powers of two are less likely to be generated than they should be!

(Exercise for the reader: why does it work this way? We shrank the rounding interval for powers of two, but left everything else the same—where did those probabilities go? Solve it before continuing if you solve it at all!)

But is this actually a problem? Not at all. The answer to the above exercise: those probabilities were the casualties when we shrank the probability of taking the fast path. (Concretely, the 0.25ulp deficits at all the powers of two sum to 2^-54, which is exactly the probability mass that now rounds to 1 and gets punted.) All this means is that those powers of two need to be appropriately represented in the slow path; we don’t have to deal with them in the fast path. Furthermore, if we shave a little extra off the range, we obtain a routine that works in any rounding mode, with no changes to the fast path; we do have to query the rounding mode in the slow path. Hence, the following snippet obtains (this time in AArgh64, for good measure):

;; out: d0, a uniform float in [0 1)
;; scratch: x0, x1, d1
RAND		x0		; get entropy
orr		x0, x0, 1	; ensure odd
ucvtf		d0, x0		; convert to float.  Provisional result (we might still punt), but schedule it early
mov		x1, (-64)<<52	; -64 in the exponent field
fmov		d1, x1
add		d0, d0, d1	; twiddle exponent.  Equivalent to multiply by 0x1p-64, but classier
cmp		x0, (-1)<<11	; bigger than nextbefore(1)?
b.hi		.slow1		; if so, then we could round to 1 in some modes, so slow path
clz		x0, x0		; now check if too small
cmp		x0, 10		; only get 10 bits of buffer up top now, because we stole the low bit
b.hi		.slow2

And here it is in C:

#include <stdint.h>

extern uint64_t rng_next(void);			// any uniform 64-bit generator, as before
#define clz(x) ((int)__builtin_clzll(x))	// count leading zeros (GCC/Clang)

double uniform_random(void) {
	uint64_t initial = 1 | rng_next();
	double ret = initial * 0x1p-64;
	if (clz(initial) <= 10
	    && initial <= ((uint64_t)-1) << 11) {
		return ret;
	}
	// slow path ...
}

Please don’t use any of this code. It is cute and clever, but utterly useless. And the slow path is not nearly so cute (which is why I didn’t write it).

Notes

  1. (back)
    Taken to be a memory operand for brevity; of course, you could hoist this load.

  2. (back)
Incidentally, an alternate formulation of the same range check that skips the lzcnt and doesn't need a constant load hoisted is just: shr rax, 52; jz .slow. But that won’t fuse.

  3. (back)
    With a true infinite-precision fixedpoint, there is no ‘tie’; you continue generating bits until you get a tiebreaker, as the probability of having a true tie is 0. Obviously, that is inconsequential here. If it were possible to cheaply detect a tie, that might be preferable, since it’s off the critical path, but I cannot see how to do it.

  4. (back)
Since we’ve forced ties to resolve upward, these really do sum to ‘1ulp’ when discretised.