One popular family of strategies for representing pointers0 in dynamically-typed languages is what I will refer to loosely as low-bits tagging schemes. In these schemes, a pointer is contained by a word, the low bits of which are metadata describing the upper bits. (If you’re unfamiliar, Max Bernstein exposits a representative such scheme here.)
One advantage of these schemes is that they admit efficient access to the object denoted by the pointer. Many basic operations can be performed directly on the tagged representation, rather than needing to separate data from metadata before operating on the former. This applies at least when the data are integers or memory addresses. What about floating-point numbers?
Double-precision floating-point numbers take up a whole word on 64-bit architectures; that is no good.1 But single-precision floating-point numbers take up only half a word, which is quite manageable. The standard approach is to store the float in the high 32 bits of the pointer, with the low 32 bits as a tag.2 Unlike with integers and memory addresses, it is not generally possible to operate on these directly, without untagging them first. Hence, suppose we want to compile a function which adds its two arguments, and requires that they be floating-point numbers; we might end up with assembly like the following (error-checking elided for brevity):
```asm
shr   rdi, 32         ; shift off the tag bits of the first addend
shr   rsi, 32         ; shift off the tag bits of the second addend
movd  xmm0, edi       ; move the first addend from a general-purpose register into a floating-point register
movd  xmm1, esi       ; ditto
addss xmm0, xmm1      ; perform the actual addition
movd  eax, xmm0       ; move the sum back into a general-purpose register
shl   rax, 32         ; shift the sum into the high bits of the general-purpose register
or    rax, SFLOAT_TAG ; tag the sum
```
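The same untag-operate-retag dance can be modeled in C (a sketch; the tag value `SFLOAT_TAG` and the helper names here are illustrative assumptions, not part of any particular runtime):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative tag value; any pattern that identifies single floats works. */
#define SFLOAT_TAG 0x00000003u

/* Reinterpret between a float and its 32-bit pattern without aliasing UB. */
static uint32_t float_bits(float f)    { uint32_t u; memcpy(&u, &f, 4); return u; }
static float    bits_float(uint32_t u) { float f; memcpy(&f, &u, 4); return f; }

/* Tagged representation: float in the high 32 bits, metadata in the low 32. */
static uint64_t tag_float(float f)      { return ((uint64_t)float_bits(f) << 32) | SFLOAT_TAG; }
static float    untag_float(uint64_t p) { return bits_float((uint32_t)(p >> 32)); }

/* The addition in the assembly above: untag both, add, retag the sum. */
static uint64_t tagged_add(uint64_t a, uint64_t b) {
    return tag_float(untag_float(a) + untag_float(b));
}
```

Every arithmetic operation pays for the shifts and the re-tagging `or`, which is exactly the faffery discussed below.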
This involves a lot of pointless faffery only for the sake of the tags. My insight is that we can elide much of this faffery by taking advantage of a couple of qualities near-ubiquitous on modern processors and architectures. Generally, the floating-point registers coincide with the vector registers; furthermore, vector operations may have identical or near-identical performance to scalar operations, especially on floating-point numbers. Also, any floating-point operation performed on a NaN will give back the same NaN. Therefore, if we make the bit pattern of the low 32 bits of a tagged single-precision float coincide with a single-precision NaN, we can perform a vector floating-point operation between any two tagged floating-point numbers, treating each as a 2-vector of floats. In the result, the high element of the vector (that is, the high 32 bits) will be our desired result, while the low element (that is, the low 32 bits) will be a NaN, carried through from the inputs, and hence still a valid tag. The upshot is that we no longer need to explicitly tag or untag any float value. Here’s addition again, with this new scheme:
```asm
movq  xmm0, rdi  ; move our first addend into a floating-point vector register
                 ; now the low element of the vector is the low 32 bits of our tagged input, which happen to be a NaN
                 ; and the following element is the high 32 bits of our tagged input, which are the number it denotes
                 ; (the register has room for two more elements; we don’t care about their values,
                 ;  but they happen to have been zeroed)
movq  xmm1, rsi  ; ditto the second addend
addps xmm0, xmm1 ; perform a vector addition
                 ; the low element of the result is xmm0[0] + xmm1[0] = NaN + NaN = NaN
                 ; the next element is xmm0[1] + xmm1[1], which is the sum we’re interested in
                 ; (the next two elements, again, we don’t care about; they happen to be zeroes)
movq  rax, xmm0  ; copy the result back out into a general-purpose register
                 ; the low 32 bits are the first element of the vector, which are a NaN, which is our tag for a single float
                 ; the next 32 bits are the second element of the vector, which is the sum we’re interested in
```
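The lane-wise effect of the addps can be simulated in portable C (a sketch; `NAN_TAG` is an assumed quiet-NaN bit pattern, and because exact NaN payload propagation is architecture-dependent, the check below only confirms that the low lane remains a NaN rather than that it equals the tag bit-for-bit):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* An assumed quiet-NaN pattern, used as the tag for single floats. */
#define NAN_TAG 0x7FC00001u

static uint32_t fbits(float f)  { uint32_t u; memcpy(&u, &f, 4); return u; }
static float    bitsf(uint32_t u) { float f; memcpy(&f, &u, 4); return f; }

static uint64_t nan_tag_float(float f) { return ((uint64_t)fbits(f) << 32) | NAN_TAG; }

/* Simulate the `addps` across the low two lanes of each tagged word. */
static uint64_t nan_tagged_add(uint64_t a, uint64_t b) {
    float lo = bitsf((uint32_t)a) + bitsf((uint32_t)b);               /* NaN + NaN = NaN */
    float hi = bitsf((uint32_t)(a >> 32)) + bitsf((uint32_t)(b >> 32)); /* the real sum */
    return ((uint64_t)fbits(hi) << 32) | fbits(lo);
}
```

No shifts, no `or`: the tag lane simply rides along through the arithmetic.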
It’s hard to imagine a more performant way of dealing with tagged floats!
How to compare numbers represented thus? On most architectures, scalar floating-point comparisons apply only to the low element of a given vector, whereas our numbers are in the second element of their vector. Not to worry: just interpret the low two elements of the vector as a scalar double-precision float and compare. For instance, the following x86 function snippet adds its first two arguments and compares the sum to the third argument:
```asm
movq   xmm0, rdi
movq   xmm1, rsi
movq   xmm2, rdx
addps  xmm0, xmm1
comisd xmm0, xmm2
```
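The reinterpret-as-double comparison can be checked in C (a sketch under the tagging scheme above; `NAN_TAG` is an assumed quiet-NaN tag, and the checks deliberately avoid NaNs and zeroes of opposite sign, per the caveats below):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* An assumed quiet-NaN pattern, used as the tag for single floats. */
#define NAN_TAG 0x7FC00001u

static uint64_t nan_tag_float(float f) {
    uint32_t u; memcpy(&u, &f, 4);
    return ((uint64_t)u << 32) | NAN_TAG;
}

/* Compare two tagged floats by reinterpreting each whole word as a
 * double, mirroring the comisd above. Returns -1, 0, or 1. */
static int tagged_cmp(uint64_t a, uint64_t b) {
    double da, db;
    memcpy(&da, &a, 8);
    memcpy(&db, &b, 8);
    return (da > db) - (da < db);
}
```

This works because, for finite floats sharing the same low-word tag, the 64-bit patterns order the same way the 32-bit float patterns do; the float’s sign and exponent land in the double’s sign and exponent fields monotonically.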
I think this is—not the smartest, but certainly the cleverest snippet of code I have ever written.
This does not handle NaNs, or comparisons between zeroes of opposite sign, but is otherwise sound (and comparisons including infinities are consistent).
(Other tricks are possible; for example, perform a vector subtract and test the sign bit; or, if one of the comparands is a product, transform it into an fmsub and test the sign of the result (a vtestpd of the register with itself). But neither of these is faster than the approach given above, and both have more edge cases. If strictly correct comparison semantics are required, and type inference does not rule out the undesirable cases, the fastest and best option is simply to shuffle the vectors in question and perform a scalar single-precision compare; or, do a vector compare and branch on the result. In any case, adoption of my proposed tagging scheme should never harm performance, for it affords an implementation strictly more flexibility in code generation than the alternative.)
What NaN to choose for the tag? It should probably be a quiet NaN; those are denoted by the highest mantissa bit, so you still get all the low bits to play with. Beyond that, any considerations will likely be architecture-specific; I will offer a few for x86:
On x86, there are, I think, basically two interesting classes of NaNs.
The first is the positive NaNs. These are advantageous compared with negative NaNs because they allow for a slightly faster tagging sequence, given an untagged single float. Something like the following:
```asm
movd eax, xmm0
shl  rax, 32
or   rax, SFLOAT_TAG
```
This sequence only works if SFLOAT_TAG is a positive float. The `or` instruction sign-extends its 32-bit immediate to a full 64 bits, and a negative float, interpreted as a 32-bit integer, is also a negative integer; sign extension would fill the high 32 bits with ones and wipe out the float stored there.
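The sign-extension hazard is easy to demonstrate in C (a sketch; both tag patterns are illustrative, and the cast through `int32_t` models how x86’s `or r64, imm32` sign-extends its immediate):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative positive and negative quiet-NaN patterns. */
#define POS_TAG 0x7FC00001u
#define NEG_TAG 0xFFC00001u

/* Model of `shl rax, 32` followed by `or rax, TAG`, where the 32-bit
 * immediate is sign-extended to 64 bits, as `or r64, imm32` does. */
static uint64_t tag_with_imm32(uint32_t float_bits, uint32_t tag) {
    uint64_t hi   = (uint64_t)float_bits << 32;
    uint64_t sext = (uint64_t)(int64_t)(int32_t)tag; /* the sign extension */
    return hi | sext;
}
```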
However, I suspect that choosing a negative NaN is preferable. To see why, consider the type-checking sequence; suppose we have a tagged pointer in rax, and would like to check whether it is a single float. We might do something like the following:
```asm
cmp eax, SFLOAT_TAG
jne fail
```
If SFLOAT_TAG is negative, and furthermore has most of the high bits of its mantissa set to 1, then it may be representable as a (sign-extended) 8-bit integer. (Whereas, a positive NaN will never be representable as a sign-extended 8-bit integer.) x86 has a compact encoding for 8-bit immediates; the code size improvements from using the smaller immediate format are likely substantial. In particular, type checks should outnumber tagging of untagged floats by a significant margin, since there is just not much reason to work with untagged floats under this scheme.
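Whether a given tag fits x86’s compact sign-extended 8-bit immediate form is a quick check in C (a sketch; the candidate patterns below are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit immediate fits x86's compact imm8 encoding iff it round-trips
 * through a sign-extended 8-bit integer. */
static int fits_imm8(uint32_t imm) {
    return (uint32_t)(int32_t)(int8_t)(uint8_t)imm == imm;
}
```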