Matt Kulukundis gives a lovely talk on Google’s hash table.
Of particular interest to us is his probing scheme: on modern hardware, it is possible to take advantage of SIMD to probe *k* buckets at a time, rather than just one, as would be necessary on a traditional scalar architecture.
Empirically, this is quite fast.

There is a problem, though: the *k* buckets that we load into our vector must be contiguous in memory.
The standard scalar strategy with hash tables is to generate a hash index *i* (reduced modulo *n*, the length of the table H), following which we may freely access bucket H[*i*]; *i* will never be out of range, we have made sure of that.
If we do not find what we seek in H[*i*], then set *i* := *i*+1 (reduced, again, modulo *n*) and repeat.
A straightforward vector adaptation of this would have us consider buckets H[*i*] through H[*i*+*k*-1] all at once, and upon failure, set *i* := *i*+*k* (reduced modulo *n*) and repeat.
We have guaranteed that *i*<*n*, but not that *i*+*k*-1<*n*, so if we land near the end of the table, we may overread^{0}.
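As a concrete sketch (Python standing in for the real thing: `EMPTY` marks a never-occupied bucket, and a *k*-element slice stands in for a SIMD load), the scalar loop and its naïve vector adaptation might look like this. Note that the group version indexes past the end of the table whenever *i* lands within *k*-1 buckets of it:

```python
EMPTY = None

def scalar_probe(table, key, h):
    """Classic linear probing: the index is reduced modulo n at every
    step, so it can never go out of range."""
    n = len(table)
    i = h % n
    while True:
        if table[i] == key:
            return i
        if table[i] is EMPTY:
            return None
        i = (i + 1) % n

def group_probe_naive(table, key, h, k):
    """Naive k-at-a-time adaptation: examine buckets i..i+k-1, then jump
    by k. Only i is reduced modulo n, so i+k-1 may exceed n-1; here that
    raises IndexError, but a real SIMD load would silently overread."""
    n = len(table)
    i = h % n
    while True:
        group = [table[i + j] for j in range(k)]  # i+j may be >= n!
        if key in group:
            return i + group.index(key)
        if EMPTY in group:
            return None
        i = (i + k) % n
```

With a table of length 8 and *k* = 4, probing from hash index 7 asks for buckets 7 through 10, three of which do not exist.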

Kulukundis presents two solutions to this, in turn.

The first begins by ensuring that *n* is always a multiple of *k* (one easy way to accomplish this is to make *n* a power of two, assuming that *k* is a power of two, which it usually is).
Then, when we probe, on top of reducing *i* modulo *n*, we also round it so that it is a multiple of *k*^{1}.
Thus, H is logically divided into ‘metabuckets’ of length *k*; it contains a whole number of metabuckets, and probing always begins at the beginning of a metabucket, probing the entire metabucket and no more, so there is no danger of overreading or overwriting.
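For instance (a sketch; the function name is mine), with power-of-two *n* and *k* the reduce-and-round step is a single mask:

```python
def metabucket_start(h, n, k):
    """Reduce hash h modulo n and round down to a multiple of k, giving
    the first bucket of a metabucket. Assumes n and k are powers of two
    with k dividing n, so both steps fuse into one bitmask: keep the low
    bits below n, clear the low bits below k."""
    assert n % k == 0
    return h & ((n - 1) & ~(k - 1))
```

`metabucket_start(h, n, k)` always equals `(h % n) // k * k`; probing then proceeds metabucket by metabucket from there.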

An obvious concern with this design is that it increases the average probe length, by creating ‘clumps’ at the beginning of each metabucket.
But it will not create any clumps of length greater than *k* (any clumps of length greater than *k* would have shown up anyway), and we probe *k* at a time, so the effect is not so deleterious.

The other solution is to reserve storage for *n*+*k*-1 buckets.
Logically, the table still contains only *n* elements; the newly added *k*-1 buckets mirror the first *k*-1.
Writes to the table are responsible for maintaining this mirroring, but reads do not have to do any additional bookkeeping, and can use the above naïve algorithm as-is.
Removing overhead from reads and adding it to writes is considered good (within reason), because associative data structures tend to get more reads than writes (notionally, most stores are not dead).
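A sketch of such a write path (names hypothetical): the table owns *n*+*k*-1 physical buckets, and every store into one of the first *k*-1 logical buckets is duplicated into the tail:

```python
def make_mirrored_table(n, k):
    """n logical buckets plus k-1 physical tail buckets that mirror
    buckets 0..k-2."""
    return [None] * (n + k - 1)

def mirrored_write(table, n, k, i, value):
    """Store value in logical bucket i (0 <= i < n). Writes below k-1
    are duplicated at n+i, so a k-bucket load starting at any i < n
    always sees up-to-date data without wrapping."""
    table[i] = value
    if i < k - 1:
        table[n + i] = value
```

Reads then run the naïve group loop unchanged: every load of buckets *i* through *i*+*k*-1 with *i* < *n* stays within the allocation.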

In my application, I cannot make do with metabucketing^{2}, and would like to avoid requiring extra faffery for writes.
Here, then, is the great insight: we can get away with allocating *n*+*k*-1 buckets, as in Kulukundis’s second solution, but then follow the naïve algorithm, for both reads and writes, with *no* mirroring or additional bookkeeping.
Vector probing in this case will not consider the same sequence of buckets as would scalar probing on a table of size *n*+*k*-1; if, for instance, *i*=*n*-1, then buckets 0 through *k*-2 will never be considered.
But it matters not, for both reads and writes will consider the same sequence of buckets as each other, as the probe procedure is identical for both^{3}.
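Putting that together, a minimal sketch (class and method names mine; a real table would hash its keys and resize when full): *n*+*k*-1 physical buckets, the probe index reduced modulo *n*, and one identical group-probing loop serving both insertion and lookup:

```python
EMPTY = None

class OverflowTable:
    """Open-addressed table with n logical buckets plus k-1 physical
    overflow buckets. No mirroring: the overflow buckets are ordinary
    storage that only probe chains starting near the end of the table
    ever reach."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.slots = [EMPTY] * (n + k - 1)

    def _starts(self, h):
        """Group start indices: reduced modulo n, stepping by k."""
        i = h % self.n
        for _ in range(self.n):  # bound the walk; a real table resizes
            yield i
            i = (i + self.k) % self.n

    def insert(self, h, key):
        for i in self._starts(h):
            for j in range(i, i + self.k):  # j <= n+k-2: in bounds
                if self.slots[j] is EMPTY or self.slots[j] == key:
                    self.slots[j] = key
                    return j

    def lookup(self, h, key):
        for i in self._starts(h):
            group = self.slots[i:i + self.k]
            if key in group:
                return i + group.index(key)
            if EMPTY in group:
                return None
```

With *n* = 8 and *k* = 4, inserting twice at hash index 7 fills physical buckets 7 and 8; bucket 8 lies past the logical table, but lookups starting from index 7 probe the very same buckets and so find both keys, while buckets 0 through 2 are never consulted for that chain.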


0. You may sidestep this problem, on some architectures, by making your table span the entire address space.

1. Alternately, reduce it modulo *n*/*k*, and then multiply by *k*; or, multiply by *k* and then reduce modulo *n*. But rounding is attractive, even though it wastes bits, because we can round and reduce at the same time using a bitmask if *n* and *k* are both powers of two.

2. Actually, I can, and do, to reduce cache pollution, but my probe length is larger than my metabuckets. We can consider these cases uniformly, including the degenerate case when the metabucket size is 1.

3. Another way of thinking about this is that for a table of length *n*, we have no extraneous buckets, but instead reduce our indices modulo *n*-*k*+1.