If you google for ‘AVX2 unsigned compare’ or similar, you will find no end of Stack Overflow answers telling you to simply add 128 (or however much—depending on the sizes at hand) to each of your inputs before comparing. But if you are so unfortunate as to be stuck without AVX-512, and you are in a hostile environment, you will have to figure out how to load up a vector of 128s, which costs cycles. Fortunately, there is an alternative.
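For reference, the usual bias trick can be sketched with intrinsics roughly as follows. This is a minimal sketch, not anyone's canonical implementation: the helper name `gt_mask_biased` is made up for illustration, and it uses the 128-bit SSE2 forms (the AVX2 counterparts `_mm256_add_epi8`, `_mm256_cmpgt_epi8` and so on follow the same pattern). Note the `bias` constant, which is exactly the vector of 128s that has to come from somewhere.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: returns a 16-bit mask with bit i set
 * wherever a[i] > b[i], treating the bytes as unsigned. */
static int gt_mask_biased(const uint8_t a[16], const uint8_t b[16]) {
    /* -128 and +128 are the same byte (0x80), so adding this constant
     * adds 128 modulo 256 and maps unsigned order onto signed order. */
    const __m128i bias = _mm_set1_epi8((char)-128);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sa = _mm_add_epi8(va, bias);
    __m128i sb = _mm_add_epi8(vb, bias);
    /* PCMPGTB is a signed compare; on the biased inputs it gives the
     * unsigned result. */
    __m128i gt = _mm_cmpgt_epi8(sa, sb);
    return _mm_movemask_epi8(gt);
}
```

That is two extra additions plus the constant, on top of the compare itself.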
AVX2 and SSE do have a few instructions for working on unsigned numbers; the important ones are PMAXUB, PMINUB, and PSUBUSB, along with the analogous instructions on other sizes. Not every size is covered: before AVX-512, saturating subtraction exists only for bytes and words, and unsigned min and max stop at doublewords. Comparison can be implemented using subtraction, obviously; and given:
VPSUBUSB z, x, y
This leaves in z a zero wherever y was greater than or equal to x, and some nonzero value (the difference) otherwise. However, in most contexts you’ll want a mask, and instructions that consume masks look only at the most significant bit of each lane, so an additional comparison with zero is needed to do anything useful (a notable exception is PTEST). That in turn requires a zeroed register, which increases frontend and register pressure. To avoid this, use min:
VPMINUB z, x, y
VPCMPEQB w, z, x
This will leave in w a true wherever x was equal to the min of x and y—that is, wherever x was less than or equal to y—and everywhere else a false. Comparing with y instead of with x and swapping out min for max give analogous results.
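Both approaches can be written out with intrinsics to check that they agree. The following is a sketch under a few assumptions: the helper names are hypothetical, the code uses the 128-bit SSE2 forms so that it builds without extra flags on x86-64 (for AVX2, substitute `_mm256_subs_epu8`, `_mm256_min_epu8`, `_mm256_max_epu8` and `_mm256_cmpeq_epi8`), and results are collected with PMOVMSKB purely so they are easy to inspect.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* x <= y via saturating subtract; needs a zeroed register for the compare. */
static int le_mask_subs(const uint8_t x[16], const uint8_t y[16]) {
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i z  = _mm_subs_epu8(vx, vy);                /* 0 where y >= x */
    __m128i w  = _mm_cmpeq_epi8(z, _mm_setzero_si128());
    return _mm_movemask_epi8(w);
}

/* x <= y via min: x equals min(x, y) exactly when x <= y. No constant. */
static int le_mask_min(const uint8_t x[16], const uint8_t y[16]) {
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i z  = _mm_min_epu8(vx, vy);
    __m128i w  = _mm_cmpeq_epi8(z, vx);
    return _mm_movemask_epi8(w);
}

/* x >= y via max: x equals max(x, y) exactly when x >= y. */
static int ge_mask_max(const uint8_t x[16], const uint8_t y[16]) {
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i z  = _mm_max_epu8(vx, vy);
    __m128i w  = _mm_cmpeq_epi8(z, vx);
    return _mm_movemask_epi8(w);
}
```

Only the non-strict comparisons fall out directly; a strict one is the complement of the opposite non-strict mask, which costs an inversion unless the consumer can simply take the other branch of a blend or test.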
I wonder if anyone working on superoptimisation has caught these? They’re very short instruction sequences, but you might not have thought to look for them if you didn’t need them.