CPUs running Intel’s Skylake-X microarchitecture have a curious bug that I haven’t seen mentioned anywhere: the AVX-512 compression instructions have a false dependency on their destination. In other words, the following two instructions have identical performance characteristics:
    vcompressps X{k}, Y
    vcompressps X{k}{z}, Y
Whereas we would expect the latter to depend only on k and Y, it also depends on X. The problem seems to have been fixed in Ice Lake. Surprisingly, while it affects all compression operations, it does not affect any of the expansion operations. Presumably, this is related to the odd behaviour of compression with a memory destination; expansion can’t target memory.
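To make the expected dependencies concrete, here is a scalar sketch of the two register-destination forms for 32-bit lanes. It only models the architectural result, not the hardware; the names compress_merge, compress_zero and LANES are made up for illustration and do not come from the benchmark.

```c
#include <stdint.h>

#define LANES 16 /* 32-bit lanes in a 512-bit register */

/* Models vcompressps X{k}, Y (merge-masking): selected lanes of y are
   packed into the low lanes of x, and the remaining lanes of x keep
   their old values. The result genuinely depends on the old x. */
static void compress_merge(float x[LANES], uint16_t k, const float y[LANES]) {
    int n = 0;
    for (int i = 0; i < LANES; i++)
        if ((k >> i) & 1)
            x[n++] = y[i];
}

/* Models vcompressps X{k}{z}, Y (zero-masking): same packing, but the
   remaining lanes are zeroed. Every lane of x is overwritten, so the
   old x is never read and no dependency on it should be needed. */
static void compress_zero(float x[LANES], uint16_t k, const float y[LANES]) {
    int n = 0;
    for (int i = 0; i < LANES; i++)
        if ((k >> i) & 1)
            x[n++] = y[i];
    while (n < LANES)
        x[n++] = 0.0f;
}
```

In the second function x is write-only, which is why a dependency on X in the zeroing form is a false one.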
One thing I have never understood is why compression and expansion operations pun on the operation mask, instead of using it for its ordinary purpose and using a separate mask to specify the lanes to be compressed from or expanded to. There is certainly space in the instruction encoding for it. And these operations are slow enough that it seems unlikely adding a mask would have slowed them down further. The semantics have to be figured out, but they can be figured out (I would just AND the two masks together), and the result would be strictly more expressive.
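A scalar sketch of what such a two-mask compression might look like follows. This is one possible reading of "AND the two masks together", not an existing instruction: the selection mask sel determines which source lanes are packed, the packed count implies a mask of low destination lanes, and that implied mask is ANDed with an ordinary write mask w. The function name compress2 and all of its parameters are hypothetical.

```c
#include <stdint.h>

#define LANES 16 /* 32-bit lanes in a 512-bit register */

/* Hypothetical two-mask compress (no such instruction exists).
   sel     : which lanes of y are packed toward lane 0
   w       : ordinary write mask applied to the destination lanes
   zeroing : nonzero for {z} behaviour, zero for merge behaviour */
static void compress2(float x[LANES], uint16_t sel, uint16_t w,
                      int zeroing, const float y[LANES]) {
    float packed[LANES];
    int n = 0;
    for (int i = 0; i < LANES; i++)
        if ((sel >> i) & 1)
            packed[n++] = y[i];

    /* The low n lanes are the ones compression would write on its own;
       AND that implied mask with the ordinary write mask. */
    uint16_t implied = (uint16_t)((1u << n) - 1u);
    uint16_t effective = implied & w;

    for (int j = 0; j < LANES; j++) {
        if ((effective >> j) & 1)
            x[j] = packed[j];
        else if (zeroing)
            x[j] = 0.0f;
        /* otherwise the lane is merged: x[j] keeps its old value */
    }
}
```

Under this reading, passing an all-ones w reproduces today's merge and zeroing forms, so nothing is lost by adding the second mask.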
My benchmarking code is here. On a system which has the bug, it should print two large numbers and two small numbers; on a system which does not have the bug, it should print one large number and three small numbers.
A Zen4 user has reported that Zen4 seems to have a false dependency on the destination for both compressions and expansions. We can only speculate as to why (it’s presumably completely unrelated), but it might have something to do with the aforementioned pun on the mask.
The benchmark has been updated: it now attempts to perform its own analysis of the measurements it takes. It also tests all supported compression and expansion functions; compile with -DVBMI2 or -march=native on a machine supporting that hardware (Ice Lake or later, or Zen4). I am interested to see results from Rocket Lake and Cannon Lake CPUs; if you have such a CPU and are interested, please feel free to email me the benchmark’s results.