Zen 3's Amazing Slow Short Rep Mov


In 2019 Intel released a new CPU feature called FSRM - Fast Short Rep Mov - as part of their Ice Lake architecture.
FSRM signals that rep movsb should be used even for small moves and will perform well. Previous processors' rep movsb implementations were only any good on large data.

Intel® 64 and IA-32 Architectures Optimization Reference Manual Volume 1

3.7.6.1 Fast Short REP MOVSB
Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance Table 3-6. Effect of Address Misalignment on Memcpy() Performance

glibc (and others, including the Linux kernel) started checking for FSRM and deciding to use rep movsb instructions for memmove, including for unaligned data. On Intel these operations are specified to perform somewhat worse for unaligned data, but not terribly.

Address Misalignment Performance Impact
Source Buffer The impact on Enhanced REP MOVSB and STOSB implementation versus 128-bit AVX is similar. Destination Buffer The impact on Enhanced REP MOVSB and STOSB implementation can be 25% degradation, while 128-bit AVX implementation of memcpy may degrade only 5%, relative to 16-byte aligned scenario.

AMD started setting the same Fast Short Rep Mov flag with their Zen 3 (Ryzen 5xxx) CPUs but did not test it on unaligned data.
rep movsb with unaligned data on a Zen 3 CPU sees a 20x slowdown! glibc Bug 30994.

FSRM:
  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0                84.2448              
  2113                              15                 4.4310
  524287                             0                57.1122 
  524287                            15                4.34671

Vectorized:
  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0               124.1830             
  2113                              15               121.8720
  524287                             0                58.3212 
  524287                            15                58.5352 

A patch set for glibc to avoid using rep movsb again is under development.

I hope AMD are able to release a microcode update which fixes the behavior of rep movsb on unaligned data or unsets the FSRM flag. Without a fix, the FSRM CPUID flag is misleading on AMD processors and any code deciding to use rep movsb will be wrong if it does the obvious thing.


tagged performance