In 2019 Intel released a new CPU feature called FSRM - Fast Short Rep Mov - as part of their Ice Lake architecture.
FSRM signals that rep movsb should be used even for small moves and will perform well. On previous processors, rep movsb only performed well on large copies.
Intel's optimization manual [1] states:
3.7.6.1 Fast Short REP MOVSB
Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.
Table 3-6. Effect of Address Misalignment on Memcpy() Performance
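To make the enumeration in the quoted passage concrete, here is a minimal sketch of how a program might test that bit (CPUID leaf 7, subleaf 0, EDX bit 4). It assumes GCC or Clang, whose <cpuid.h> header provides __get_cpuid_count; the names edx_has_fsrm and cpu_reports_fsrm are made up for this illustration and are not taken from glibc or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header (assumed toolchain) */
#endif

/* FSRM is reported in EDX bit 4 of CPUID leaf 7, subleaf 0. */
#define CPUID7_EDX_FSRM (1u << 4)

/* Pure helper: does a raw EDX value from leaf 7 advertise FSRM? */
static inline bool edx_has_fsrm(uint32_t edx) {
    return (edx & CPUID7_EDX_FSRM) != 0;
}

/* Hypothetical wrapper that queries the running CPU. Returns false on
   non-x86 builds or when the CPU does not support leaf 7. */
static inline bool cpu_reports_fsrm(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_fsrm(edx);
#else
    return false;
#endif
}
```

Note that this only reports what the CPU claims about itself. It says nothing about how rep movsb actually performs.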
glibc (and others, including the Linux kernel) started checking for FSRM and using rep movsb for memmove when the flag is set, including for unaligned data. On Intel, rep movsb is documented to perform somewhat worse on unaligned data, but not terribly.
Address Misalignment  Performance Impact
Source Buffer         The impact on Enhanced REP MOVSB and STOSB implementation versus 128-bit AVX is similar.
Destination Buffer    The impact on Enhanced REP MOVSB and STOSB implementation can be 25% degradation, while 128-bit AVX implementation of memcpy may degrade only 5%, relative to 16-byte aligned scenario.
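For readers who have not met the instruction directly, a copy built on rep movsb can be sketched in C with GCC/Clang inline assembly. This is only an illustration under stated assumptions, not the code glibc or the kernel uses: copy_rep_movsb and copy_trusting_fsrm are hypothetical names, the copy runs forward only (so it is not a full memmove for overlapping buffers), and the direction flag is assumed to be clear, as the usual calling conventions require.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes front to back with a single REP MOVSB.
   Falls back to memcpy when not building for x86 with GCC/Clang. */
static inline void *copy_rep_movsb(void *dst, const void *src, size_t n) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    void *d = dst;       /* destination pointer, lives in (R|E)DI */
    const void *s = src; /* source pointer, lives in (R|E)SI */
    size_t c = n;        /* byte count, lives in (R|E)CX */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(c)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}

/* Hypothetical dispatcher showing the "trust the flag" policy:
   use rep movsb whenever the caller says FSRM is advertised. */
static inline void *copy_trusting_fsrm(void *dst, const void *src,
                                       size_t n, bool fsrm_advertised) {
    return fsrm_advertised ? copy_rep_movsb(dst, src, n)
                           : memcpy(dst, src, n);
}
```

The dispatcher takes the flag as a parameter so the policy stays separate from the detection. Alignment of dst plays no part in the decision, which is exactly the behaviour the flag is supposed to make safe.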
AMD started setting the same Fast Short Rep Mov flag with their Zen 3 (Ryzen 5xxx) CPUs but apparently did not test it on unaligned data.
rep movsb with an unaligned destination on a Zen 3 CPU sees a slowdown of up to roughly 20x! [2]
FSRM:
Size (bytes)  Destination Alignment  Throughput (GB/s)
2113          0                      84.2448
2113          15                     4.4310
524287        0                      57.1122
524287        15                     4.34671
Vectorized:
Size (bytes)  Destination Alignment  Throughput (GB/s)
2113          0                      124.1830
2113          15                     121.8720
524287        0                      58.3212
524287        15                     58.5352
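The size of the slowdown can be read straight off these numbers by dividing the aligned throughput by the misaligned one. The short sketch below does that arithmetic; the figures are copied from the two tables above, and the struct and function names are made up for the illustration.

```c
#include <stdio.h>

/* One size from the tables above: throughput in GB/s with the
   destination at alignment 0 and at alignment 15. */
struct row {
    const char *impl;
    unsigned size_bytes;
    double aligned_gbps;
    double misaligned_gbps;
};

static const struct row rows[] = {
    {"FSRM",       2113,   84.2448,  4.4310},
    {"FSRM",       524287, 57.1122,  4.34671},
    {"Vectorized", 2113,   124.1830, 121.8720},
    {"Vectorized", 524287, 58.3212,  58.5352},
};

/* How many times slower the misaligned case is than the aligned one. */
static inline double slowdown(double aligned_gbps, double misaligned_gbps) {
    return aligned_gbps / misaligned_gbps;
}

/* Print one slowdown factor per row. */
static inline void print_slowdowns(void) {
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-10s %6u bytes: %.1fx\n", rows[i].impl, rows[i].size_bytes,
               slowdown(rows[i].aligned_gbps, rows[i].misaligned_gbps));
}
```

With these inputs, rep movsb comes out about 19x slower at 2113 bytes and about 13x slower at 524287 bytes, while the vectorized copy stays within a couple of percent of its aligned speed.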
A glibc patch set that goes back to avoiding rep movsb [3] is under development.
I hope AMD are able to release a microcode update which fixes the behavior of rep movsb on unaligned data or unsets the FSRM flag. Without a fix, the FSRM CPUID [4] flag is misleading on AMD processors, and any code that does the obvious thing and uses rep movsb whenever the flag is set will make the wrong choice.
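As an illustration of what a less obvious check could look like, the sketch below only trusts the FSRM flag when the CPU vendor string is not "AuthenticAMD". This is a deliberately blunt, hypothetical policy for the sake of example. It is not what the glibc patch set does, and it would also penalise any AMD CPU where rep movsb behaves well. The names vendor_from_regs, cpu_vendor and trust_fsrm are invented here; __get_cpuid comes from the GCC/Clang <cpuid.h> header.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header (assumed toolchain) */
#endif

/* Build the 12-character vendor string from CPUID leaf 0.
   The bytes are stored in EBX, EDX, ECX order, low byte first. */
static inline void vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                    char out[13]) {
    const uint32_t regs[3] = {ebx, edx, ecx};
    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    out[12] = '\0';
}

/* Query the running CPU; yields an empty string on non-x86 builds. */
static inline void cpu_vendor(char out[13]) {
    out[0] = '\0';
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        vendor_from_regs(ebx, edx, ecx, out);
#endif
}

/* Hypothetical policy: believe the FSRM flag unless the vendor is AMD. */
static inline bool trust_fsrm(bool fsrm_flag, const char *vendor) {
    return fsrm_flag && strcmp(vendor, "AuthenticAMD") != 0;
}
```

A finer-grained version would key on the specific CPU family and model rather than the vendor, but which models are affected is not something the numbers above establish.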