https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116713
--- Comment #3 from pietro <pietro.gcc at sociotechnical dot xyz> ---
It looks like it's a more general GCC issue. The prefetch gets moved on both
x86_64 and aarch64 on GCC, but not on clang: https://godbolt.org/z/Ycjr7Tq8b

> It looks like the problem can be "fixed" by inserting a
> '__atomic_thread_fence (1);' before the '__builtin_prefetch', which kinda
> makes sense.

The thread fence doesn't fix the prefetch move on x86_64, but the empty "asm"
trick does: https://godbolt.org/z/5G8qe4o1n
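
For readers who do not follow the links, here is a minimal sketch of the two workarounds being compared. It is not the code from either Compiler Explorer example. The function names, the summing loop and the `next` parameter are hypothetical, and it is an assumption that the empty "asm" trick means an empty asm statement with a "memory" clobber. The sketch only shows where each barrier sits relative to the '__builtin_prefetch'. Whether the prefetch still gets moved has to be checked in the generated assembly.

```c
#include <stddef.h>

/* Assumed form of the empty "asm" trick: no instructions are emitted,
   but the "memory" clobber tells the compiler not to move memory
   accesses across this point.  */
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

/* Hypothetical variant 1: thread fence before the prefetch.  In GCC the
   literal 1 is the value of __ATOMIC_CONSUME.  */
long
sum_then_prefetch_fence (const long *data, size_t n, const void *next)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += data[i];

  __atomic_thread_fence (1);
  __builtin_prefetch (next);
  return sum;
}

/* Hypothetical variant 2: empty asm barrier before the prefetch.  */
long
sum_then_prefetch_asm (const long *data, size_t n, const void *next)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += data[i];

  COMPILER_BARRIER ();
  __builtin_prefetch (next);
  return sum;
}
```

Both variants return the same value. The only intended difference is which barrier precedes the prefetch, so comparing their output at -O2 on x86_64 and aarch64 should show where the prefetch ends up in each case.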