The prefetch instruction emitted by __builtin_prefetch is reordered by GCC, but not by clang [0]. GCC's behavior is surprising: when using the builtin, you want the instruction placed exactly where you put it. Moving it around, especially across loads/stores, may end up being a pessimization. Adding a blockage instruction before the prefetch prevents the scheduler from moving it.
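
For illustration, a minimal sketch of the kind of code where placement matters. This is a hypothetical example, not the testcase behind [0]; the function name and the `distance` parameter are made up. The prefetch is written just before the load of the current element, and the intent is that the emitted instruction stays there.

```c
#include <stddef.h>

/* Hypothetical example: sum an array while hinting a later element.
   The prefetch is written just before the load of data[i], and the
   expectation is that the emitted instruction stays at that point.  */
long
sum_with_prefetch (const long *data, size_t n, size_t distance)
{
  long sum = 0;

  for (size_t i = 0; i < n; i++)
    {
      /* rw = 0 (prefetch for read), locality = 3 (keep in all caches).  */
      if (i + distance < n)
        __builtin_prefetch (&data[i + distance], 0, 3);

      sum += data[i];
    }

  return sum;
}
```

The prefetch is only a hint, so the result is the same wherever the instruction ends up; only the timing of the memory accesses may differ.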

[0] https://godbolt.org/z/Ycjr7Tq8b

-- 8< --

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 37c7c98e5c..fec751e0d6 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -1329,7 +1329,12 @@ expand_builtin_prefetch (tree exp)
       create_integer_operand (&ops[1], INTVAL (op1));
       create_integer_operand (&ops[2], INTVAL (op2));
       if (maybe_expand_insn (targetm.code_for_prefetch, 3, ops))
-        return;
+        {
+          /* Prevent the prefetch from being moved. */
+          rtx_insn *last = get_last_insn ();
+          emit_insn_before (gen_blockage (), last);
+          return;
+        }
     }
 
   /* Don't do anything with direct references to volatile memory, but