https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776
--- Comment #20 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
Thanks, I'd be very happy if such a relatively clear implementation could make
it!

> branchfree code is always better.

Don't say it like that. Smart branching, making use of how static
branch-prediction works, can speed up code significantly. You don't want to
compute everything when 99.9% of the inputs need only a fraction of the work.

TYPE                            Latency       Speedup    Throughput       Speedup
                          [cycles/call]   [per value] [cycles/call]   [per value]
float, simd_abi::scalar            48.1             1            17             1
float, std::hypot                  43.3          1.11          12.3          1.39
float, hypot3_scale                31.7          1.52          22.3         0.764
float, hypot3_exp                  83.9         0.574          84.5         0.201
--------------------------------------------------------------------------------------
TYPE                            Latency       Speedup    Throughput       Speedup
                          [cycles/call]   [per value] [cycles/call]   [per value]
double, simd_abi::scalar           54.7             1            15             1
double, std::hypot                 53.8          1.02            19          0.79
double, hypot3_scale                 44          1.24            24         0.625
double, hypot3_exp                 91.3         0.599            91         0.165

and with -ffast-math:

TYPE                            Latency       Speedup    Throughput       Speedup
                          [cycles/call]   [per value] [cycles/call]   [per value]
float, simd_abi::scalar            48.9             1          9.15             1
float, std::hypot                  53.2         0.918          8.31           1.1
float, hypot3_scale                31.3          1.56            14         0.652
float, hypot3_exp                  55.9         0.874          21.5         0.425
--------------------------------------------------------------------------------------
TYPE                            Latency       Speedup    Throughput       Speedup
                          [cycles/call]   [per value] [cycles/call]   [per value]
double, simd_abi::scalar           54.8             1          9.07             1
double, std::hypot                 61.5         0.891          11.3         0.805
double, hypot3_scale               40.8          1.34          12.1         0.753
double, hypot3_exp                 64.2         0.853          23.3          0.39

I have not tested correctness or precision yet. Also, the benchmark only uses
inputs that do not require anything else than √(x²+y²+z²) (which, I believe,
should be the common input and thus optimized for).
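
To illustrate the "smart branching" point above, here is a minimal sketch of a 3-argument hypot that takes a cheap path for the common case and falls back to scaling only when needed. It is not the hypot3_scale or hypot3_exp code that was benchmarked; the name hypot3_fastpath and the thresholds 1e-150 / 1e150 are assumptions made for this example, and precision has not been analysed.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical illustration only: a branchy 3-argument hypot for double.
inline double hypot3_fastpath(double x, double y, double z)
{
    x = std::fabs(x);
    y = std::fabs(y);
    z = std::fabs(z);
    const double hi = std::max({x, y, z});

    // Assumed "safe window": if the largest magnitude lies in here, hi*hi
    // can neither overflow nor underflow, so the plain formula is usable.
    // The comparison is false for inf and NaN, which sends them to the
    // slow path. A C++20 [[likely]] hint could be added to this branch.
    if (hi > 1e-150 && hi < 1e150)
        return std::sqrt(x * x + y * y + z * z);

    // Rare path: special values and extreme magnitudes.
    if (std::isinf(x) || std::isinf(y) || std::isinf(z))
        return std::numeric_limits<double>::infinity(); // inf wins over NaN
    if (std::isnan(x) || std::isnan(y) || std::isnan(z))
        return std::numeric_limits<double>::quiet_NaN();
    if (hi == 0)
        return 0;

    // Scale by the largest magnitude so the squares stay in [0, 1].
    x /= hi;
    y /= hi;
    z /= hi;
    return hi * std::sqrt(x * x + y * y + z * z);
}
```

With inputs like those in the benchmark, only the first branch is taken, so the cost of the rare path never shows up in the numbers above.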