https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #20 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
Thanks, I'd be very happy if such a relatively clear implementation could make
it!

> branchfree code is always better.

Don't say it like that. Smart branching, making use of how static
branch-prediction works, can speed up code significantly. You don't want to
compute everything when 99.9% of the inputs need only a fraction of the work.

TYPE                            Latency      Speedup     Throughput      Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
 float, simd_abi::scalar           48.1            1             17            1
 float, std::hypot                 43.3         1.11           12.3         1.39
 float, hypot3_scale               31.7         1.52           22.3        0.764
 float, hypot3_exp                 83.9        0.574           84.5        0.201
--------------------------------------------------------------------------------
TYPE                            Latency      Speedup     Throughput      Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
double, simd_abi::scalar           54.7            1             15            1
double, std::hypot                 53.8         1.02             19         0.79
double, hypot3_scale                 44         1.24             24        0.625
double, hypot3_exp                 91.3        0.599             91        0.165

and with -ffast-math:

TYPE                            Latency      Speedup     Throughput      Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
 float, simd_abi::scalar           48.9            1           9.15            1
 float, std::hypot                 53.2        0.918           8.31          1.1
 float, hypot3_scale               31.3         1.56             14        0.652
 float, hypot3_exp                 55.9        0.874           21.5        0.425
--------------------------------------------------------------------------------
TYPE                            Latency      Speedup     Throughput      Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
double, simd_abi::scalar           54.8            1           9.07            1
double, std::hypot                 61.5        0.891           11.3        0.805
double, hypot3_scale               40.8         1.34           12.1        0.753
double, hypot3_exp                 64.2        0.853           23.3         0.39


I have not tested correctness or precision yet. Also, the benchmark only uses
inputs that require nothing beyond the plain √(x²+y²+z²) (which, I believe,
should be the common case and thus the one to optimize for).
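
To make the "branch for the common case" idea concrete, here is a minimal
sketch for double only. It is not hypot3_scale or hypot3_exp from the numbers
above; the name hypot3_sketch and the 1e-150/1e150 thresholds are my own
assumptions, and the slow path is a simple rescaling whose precision I have
not analyzed.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical sketch: fast path for "ordinary" inputs, slow path otherwise.
// The thresholds are assumptions chosen for double, not values from the PR.
inline double hypot3_sketch(double x, double y, double z)
{
    x = std::fabs(x);
    y = std::fabs(y);
    z = std::fabs(z);
    const double hi = std::max({x, y, z});

    // Assumed safe range: with 1e-150 < hi < 1e150 the sum of squares
    // cannot overflow and its largest term cannot underflow.
    constexpr double lo_bound = 1e-150;
    constexpr double hi_bound = 1e150;

    // Common case: one predictable branch, then the plain formula.
    // If hi is NaN both comparisons are false and we take the slow path.
    if (hi > lo_bound && hi < hi_bound)
        return std::sqrt(x * x + y * y + z * z);

    // Rare case: special values first (infinity wins over NaN, mirroring
    // the two-argument hypot), then rescale by the largest magnitude.
    if (std::isinf(x) || std::isinf(y) || std::isinf(z))
        return std::numeric_limits<double>::infinity();
    if (std::isnan(x) || std::isnan(y) || std::isnan(z))
        return std::numeric_limits<double>::quiet_NaN();
    if (hi == 0.0)
        return 0.0;
    const double a = x / hi, b = y / hi, c = z / hi;
    return hi * std::sqrt(a * a + b * b + c * c);
}
```

With inputs like (3, 4, 12) only the first branch runs; something like
(3e200, 4e200, 12e200) falls through to the rescaling, where the naive sum of
squares would overflow.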
