https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

            Bug ID: 90993
           Summary: simd integer division not optimized
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case (https://godbolt.org/z/CYipz7):

template <class T> using V [[gnu::vector_size(16)]] = T;

V<char > f(V<char > a, V<char > b) { return a / b; }
V<short> f(V<short> a, V<short> b) { return a / b; }
V<int  > f(V<int  > a, V<int  > b) { return a / b; }
V<unsigned char > f(V<unsigned char > a, V<unsigned char > b) { return a / b; }
V<unsigned short> f(V<unsigned short> a, V<unsigned short> b) { return a / b; }
V<unsigned int  > f(V<unsigned int  > a, V<unsigned int  > b) { return a / b; }

(You can extend the test case to 32- and 64-byte vectors.)

x86 has no SIMD instruction for any of these divisions. However, conversion to
float or double vectors is lossless (char & short -> float, int -> double) and
enables implementation via divps/divpd. This leads to a considerable speedup
(especially in divider throughput), even including the cost of the conversions.
Division by 0 is UB (http://eel.is/c++draft/expr.mul#4), so it doesn't
matter that a potential SIGFPE turns into "whatever". ;-)
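
To make the idea concrete, here is a minimal sketch of the conversion approach
written with the generic vector extensions. It is not the code from my library
and not what GCC would have to emit; the type names (short8, float8, int4,
double4) and the helpers div_via_float / div_via_double are made up for
illustration. It assumes __builtin_convertvector is available (GCC 9 or later,
or Clang) and, like the original division, it has undefined behavior for a zero
divisor or an overflowing quotient.

```cpp
// Sketch only: integer vector division routed through FP division.
// Assumes GCC >= 9 or Clang (for __builtin_convertvector).

// 8 x short (16 bytes) and a float vector with the same element count.
typedef short short8 __attribute__((vector_size(16)));
typedef float float8 __attribute__((vector_size(32)));

// 4 x int (16 bytes) and a double vector with the same element count.
typedef int    int4    __attribute__((vector_size(16)));
typedef double double4 __attribute__((vector_size(32)));

// Hypothetical helper: 16-bit division via float. Every short value is
// exactly representable in float (24-bit significand).
inline short8 div_via_float(short8 a, short8 b) {
  float8 fa = __builtin_convertvector(a, float8);
  float8 fb = __builtin_convertvector(b, float8);
  // The float -> short conversion truncates toward zero, which matches the
  // rounding of integer division.
  return __builtin_convertvector(fa / fb, short8);
}

// Hypothetical helper: 32-bit division via double. Every int value is
// exactly representable in double (53-bit significand).
inline int4 div_via_double(int4 a, int4 b) {
  double4 da = __builtin_convertvector(a, double4);
  double4 db = __builtin_convertvector(b, double4);
  return __builtin_convertvector(da / db, int4);
}
```

How well this compiles depends on how the conversions themselves are lowered
for the selected ISA; the intermediate float8/double4 values are 32 bytes wide,
so without AVX the compiler has to split them into two SSE halves.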

For reference, this is the result of my library implementation:
https://godbolt.org/z/Xgo9Pk.

And benchmark results on Skylake i7:
                  TYPE            Latency     Speedup     Throughput     Speedup
                            [cycles/call]              [cycles/call]
 schar,                              24.5                       9.81
 schar, simd_abi::__sse              32.3        12.1           9.19        17.1
 schar, vector_size(16)               128        3.06            125        1.26
 schar, simd_abi::__avx              40.3        19.4           18.7        16.8
 schar, vector_size(32)               255        3.07            256        1.23
--------------------------------------------------------------------------------
 uchar,                              20.8                       7.55
 uchar, simd_abi::__sse              31.9        10.4            9.5        12.7
 uchar, vector_size(16)               121        2.74            116        1.04
 uchar, simd_abi::__avx              39.9        16.7           18.8        12.8
 uchar, vector_size(32)               230         2.9            224        1.08
--------------------------------------------------------------------------------
 short,                              22.7                        6.4
 short, simd_abi::__sse              23.6         7.7           4.52        11.3
 short, vector_size(16)              62.6        2.91           58.4       0.877
 short, simd_abi::__avx              30.6        11.9           9.55        10.7
 short, vector_size(32)               120        3.03            114         0.9
--------------------------------------------------------------------------------
ushort,                              19.4                       7.37
ushort, simd_abi::__sse              23.7        6.55           4.55        12.9
ushort, vector_size(16)              61.3        2.53           57.4        1.03
ushort, simd_abi::__avx              30.6        10.1           8.86        13.3
ushort, vector_size(32)               116        2.67            114        1.03
--------------------------------------------------------------------------------
   int,                              23.2                       7.14
   int, simd_abi::__sse              24.7        3.75           7.24        3.95
   int, vector_size(16)              40.3         2.3           30.9       0.924
   int, simd_abi::__avx              35.6        5.22           14.5        3.95
   int, vector_size(32)              64.2         2.9           61.4        0.93
--------------------------------------------------------------------------------
  uint,                              20.5                       7.14
  uint, simd_abi::__sse                44        1.86           7.73        3.69
  uint, vector_size(16)              39.7        2.07           30.9       0.925
  uint, simd_abi::__avx              56.9        2.89             16        3.57
  uint, vector_size(32)              71.4         2.3           71.5       0.798
--------------------------------------------------------------------------------

I have not investigated whether the same optimization makes sense for targets
other than x86.

Since this optimization requires optimized vector conversions, PR85048 is
relevant.
