https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993
            Bug ID: 90993
           Summary: simd integer division not optimized
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case (https://godbolt.org/z/CYipz7):

template <class T> using V [[gnu::vector_size(16)]] = T;
V<char > f(V<char > a, V<char > b) { return a / b; }
V<short> f(V<short> a, V<short> b) { return a / b; }
V<int  > f(V<int  > a, V<int  > b) { return a / b; }
V<unsigned char > f(V<unsigned char > a, V<unsigned char > b) { return a / b; }
V<unsigned short> f(V<unsigned short> a, V<unsigned short> b) { return a / b; }
V<unsigned int  > f(V<unsigned int  > a, V<unsigned int  > b) { return a / b; }

(You can extend the test case to 32- and 64-bit vectors.)

None of these divisions has a corresponding SIMD instruction on x86. However,
conversion to float or double vectors is lossless (char & short -> float,
int -> double) and enables an implementation via divps/divpd. This yields a
considerable speedup (especially in divider throughput), even with the cost of
the conversions. (A sketch of this lowering is given at the end of this
report.)

Division by 0 is UB (http://eel.is/c++draft/expr.mul#4), so it doesn't matter
that a potential SIGFPE turns into "whatever". ;-)

For reference, this is the result of my library implementation:
https://godbolt.org/z/Xgo9Pk. Benchmark results on a Skylake i7:

TYPE                           Latency  Speedup     Throughput  Speedup
                         [cycles/call]           [cycles/call]
schar,                            24.5                     9.81
schar,  simd_abi::__sse           32.3     12.1            9.19     17.1
schar,  vector_size(16)            128     3.06             125     1.26
schar,  simd_abi::__avx           40.3     19.4            18.7     16.8
schar,  vector_size(32)            255     3.07             256     1.23
--------------------------------------------------------------------------------
uchar,                            20.8                     7.55
uchar,  simd_abi::__sse           31.9     10.4            9.5      12.7
uchar,  vector_size(16)            121     2.74             116     1.04
uchar,  simd_abi::__avx           39.9     16.7            18.8     12.8
uchar,  vector_size(32)            230     2.9              224     1.08
--------------------------------------------------------------------------------
short,                            22.7                     6.4
short,  simd_abi::__sse           23.6     7.7             4.52     11.3
short,  vector_size(16)           62.6     2.91            58.4     0.877
short,  simd_abi::__avx           30.6     11.9            9.55     10.7
short,  vector_size(32)            120     3.03             114     0.9
--------------------------------------------------------------------------------
ushort,                           19.4                     7.37
ushort, simd_abi::__sse           23.7     6.55            4.55     12.9
ushort, vector_size(16)           61.3     2.53            57.4     1.03
ushort, simd_abi::__avx           30.6     10.1            8.86     13.3
ushort, vector_size(32)            116     2.67             114     1.03
--------------------------------------------------------------------------------
int,                              23.2                     7.14
int,    simd_abi::__sse           24.7     3.75            7.24     3.95
int,    vector_size(16)           40.3     2.3             30.9     0.924
int,    simd_abi::__avx           35.6     5.22            14.5     3.95
int,    vector_size(32)           64.2     2.9             61.4     0.93
--------------------------------------------------------------------------------
uint,                             20.5                     7.14
uint,   simd_abi::__sse             44     1.86            7.73     3.69
uint,   vector_size(16)           39.7     2.07            30.9     0.925
uint,   simd_abi::__avx           56.9     2.89              16     3.57
uint,   vector_size(32)           71.4     2.3             71.5     0.798
--------------------------------------------------------------------------------

I have not investigated whether the same optimization makes sense for targets
other than x86. Since this optimization requires optimized vector conversions,
PR85048 is relevant.
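For illustration, here is a minimal sketch of the lowering described above for
the V<short> case, written against GCC's vector extensions and
__builtin_convertvector. The alias W, the function name div_via_float and the
small main() driver are only for this sketch, not part of the test case or of
my library. Every 16-bit value is exactly representable in float, and because
|a| = |a/b| * |b| < 2^24 the correctly rounded float quotient can never round
across an integer boundary, so truncating it back to short reproduces C++
integer division (division by zero excluded, since that is UB anyway):

// Minimal sketch with assumed helper names; build e.g. with:
//   g++ -std=c++17 -O2 div_sketch.cpp
#include <cstdio>

template <class T> using V [[gnu::vector_size(16)]] = T;  // 8 x short
template <class T> using W [[gnu::vector_size(32)]] = T;  // 8 x float

// Divide two vectors of 8 shorts by round-tripping through float.
// The quotients are exact after truncation because 16-bit values are
// exactly representable in float and |a| < 2^24 keeps the rounding error
// of the float division below the distance to the next integer.
V<short> div_via_float(V<short> a, V<short> b)
{
    W<float> fa = __builtin_convertvector(a, W<float>);
    W<float> fb = __builtin_convertvector(b, W<float>);
    W<float> fq = fa / fb;                         // divps on x86
    return __builtin_convertvector(fq, V<short>);  // float->int truncates toward
                                                   // zero, matching C++ division
}

int main()
{
    V<short> a = {100, -100,  7, 32767, -32768, 1, 5000, -7};
    V<short> b = {  3,    3, -2,     7,      9, 1, -250,  7};
    V<short> q = div_via_float(a, b);
    for (int i = 0; i < 8; ++i)
        std::printf("%d / %d = %d (scalar: %d)\n", a[i], b[i], q[i], a[i] / b[i]);
}

The exact instruction sequence for the conversions depends on the ISA level
(SSE2 vs. SSE4.1 vs. AVX), but the divisions themselves become divps; the
analogous pattern through double/divpd covers the 32-bit element case.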