https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118416
Bug ID: 118416
Summary: std::experimental::simd code detecting all zero is not optimized to simple ptest on x86-64 avx
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: lee.imple at gmail dot com
Target Milestone: ---

The following code uses the experimental C++ standard library simd (`std::experimental::simd`) and detects several all-zero patterns, each of which can be done with a single vptest instruction. All the code is available at https://godbolt.org/z/Kx68E1T6v .

```c++
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::int32_t, 4>;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}

bool ptest_and(data_t a, data_t b) {
    return all_of((a & b) == 0);
}

bool ptest_andn(data_t a, data_t b) {
    return all_of((a & ~b) == 0);
}
```

Equivalent assembly (hand-written):

```asm
simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret
```

But g++ generates the following code at `-O3 -march=x86-64-v3`, and clang++ and even Intel icpx generate almost the same assembly.
```asm
simple_ptest(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >):
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqd        %xmm1, %xmm0, %xmm0
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vptest  %xmm1, %xmm0
        setc    %al
        ret
ptest_and(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >):
        vpand   %xmm1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqd        %xmm1, %xmm0, %xmm0
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vptest  %xmm1, %xmm0
        setc    %al
        ret
ptest_andn(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >):
        vpandn  %xmm0, %xmm1, %xmm1
        vpxor   %xmm0, %xmm0, %xmm0
        vpcmpeqd        %xmm0, %xmm1, %xmm1
        vpcmpeqd        %xmm0, %xmm0, %xmm0
        vptest  %xmm0, %xmm1
        setc    %al
        ret
```

In each function the vector is first compared against zero (`vpxor` + `vpcmpeqd`), and the resulting mask is then tested against an all-ones vector, instead of testing the input directly.

I don't know whether this is a missed optimization in g++ or a libstdc++ issue. Since these compilers generate the same output from the same library code, I guess this is probably a library issue.

Possibly related: PR58790 ?
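
For reference, the flag semantics that the hand-written assembly relies on can be modelled in portable C++. This is only a sketch and not part of the library or the compiler: `vec128`, `ptest_flags`, `model_vptest` and the `model_*` wrappers are hypothetical names introduced for illustration. It assumes the documented PTEST behaviour: ZF is set when `SRC & DEST` is all zero, and CF is set when `SRC & ~DEST` is all zero. In AT&T syntax `vptest %xmm0, %xmm1` has SRC = `%xmm0` and DEST = `%xmm1`.

```c++
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit xmm register holding four 32-bit lanes.
using vec128 = std::array<std::uint32_t, 4>;

// The two flags that PTEST/VPTEST write and that sete/setc read.
struct ptest_flags {
    bool zf;  // set when (src & dest) is all zero
    bool cf;  // set when (src & ~dest) is all zero
};

// Scalar model of `vptest src, dest` (AT&T operand order).
inline ptest_flags model_vptest(const vec128& src, const vec128& dest) {
    std::uint32_t and_bits = 0;   // accumulates src & dest over all lanes
    std::uint32_t andn_bits = 0;  // accumulates src & ~dest over all lanes
    for (std::size_t i = 0; i < src.size(); ++i) {
        and_bits |= src[i] & dest[i];
        andn_bits |= src[i] & ~dest[i];
    }
    return {and_bits == 0, andn_bits == 0};
}

// all_of(x == 0): vptest %xmm0, %xmm0 ; sete
inline bool model_simple_ptest(const vec128& x) {
    return model_vptest(x, x).zf;
}

// all_of((a & b) == 0): vptest %xmm0, %xmm1 ; sete
inline bool model_ptest_and(const vec128& a, const vec128& b) {
    return model_vptest(a, b).zf;
}

// all_of((a & ~b) == 0): vptest %xmm0, %xmm1 ; setc
inline bool model_ptest_andn(const vec128& a, const vec128& b) {
    return model_vptest(a, b).cf;
}
```

Under this model, each of the three predicates reads exactly one flag from a single vptest, which is why the hand-written versions need no separate compare against zero.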