https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118416

            Bug ID: 118416
           Summary: std::experimental::simd code detecting all zero is not
                    optimized to simple ptest on x86-64 avx
           Product: gcc
           Version: 14.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

The following code uses the experimental C++ standard library SIMD types
(std::experimental::simd) to detect several all-zero patterns, each of which
can be done with a single vptest instruction. All the code is available at
https://godbolt.org/z/Kx68E1T6v .

```c++
#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::int32_t, 4>;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}

bool ptest_and(data_t a, data_t b) {
    return all_of((a & b) == 0);
}

bool ptest_andn(data_t a, data_t b) {
    return all_of((a & ~b) == 0);
}
```

Equivalent assembly (hand-written):

```asm
simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret
```
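
For readers less familiar with the flag semantics, here is a minimal scalar
model of vptest in portable C++. It is only a sketch, not libstdc++ code;
`Vec128`, `Flags`, `vptest_model` and the `*_model` functions are hypothetical
names introduced for illustration. In AT&T syntax, `vptest src, dst` sets ZF
when `(dst & src) == 0` and CF when `(~dst & src) == 0`, which is why
`ptest_andn` reads the carry flag with `setc`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a 128-bit register with four 32-bit lanes.
using Vec128 = std::array<std::uint32_t, 4>;

struct Flags {
    bool zf;  // set when (dst & src) == 0 over all 128 bits
    bool cf;  // set when (~dst & src) == 0 over all 128 bits
};

// Models the flags produced by AT&T-syntax `vptest src, dst`.
Flags vptest_model(const Vec128& src, const Vec128& dst) {
    std::uint32_t and_bits = 0;
    std::uint32_t andn_bits = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        and_bits  |= dst[i] & src[i];
        andn_bits |= ~dst[i] & src[i];
    }
    return {and_bits == 0, andn_bits == 0};
}

// vptest %xmm0, %xmm0 ; sete   (x in xmm0)
bool simple_ptest_model(const Vec128& x) {
    return vptest_model(x, x).zf;
}

// vptest %xmm0, %xmm1 ; sete   (a in xmm0, b in xmm1)
bool ptest_and_model(const Vec128& a, const Vec128& b) {
    return vptest_model(a, b).zf;
}

// vptest %xmm0, %xmm1 ; setc   (CF is set when (~b & a) == 0)
bool ptest_andn_model(const Vec128& a, const Vec128& b) {
    return vptest_model(a, b).cf;
}
```

With a in xmm0 as the source and b in xmm1 as the destination, ZF answers
`all_of((a & b) == 0)` and CF answers `all_of((a & ~b) == 0)`, so one
instruction plus a flag read covers each of the three functions.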

But g++ generates the following code at `-O3 -march=x86-64-v3`, and clang++ and
even Intel icpx generate almost the same assembly.

```asm
simple_ptest(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >):
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqd        %xmm1, %xmm0, %xmm0
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vptest  %xmm1, %xmm0
        setc    %al
        ret
ptest_and(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >):
        vpand   %xmm1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqd        %xmm1, %xmm0, %xmm0
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vptest  %xmm1, %xmm0
        setc    %al
        ret
ptest_andn(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16> >):
        vpandn  %xmm0, %xmm1, %xmm1
        vpxor   %xmm0, %xmm0, %xmm0
        vpcmpeqd        %xmm0, %xmm1, %xmm1
        vpcmpeqd        %xmm0, %xmm0, %xmm0
        vptest  %xmm0, %xmm1
        setc    %al
        ret
```
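
The longer sequence performs the same test in two steps: `vpcmpeqd` against
zero builds a per-lane mask, and `vptest` against an all-ones register then
checks through CF that every mask lane is set. Below is a minimal scalar sketch
of that sequence for `simple_ptest`, assuming lanes can be modeled as plain
`std::uint32_t` values. `Vec128` and all function names are hypothetical, not
compiler or library internals.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a 128-bit register with four 32-bit lanes.
using Vec128 = std::array<std::uint32_t, 4>;

// Models vpcmpeqd: each lane becomes all-ones if the inputs match, else zero.
Vec128 vpcmpeqd_model(const Vec128& a, const Vec128& b) {
    Vec128 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
    }
    return r;
}

// CF produced by AT&T-syntax `vptest src, dst`: set when (~dst & src) == 0.
bool vptest_cf_model(const Vec128& src, const Vec128& dst) {
    std::uint32_t andn_bits = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        andn_bits |= ~dst[i] & src[i];
    }
    return andn_bits == 0;
}

// Mirrors the compiler-generated sequence for simple_ptest, step by step.
bool simple_ptest_generated(const Vec128& x) {
    const Vec128 zero{};                             // vpxor    %xmm1, %xmm1, %xmm1
    const Vec128 mask = vpcmpeqd_model(zero, x);     // vpcmpeqd %xmm1, %xmm0, %xmm0
    const Vec128 ones = vpcmpeqd_model(zero, zero);  // vpcmpeqd %xmm1, %xmm1, %xmm1
    return vptest_cf_model(ones, mask);              // vptest   %xmm1, %xmm0 ; setc
}

// What a single `vptest %xmm0, %xmm0 ; sete` computes: no bit of x is set.
bool simple_ptest_direct(const Vec128& x) {
    return (x[0] | x[1] | x[2] | x[3]) == 0;
}
```

Both functions return the same result for any input, so the extra `vpxor` and
two `vpcmpeqd` instructions only materialize a mask that the flags of a direct
`vptest` already encode.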

I don't know whether this should count as a missed optimization in g++ or as a
libstdc++ issue. Since all three compilers generate almost the same output from
the same library code, I guess this is probably a library issue.

Possibly related: PR58790 ?
