https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Intrinsics being type-agnostic causes vector subregs to appear before register
allocation: the pseudo coming from the load has mode V2DI, the shift must be
done in mode V4SI, and the bitwise OR and the store are done in mode V2DI
again. The subreg in the bitwise OR appears to be handled inefficiently; I
didn't dig deeper into what happens during allocation.
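For illustration, a sketch of an intrinsics variant that exhibits the mode
mixing described above (the function name is hypothetical; the load, OR and
store intrinsics operate on __m128i, i.e. V2DI, while _mm_srli_epi32 shifts
32-bit lanes, i.e. V4SI):

```c
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical intrinsics version: loads, OR and stores use __m128i
   (mode V2DI), but the 32-bit-lane shift works in mode V4SI, so subregs
   connect the two modes before register allocation. */
void intrinsics_variant(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0; i < 1024; i += 16) {
        __m128i in = _mm_loadu_si128((const __m128i *)(input + i)); /* V2DI */
        __m128i sh = _mm_srli_epi32(in, 4);                         /* V4SI */
        _mm_storeu_si128((__m128i *)(out + i), _mm_or_si128(in, sh)); /* V2DI */
    }
}
```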

FWIW, using generic vectors makes it possible to avoid introducing such
mismatches, and indeed the variant coded with generic vectors does not have
the extra loads. For your original code you will have to convert between
generic vectors and __m128i to use the shuffle intrinsic. The last paragraphs
of the "Vector Extensions" chapter [1] suggest using a union for that purpose
in C; in C++, reinterpreting via a union is formally UB, so another approach
would be needed (probably simply converting via assignment).

[1] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
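A minimal sketch of the union-based conversion in C (the helper names
to_m128i/to_u32v4 are made up for the example):

```c
#include <stdint.h>
#include <emmintrin.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

/* Reinterpret a generic vector as __m128i via a union, as [1] suggests
   for C. (Formally UB in C++; use a cast or assignment there instead.) */
static inline __m128i to_m128i(u32v4 v)
{
    union { u32v4 g; __m128i m; } u = { .g = v };
    return u.m;
}

static inline u32v4 to_u32v4(__m128i m)
{
    union { u32v4 g; __m128i m; } u = { .m = m };
    return u.g;
}
```

With these helpers a shuffle intrinsic can be dropped into otherwise
generic-vector code, e.g.
`to_u32v4(_mm_shuffle_epi32(to_m128i(v), _MM_SHUFFLE(0, 1, 2, 3)))`
reverses the four 32-bit lanes of `v`.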

#include <stdint.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

/* Process 1024 bytes, 16 at a time, using generic vectors only. */
void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    const u32v4 *vin = (const u32v4 *)input;
    u32v4 *vout = (u32v4 *)out;
    for (unsigned i = 0; i < 1024; i += 16) {
        u32v4 in = *vin++;
        *vout++ = in | (in >> 4); /* shift and OR, both in 32-bit lanes */
    }
}

Above code on Compiler Explorer: https://godbolt.org/z/MKPvxb
