https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> ---
Here is the problem:

        vmovd   .LC1(%rip), %xmm4       # 21    [c=4 l=10]  *movv4qi_internal/4
        ...
        vpmovzxbw       %xmm4, %xmm4    # 22    [c=10 l=6]  sse4_1_zero_extendv8qiv8hi2/2
        ...
        vpsrlvw %xmm1, %xmm4, %xmm1     # 24    [c=4 l=6]  avx512vl_lshrvv8hi
        ...
        vpmullw %xmm4, %xmm0, %xmm0     # 63    [c=4 l=4]  *mulv8hi3/1


.LC1:
        .byte   -52
        .byte   -52
        .byte   -52
        .byte   -52

The compiler loads .LC1 (actually { 204, 204, 204, 204 }) into %xmm4. Please
note that this only has 4 QImode elements. Following that, it uses VPMOVZXBW
which extends 8 QImode elements to 8 HImode elements. VPSRLVW actually uses
only 4 QImode elements, so everything is OK here.

However, VPMULLW needs all 8 QImode elements, but %xmm4 only has 4 loaded; the
high 4 elements are zero. This effectively clears the high four elements of the
multiplication result, and this is what the testcase detects.
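
To make the lane arithmetic concrete, here is a rough scalar model of the three
instructions in Python. It is only a sketch: the helper names mirror the
mnemonics but are hypothetical, the multiplicand in xmm0 is invented for
illustration, and nothing here is GCC output.

```python
# Scalar model of the lanes involved; the helpers are hypothetical stand-ins
# for the instructions, not real intrinsics.

def vmovd(src_bytes):
    """Model VMOVD: load 4 bytes into the low 32 bits of a 16-byte
    register and zero the remaining 12 bytes."""
    assert len(src_bytes) == 4
    return list(src_bytes) + [0] * 12

def vpmovzxbw(xmm_bytes):
    """Model VPMOVZXBW (xmm form): zero-extend the low 8 bytes
    to eight 16-bit lanes."""
    return [b & 0xFF for b in xmm_bytes[:8]]

def vpmullw(a, b):
    """Model VPMULLW: lane-wise multiply, keeping the low 16 bits."""
    return [(x * y) & 0xFFFF for x, y in zip(a, b)]

LC1 = [0xCC] * 4                  # .byte -52 is 204 (0xCC) as an unsigned byte
xmm4 = vpmovzxbw(vmovd(LC1))      # only the low four lanes carry 204
xmm0 = [1, 2, 3, 4, 5, 6, 7, 8]   # assumed multiplicand, for illustration only
product = vpmullw(xmm4, xmm0)

print(xmm4)     # [204, 204, 204, 204, 0, 0, 0, 0]
print(product)  # [204, 408, 612, 816, 0, 0, 0, 0]
```

The upper four lanes of the product are zero regardless of what xmm0 holds,
which matches the wrong result described above.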
