https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206
--- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> ---
Here is the problem:

        vmovd   .LC1(%rip), %xmm4       # 21    [c=4 l=10]  *movv4qi_internal/4
        ...
        vpmovzxbw       %xmm4, %xmm4    # 22    [c=10 l=6]  sse4_1_zero_extendv8qiv8hi2/2
        ...
        vpsrlvw %xmm1, %xmm4, %xmm1     # 24    [c=4 l=6]  avx512vl_lshrvv8hi
        ...
        vpmullw %xmm4, %xmm0, %xmm0     # 63    [c=4 l=4]  *mulv8hi3/1

.LC1:
        .byte   -52
        .byte   -52
        .byte   -52
        .byte   -52

The compiler loads .LC1 (actually { 204, 204, 204, 204 }) into %xmm4. Please
note that this only has 4 QImode elements. Following that, it uses VPMOVZXBW,
which extends 8 QImode elements to 8 HImode elements. VPSRLVW actually uses
only the low 4 elements, so everything is OK here. However, VPMULLW needs all
8 elements, but %xmm4 only has 4 loaded; the high 4 elements are zero. This
effectively clears the high four elements of the multiplication result, and
this is what the testcase detects.
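
To make the lane arithmetic above concrete, here is a minimal scalar sketch in C. It is not the testcase from this PR and it is not GCC code. It only models the three steps described above (a 4-byte load that zeroes the rest of the register, the 8-element zero extension, and the 8-element multiply). All function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };

/* Hypothetical model of the 4-byte load (vmovd): the low 4 bytes come
   from memory, the remaining 12 bytes of the register are zeroed.  */
static void load_v4qi(uint8_t reg[16], const uint8_t src[4])
{
    memset(reg, 0, 16);
    memcpy(reg, src, 4);
}

/* Hypothetical model of vpmovzxbw: zero-extend the low 8 bytes of the
   register to eight 16-bit elements.  */
static void zero_extend_v8qi_v8hi(uint16_t out[LANES], const uint8_t reg[16])
{
    for (int i = 0; i < LANES; i++)
        out[i] = reg[i];
}

/* Hypothetical model of vpmullw: keep the low 16 bits of each
   per-element product.  */
static void mul_v8hi(uint16_t out[LANES], const uint16_t a[LANES],
                     const uint16_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}

/* Multiply x by the constant as the generated code does and count how
   many elements of the product are nonzero.  */
static int nonzero_product_lanes(const uint16_t x[LANES])
{
    static const uint8_t lc1[4] = { 204, 204, 204, 204 }; /* .LC1 */
    uint8_t xmm4[16];
    uint16_t k[LANES], prod[LANES];
    int count = 0;

    load_v4qi(xmm4, lc1);            /* only 4 bytes are loaded */
    zero_extend_v8qi_v8hi(k, xmm4);  /* elements 4..7 become 0 */
    mul_v8hi(prod, x, k);            /* elements 4..7 of the product are 0 */

    for (int i = 0; i < LANES; i++)
        count += prod[i] != 0;
    return count;
}

/* Usage: with all-ones input, only the low 4 elements survive.  */
static void lane_model_self_check(void)
{
    const uint16_t ones[LANES] = { 1, 1, 1, 1, 1, 1, 1, 1 };
    assert(nonzero_product_lanes(ones) == 4);
}
```

With an all-ones input, a correct 8-element multiply by 204 would leave all eight elements nonzero. In this model only the low four are nonzero, which corresponds to the cleared high half that the testcase detects.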