https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115693
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Blocks|                            |53947
   Last reconfirmed|                            |2024-06-28
             Target|                            |x86_64-*-*
     Ever confirmed|0                           |1
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #1)
> I'm transferring it to tree-optimization as the following cases are compiled
> to stupid code:
>
> char a[8], b[8];
>
> int test()
> {
>   for (int i = 0; i < 8; i++)
>     if (a[i] != b[i])
>       return 0;
>
>   return 1;
> }
>
> int test1()
> {
>   int ret = 0;
>   for (int i = 0; i < 8; i++)
>     ret = ret || a[i] != b[i];
>
>   return ret;
> }
>
> So it makes more sense to fix this in the optimization passes, instead of
> ad-hoc hack in libstdc++.
>
> But I'm not sure if there already exists a dup.

Let's keep this bug for the above testcase(s).  For test() the issue is that
even with SSE4.1 we don't seem to support ptest for V8QImode?
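For illustration, a source-level sketch of what test() is semantically equivalent to. Both arrays are exactly 8 bytes, so reading all of them up front is safe and the early-exit loop reduces to a single 64-bit equality test. The name test_wide is hypothetical and does not come from the PR; this is only meant to show the target semantics, not what the vectorizer or the backend would have to emit.

```c
#include <stdint.h>
#include <string.h>

char a[8], b[8];

/* Loop from comment #1: returns 0 on the first mismatching byte.  */
int test(void)
{
  for (int i = 0; i < 8; i++)
    if (a[i] != b[i])
      return 0;

  return 1;
}

/* Hypothetical hand-written equivalent (not from the PR): load each
   array as one 64-bit value and compare once.  memcpy is used so the
   loads are well-defined regardless of aliasing or alignment.  */
int test_wide(void)
{
  uint64_t x, y;

  memcpy(&x, a, sizeof x);
  memcpy(&y, b, sizeof y);
  return x == y;
}
```

Whether the final code uses a scalar 64-bit compare, pcmpeqb plus pmovmskb, or ptest is a separate backend question; the sketch only pins down the result both forms have to agree on.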
For test1 cost modeling makes vectorization worthwhile, though with just SSE2
we get

test1:
.LFB1:
        .cfi_startproc
        movq    a(%rip), %xmm1
        pxor    %xmm2, %xmm2
        movq    b(%rip), %xmm0
        pcmpeqb %xmm1, %xmm0
        movq    .LC0(%rip), %xmm1
        pandn   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        punpcklbw       %xmm2, %xmm0
        punpcklbw       %xmm2, %xmm1
        pshufd  $78, %xmm0, %xmm0
        pxor    %xmm2, %xmm2
        movdqa  %xmm0, %xmm3
        punpcklwd       %xmm2, %xmm0
        punpcklwd       %xmm2, %xmm3
        pshufd  $78, %xmm0, %xmm0
        por     %xmm3, %xmm0
        movdqa  %xmm1, %xmm3
        punpcklwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm3
        pshufd  $78, %xmm1, %xmm1
        por     %xmm3, %xmm1
        por     %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrlq   $32, %xmm1
        por     %xmm1, %xmm0
        movd    %xmm0, %eax
        ret

with SSE4.2 it's a bit better and "just"

test1:
.LFB1:
        .cfi_startproc
        movq    a(%rip), %xmm1
        movq    b(%rip), %xmm0
        pcmpeqb %xmm1, %xmm0
        movq    .LC0(%rip), %xmm1
        pandn   %xmm1, %xmm0
        pmovzxbw        %xmm0, %xmm2
        psrlq   $32, %xmm0
        pmovzxbw        %xmm0, %xmm0
        pmovzxwd        %xmm0, %xmm1
        psrlq   $32, %xmm0
        pmovzxwd        %xmm0, %xmm0
        por     %xmm1, %xmm0
        pmovzxwd        %xmm2, %xmm1
        psrlq   $32, %xmm2
        pmovzxwd        %xmm2, %xmm2
        por     %xmm2, %xmm1
        por     %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrlq   $32, %xmm1
        por     %xmm1, %xmm0
        movd    %xmm0, %eax
        ret

but we fail to realize that the bitwise-OR reduction could be narrowed to
char:

  <bb 3> [local count: 954449106]:
  # ret_12 = PHI <iftmp.0_5(7), 0(15)>
  # i_14 = PHI <i_7(7), 0(15)>
  # ivtmp_4 = PHI <ivtmp_3(7), 8(15)>
  _1 = a[i_14];
  _2 = b[i_14];
  _8 = _1 != _2;
  _9 = (int) _8;
  iftmp.0_5 = _9 | ret_12;
  i_7 = i_14 + 1;
  ivtmp_3 = ivtmp_4 - 1;
  if (ivtmp_3 != 0)
    goto <bb 7>; [87.50%]

instead we keep 4 V2SImode "accumulators" and widen the compare results.  The
best would be if scalar opts would make this a bool reduction though IIRC we
have a PR for that being not handled.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
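
The narrowing of the bitwise-OR reduction discussed for test1 can be sketched at source level as follows. In the GIMPLE above each compare result is widened to int before the OR (_9 = (int) _8); since every such value is 0 or 1, the accumulation could be done in a byte and widened once after the loop. The name test1_narrow is hypothetical and this is only a sketch of the equivalence, not a statement about how the vectorizer would implement it.

```c
char a[8], b[8];

/* test1 from comment #1: the reduction is carried in an int.  */
int test1(void)
{
  int ret = 0;
  for (int i = 0; i < 8; i++)
    ret = ret || a[i] != b[i];

  return ret;
}

/* Hypothetical narrowed form (not from the PR): OR the 0/1 compare
   results together in an unsigned char and widen to int only once,
   on return.  */
int test1_narrow(void)
{
  unsigned char ret = 0;
  for (int i = 0; i < 8; i++)
    ret |= (unsigned char) (a[i] != b[i]);

  return ret;
}
```

Both functions return 0 when the arrays match and 1 otherwise; the narrowed form keeps the accumulator at the element width of the compare instead of at int width.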