On Wed, Jun 26, 2019 at 10:36 AM Richard Biener <rguent...@suse.de> wrote:
>
> On June 26, 2019 10:25:44 AM GMT+02:00, Uros Bizjak <ubiz...@gmail.com> wrote:
> >On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubiz...@gmail.com> wrote:
> >>
> >> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
> >> able to auto-vectorize:
> >
> >On a related note, following slightly changed testcase:
> >
> >void
> >foo (char *restrict r, char *restrict a)
> >{
> >  for (int i = 0; i < 24; i++)
> >    r[i] += a[i];
> >}
> >
> >compiles to:
> >
> >foo:
> >        vmovdqu (%rdi), %xmm1
> >        vpaddb  (%rsi), %xmm1, %xmm0
> >        movzbl  16(%rsi), %eax
> >        addb    %al, 16(%rdi)
> >        vmovups %xmm0, (%rdi)
> >        movzbl  17(%rsi), %eax
> >        addb    %al, 17(%rdi)
> >        movzbl  18(%rsi), %eax
> >        addb    %al, 18(%rdi)
> >        movzbl  19(%rsi), %eax
> >        addb    %al, 19(%rdi)
> >        movzbl  20(%rsi), %eax
> >        addb    %al, 20(%rdi)
> >        movzbl  21(%rsi), %eax
> >        addb    %al, 21(%rdi)
> >        movzbl  22(%rsi), %eax
> >        addb    %al, 22(%rdi)
> >        movzbl  23(%rsi), %eax
> >        addb    %al, 23(%rdi)
> >        ret
> >
> >One would expect that the remaining 8-byte array would also get
> >vectorized, resulting in one 16-byte operation and one 8-byte
> >operation.
>
> Try - - param vect-epilogue-nomask=1 (or so).
Yes, this (--param vect-epilogues-nomask=1) works!

foo:
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm2
        movq    16(%rsi), %xmm1
        paddb   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        movq    16(%rdi), %xmm0
        paddb   %xmm1, %xmm0
        movq    %xmm0, 16(%rdi)
        ret

Thanks,
Uros.
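
For reference, the split that the vectorized epilogue performs (one
16-byte operation for bytes 0..15, one 8-byte operation for bytes
16..23) can be written out by hand. The following is only a sketch using
GCC's generic vector extensions; the names v16qi, v8qi, foo_split and
check_foo_split are made up for illustration and do not come from the
compiler or the testsuite, and no claim is made that GCC emits exactly
the assembly above for it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical vector types: 16 and 8 lanes of char
   (GCC/Clang generic vector extension).  */
typedef char v16qi __attribute__ ((vector_size (16)));
typedef char v8qi __attribute__ ((vector_size (8)));

/* Scalar testcase from the thread.  */
void
foo (char *restrict r, char *restrict a)
{
  for (int i = 0; i < 24; i++)
    r[i] += a[i];
}

/* Hand-written equivalent (illustrative only): one 16-byte add
   for bytes 0..15, then one 8-byte add for bytes 16..23.
   memcpy is used for the unaligned loads and stores.  */
void
foo_split (char *restrict r, char *restrict a)
{
  v16qi r16, a16;
  v8qi r8, a8;

  /* Main vector body: bytes 0..15.  */
  memcpy (&r16, r, sizeof r16);
  memcpy (&a16, a, sizeof a16);
  r16 += a16;
  memcpy (r, &r16, sizeof r16);

  /* Epilogue: remaining bytes 16..23 as one 8-byte vector.  */
  memcpy (&r8, r + 16, sizeof r8);
  memcpy (&a8, a + 16, sizeof a8);
  r8 += a8;
  memcpy (r + 16, &r8, sizeof r8);
}

/* Runs both versions on the same small input, asserts that the
   results agree, and returns the last element of the result.
   Values are kept small so no char addition overflows.  */
int
check_foo_split (void)
{
  char r1[24], r2[24], a[24];

  for (int i = 0; i < 24; i++)
    {
      r1[i] = r2[i] = (char) i;
      a[i] = (char) (2 * i);
    }

  foo (r1, a);
  foo_split (r2, a);
  assert (memcmp (r1, r2, sizeof r1) == 0);
  return r1[23];
}
```

Calling check_foo_split () from a small main () is enough to confirm
that the two versions agree on this input. To compare against the
auto-vectorized code, compile foo with the --param quoted above and
inspect the output of -S.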