[Bug tree-optimization/94962] New: Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-05 Thread n...@self-evident.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

          Bug ID: 94962
         Summary: Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))
         Product: gcc
         Version: 10.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: tree-optimization
        Assignee: unassigned at gcc dot gnu.org
        Reporter: n...@self-evident.org
Target Milestone: ---

Background: https://stackoverflow.com/q/61601902/

GCC emits an unnecessary "vmovdqa xmm0,xmm0" for the following code:

#include <immintrin.h>

__m256i mask()
{
    return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
}

Live example on godbolt: https://gcc.godbolt.org/z/PbsQDR

I have found no way to avoid this except by resorting to inline asm.

[Bug target/94962] Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-18 Thread n...@self-evident.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

--- Comment #5 from Nemo ---
(In reply to Jakub Jelinek from comment #2)

I would be happy if GCC could just emit optimal code (a single vpcmpeqd
instruction) for this useful constant:

_mm256_set_m128i(_mm_setzero_si128(), _mm_set1_epi8(-1))

aka.

_mm256_inserti128_si256(_mm256_setzero_si256(), _mm_set1_epi8(-1), 0)


(The latter is just what GCC uses to implement _mm256_zextsi128_si256, if I am
reading the headers correctly.)

It's a minor thing, but I was a little surprised to find that none of the
compilers I know of can do this, at least not with any input I tried.
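
As a sanity check that the two spellings above denote the same constant (low
128 bits all ones, high 128 bits zero), here is a minimal sketch. It assumes
GCC or Clang on x86; the names masks_match and check_masks are made up for
illustration, and the target attribute is only there so the snippet builds
without -mavx2. It checks the value only and says nothing about which
instructions the compiler emits.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper: builds the constant both ways and compares the bytes. */
__attribute__((target("avx2")))
static int masks_match(void)
{
    /* _mm256_set_m128i takes (high, low): high lane zero, low lane all ones. */
    const __m256i a = _mm256_set_m128i(_mm_setzero_si128(), _mm_set1_epi8(-1));
    /* Insert the all-ones vector into lane 0 of an all-zero 256-bit vector. */
    const __m256i b = _mm256_inserti128_si256(_mm256_setzero_si256(),
                                              _mm_set1_epi8(-1), 0);

    unsigned char expected[32];
    memset(expected, 0xFF, 16);        /* bits 0..127   */
    memset(expected + 16, 0x00, 16);   /* bits 128..255 */

    unsigned char got_a[32], got_b[32];
    _mm256_storeu_si256((__m256i *)got_a, a);
    _mm256_storeu_si256((__m256i *)got_b, b);

    return memcmp(got_a, expected, 32) == 0 &&
           memcmp(got_b, expected, 32) == 0;
}

/* Returns 1 if both spellings match, 0 if not, -1 if the CPU lacks AVX2. */
int check_masks(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    return masks_match();
}
```

To see the code generation this report is about, compile a function that
returns the constant with -O2 -mavx2 and inspect the assembly.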

[Bug target/94962] Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-19 Thread n...@self-evident.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

--- Comment #7 from Nemo ---
(In reply to Hongtao.liu from comment #6)
>
> vmovdqa xmm0, xmm0 is not redundant here, it would clear up 128-256 bit
> which is the meaning of `zext`.

No, it is redundant: "vpcmpeqd xmm0, xmm0, xmm0" is VEX-encoded, so it already
zeroes the high lane (bits 128-255) of ymm0.
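
To spell the argument out, here is a small portable C model of the register
behaviour. It is only a sketch: ymm_model and the model_* functions are
hypothetical names, and the one architectural rule it encodes is that a
VEX-encoded instruction with an xmm destination zeroes the upper bits of the
corresponding ymm register. Under that rule, the move that follows vpcmpeqd
changes nothing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 256-bit ymm register as 32 bytes. */
typedef struct { uint8_t b[32]; } ymm_model;

/* A VEX.128 write: the low 16 bytes take the result, the upper 16 are zeroed. */
static void vex128_write(ymm_model *dst, const uint8_t lo[16])
{
    memcpy(dst->b, lo, 16);
    memset(dst->b + 16, 0, 16);
}

/* Models "vpcmpeqd xmm, xmm, xmm" on one register: x == x in every dword. */
static void model_vpcmpeqd_self(ymm_model *r)
{
    uint8_t ones[16];
    memset(ones, 0xFF, 16);
    vex128_write(r, ones);
}

/* Models "vmovdqa xmm_dst, xmm_src": copy the low lane, zero the upper lane. */
static void model_vmovdqa_xmm(ymm_model *dst, const ymm_model *src)
{
    uint8_t lo[16];
    memcpy(lo, src->b, 16);   /* copy first so dst == src is safe */
    vex128_write(dst, lo);
}

/* Returns 1 if the move leaves the register unchanged after vpcmpeqd. */
int vmovdqa_is_redundant_in_model(void)
{
    ymm_model r;
    memset(r.b, 0xAB, sizeof r.b);   /* arbitrary stale contents */

    model_vpcmpeqd_self(&r);
    ymm_model before = r;
    model_vmovdqa_xmm(&r, &r);

    return memcmp(before.b, r.b, sizeof r.b) == 0;
}
```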