https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Haochen Jiang from comment #12)
> (In reply to Hongtao Liu from comment #11)
> > (In reply to Haochen Jiang from comment #10)
> > > A patch like the one in Comment 8 could definitely solve the problem. But
> > > I need to test more benchmarks to see if there are any surprises.
> > > 
> > > But, yes, as Uros said in Comment 9, maybe there is a chance we could do
> > > it better.
> > 
> > Could you add "arch=skylake-avx512" to target_clones and try disabling the
> > whole ix86_expand_vecop_qihi2 to see if there's any performance improvement?
> > For x86, cross-lane permutation (truncation) is not very efficient (3-4
> > cycles for both vpermq and vpmovwb).
> 
> When I disable/enable ix86_expand_vecop_qihi2 with arch=skylake-avx512 on
> trunk, there is no performance regression compared to GCC13 + avx2.
> 
> It seems that the regression only happens with GCC14 + avx2.

This is what the patch in Comment #8 prevents. skylake-avx512 enables
TARGET_AVX512BW, so VPMOVWB is emitted instead of the problematic VPERMQ.
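
For reference, a minimal sketch of the kind of test function discussed above.
The function name byte_mul, the BYTE_MUL_CLONES macro and the choice of an
8-bit multiply are assumptions made for illustration, not the reporter's
actual code; whether the loop ends up in ix86_expand_vecop_qihi2 depends on
the optimization level and on the vectorizer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Function multiversioning needs x86 and ifunc support (assumed here to
   mean glibc); elsewhere fall back to a plain function so this builds. */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)
#define BYTE_MUL_CLONES \
  __attribute__((target_clones("default", "avx2", "arch=skylake-avx512")))
#else
#define BYTE_MUL_CLONES
#endif

/* Hypothetical kernel: element-wise 8-bit multiply, truncated to 8 bits.
   When vectorized, each clone is expanded for its own target, so the
   avx2 and skylake-avx512 code paths can be compared side by side. */
BYTE_MUL_CLONES
void byte_mul(uint8_t *restrict out, const uint8_t *restrict a,
              const uint8_t *restrict b, size_t n)
{
  assert(n == 0 || (out && a && b));
  for (size_t i = 0; i < n; i++)
    out[i] = (uint8_t)(a[i] * b[i]);
}
```

Compiling this with -O3 -S and comparing the per-target clones in the
assembly output should show which truncation sequence each clone uses.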
