https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069
--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #13)
> (In reply to Haochen Jiang from comment #12)
> > (In reply to Hongtao Liu from comment #11)
> > > (In reply to Haochen Jiang from comment #10)
> > > > A patch like Comment 8 could definitely solve the problem. But I need to
> > > > test more benchmarks to see if there is surprise.
> > > >
> > > > But, yes, as Uros said in Comment 9, maybe there is a chance we could
> > > > do it better.
> > >
> > > Could you add "arch=skylake-avx512" to target_clones and try disable whole
> > > ix86_expand_vecop_qihi2 to see if there's any performance improvement?
> > > For x86, cross-lane permutation(truncation) is not very efficient(3-4
> > > cycles for both vpermq and vpmovwb).
> >
> > When I disable/enable ix86_expand_vecop_qihi2 with arch=skylake-avx512 on
> > trunk, there is no performance regression comparing to GCC13 + avx2.
> >
> > It seems that the regression only happens when GCC14 + avx2.
>
> This is what the patch in Comment #8 prevents. skylake-avx512 enables
> TARGET_AVX512BW, so VPMOVB is emitted instead of problematic VPERMQ.

Yes, the patch looks good to me.
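
For anyone who wants to repeat the comparison, the target_clones setup discussed above could look roughly like the sketch below. The function name mul_u8 and the byte-multiply loop are assumptions made for illustration and are not the test case attached to this PR. Only the "arch=skylake-avx512" clone string comes from comment #11. It needs an x86 target with ifunc support.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel (not the PR's test case): element-wise 8-bit
   multiply.  target_clones asks GCC to build one version per listed
   target and to pick one at load time through an ifunc resolver.  */
__attribute__((target_clones("default", "avx2", "arch=skylake-avx512")))
void mul_u8(uint8_t *restrict dst, const uint8_t *restrict a,
            const uint8_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] * b[i]);  /* product truncated to 8 bits */
}
```

Building this at -O3 and comparing the assembly of the avx2 and skylake-avx512 clones should show which truncation sequence each one gets.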