On Wed, Sep 17, 2025 at 7:59 AM Hongtao Liu <crazy...@gmail.com> wrote: > > On Wed, Sep 17, 2025 at 2:12 PM Hongtao Liu <crazy...@gmail.com> wrote: > > > > On Wed, Sep 17, 2025 at 2:08 PM Hongtao Liu <crazy...@gmail.com> wrote: > > > > > > On Tue, Sep 16, 2025 at 3:53 PM Liu, Hongtao <hongtao....@intel.com> > > > wrote: > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Richard Biener <richard.guent...@gmail.com> > > > > > Sent: Tuesday, September 16, 2025 3:03 PM > > > > > To: Liu, Hongtao <hongtao....@intel.com> > > > > > Cc: gcc-patches@gcc.gnu.org; hjl.to...@gmail.com > > > > > Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move,store}_by > > > > > pieces tune. > > > > > > > > > > On Tue, Sep 16, 2025 at 7:53 AM liuhongt <hongtao....@intel.com> > > > > > wrote: > > > > > > > > > > > > From: "hongtao.liu" <hongtao....@intel.com> > > > > > > > > > > > > Align move_max with prefer_vector_width for SPR/GNR/DMR to avoid > > > > > > STLF > > > > > issue. > > > > > > It's similar as previous commit. > > > > > > > > > > > > commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123 > > > > > > Author: liuhongt <hongtao....@intel.com> > > > > > > Date: Thu Aug 15 12:54:07 2024 +0800 > > > > > > > > > > > > Align ix86_{move_max,store_max} with vectorizer. > > > > > > > > > > > > When none of mprefer-vector-width, > > > > > > avx256_optimal/avx128_optimal, > > > > > > avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC > > > > > > will > > > > > > set ix86_{move_max,store_max} as max available vector length > > > > > > except > > > > > > for AVX part. > > > > > > > > > > > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) > > > > > > && TARGET_EVEX512_P (opts->x_ix86_isa_flags2)) > > > > > > opts->x_ix86_move_max = PVW_AVX512; > > > > > > else > > > > > > opts->x_ix86_move_max = PVW_AVX128; > > > > > > > > > > > > So for -mavx2, vectorizer will choose 256-bit for > > > > > > vectorization, but > > > > > > 128-bit is used for struct copy, there could be a potential > > > > > > STLF issue > > > > > > due to this "misalign". > > > > > > > > > > > > > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > > > > > Ready push to trunk. > > > > > > > > > > Not an objection to this patch, but I wonder if we have too many > > > > > tunables > > > > > where these kind of interactions cause more problems than positives by > > > > > having them? > > > > My experience is we should align MOVE_MAX with prefer_vector_width, > > > > vectorization and memcpy are 2 main parts which create vector > > > > load/store, they'd better align with each other. > > > > > > > > > > As general comment I'd say store-by-pieces with larger modes is > > > > > always good, > > > > > but the read part for move-by-pieces can cause STLF fails, not only > > > > > from > > > > Yes, avx512_store_by_piece is still good, I'll adjust my patch. > > > > > > #define MOVE_MAX \ > > > ((TARGET_AVX512F \ > > > && (ix86_move_max == PVW_AVX512 \ > > > || ix86_store_max == PVW_AVX512)) \ > > > ? 64 \ > > > > > > Since MOVE_MAX in x86 is also related to ix86_store_max, I'll still > > > remove SPR/GNR/DMR from avx512_store_by_pieces. > > > And why do we have ix86_store_max in MOVE_MAX? > > > I guess, it's because: > > > In the middle-end, the real size used for memset is decided by MIN > > > (MOVE_MAX_PIECES(alignment request), STORE_MAX_PIECES) > > > > > > Maybe we should remove the option mstore-max= and > > > {avx512,avx256}_store_by_pieces. > > > Since they eventually have the same impact as just setting ix86_move_max. > > Or maybe we should even remove -mmove-max and just set MOVE_MAX as > > prefer_vector_width due to STLF issue. > But for the double-pumping processor, store 256 followed by 512 load > probably ok, no STLF issue.
Zen4 at least cannot forward this. Also even Zen5 has a 256bit wide store buffer, so stores are split. Curiously 512bit load/store can forward just fine (but not 2x256bit store -> 512bit load). In the past it was said the x86 memory model makes any forwarding from multiple pieces "difficult" at least (though I don't remember any details). Can the Intel small cores forward 2xSSE store to 1xAVX load? > So mmove-max could still be useful for that scenario. > > > > > > > > > > > > > > > vectorization but of course also from (non-contigous) scalar stores > > > > > performed > > > > > by user code. > > > > > > > > > > I'm always advising CPU implementors to reduce STLF fail penalties > > > > > (or allow > > > > > more forwarding cases). On the compilers side we'd have to see to do > > > > > less > > > > > "local" (per stmt) decisions on what modes to use but take into > > > > > account > > > > > context - the post-fact fixup by the avoid-store-forwarding pass is > > > > > likely not > > > > > very effective? > > > > I think the pass can't handle inter-procedure pointer analysis, and > > > > there're memcpy and vectorization in caller/callee with different size, > > > > worst case is small store in the caller followed by big load in the > > > > callee. > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > * config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): > > > > > > (X86_TUNE_AVX512_STORE_BY_PIECES): > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > > > * gcc.target/i386/pieces-memcpy-18.c: Use -mtune=znver5 > > > > > > instead of -mtune=sapphirerapids. > > > > > > * gcc.target/i386/pieces-memcpy-21.c: Ditto. > > > > > > * gcc.target/i386/pieces-memset-46.c: Ditto. > > > > > > * gcc.target/i386/pieces-memset-49.c: Ditto. > > > > > > --- > > > > > > gcc/config/i386/x86-tune.def | 6 ++---- > > > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c | 2 +- > > > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c | 2 +- > > > > > > gcc/testsuite/gcc.target/i386/pieces-memset-46.c | 2 +- > > > > > > gcc/testsuite/gcc.target/i386/pieces-memset-49.c | 2 +- > > > > > > 5 files changed, 6 insertions(+), 8 deletions(-) > > > > > > > > > > > > diff --git a/gcc/config/i386/x86-tune.def > > > > > > b/gcc/config/i386/x86-tune.def index a86cbad281c..8e642aa92a2 100644 > > > > > > --- a/gcc/config/i386/x86-tune.def > > > > > > +++ b/gcc/config/i386/x86-tune.def > > > > > > @@ -625,14 +625,12 @@ DEF_TUNE > > > > > (X86_TUNE_AVX256_STORE_BY_PIECES, > > > > > > "avx256_store_by_pieces", > > > > > > /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with > > > > > 512-bit > > > > > > AVX instructions. */ > > > > > > DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, > > > > > "avx512_move_by_pieces", > > > > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D > > > > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5) > > > > > > + m_ZNVER4 | m_ZNVER5) > > > > > > > > > > > > /* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with > > > > > 512-bit > > > > > > AVX instructions. */ > > > > > > DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, > > > > > "avx512_store_by_pieces", > > > > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D > > > > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5) > > > > > > + m_ZNVER4 | m_ZNVER5) > > > > > > > > > > > > /* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for > > > > > 512-bit > > > > > > vectorized loops. */ > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > > index b15a0db9ff0..b4995ac0598 100644 > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > > @@ -1,5 +1,5 @@ > > > > > > /* { dg-do compile } */ > > > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */ > > > > > > +/* { dg-options "-O2 -march=znver5" } */ > > > > > > > > > > > > extern char *dst, *src; > > > > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > > index ef439f20f74..804a2989d64 100644 > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > > @@ -1,5 +1,5 @@ > > > > > > /* { dg-do compile } */ > > > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } > > > > > > */ > > > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */ > > > > > > > > > > > > extern char *dst, *src; > > > > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > > index be1b054eed2..43d636ee3ff 100644 > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > > @@ -1,5 +1,5 @@ > > > > > > /* { dg-do compile } */ > > > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */ > > > > > > +/* { dg-options "-O2 -march=znver5" } */ > > > > > > > > > > > > extern char *dst; > > > > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > > index ad43f89a9bd..ca4933ac1d8 100644 > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > > @@ -1,5 +1,5 @@ > > > > > > /* { dg-do compile } */ > > > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } > > > > > > */ > > > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */ > > > > > > > > > > > > extern char *dst; > > > > > > > > > > > > -- > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > -- > > > BR, > > > Hongtao > > > > > > > > -- > > BR, > > Hongtao > > > > -- > BR, > Hongtao