On Jan 16, 2023, Richard Biener <richard.guent...@gmail.com> wrote: > On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva <ol...@adacore.com> wrote: >> Target-specific code is great for tight optimizations, but the main >> purpose of this feature is not an optimization. AFAICT it actually >> slows things down in general (due to code growth, and to conservative >> assumptions about alignment), except perhaps for some microbenchmarks. >> It's rather a means to avoid depending on the C runtime, particularly >> due to compiler-introduced memset calls.
> OK, that's what I guessed but you didn't spell out. So does it make sense > to mention -ffreestanding in the docs at least? My fear is that we'd get > complaints that -O3 -finline-memset-loops turns nicely optimized memset > loops into dumb ones (via loop distribution and then stupid re-expansion). > So does it also make sense to turn off -floop-distribute-patterns[-memset] > with -finline-memset-loops? I don't think they should be tied together. Verbose as it is, the expansion of memset is a sort of local optimum given what the compiler knows about length and alignment: minimizing what can be minimized, namely the compare count, by grouping stores in large straight-line blocks. Though an optimized memset could in theory perform better, whether through ifuncs or by bumping alignment, if you're tuning generated code for a specific target, and you know dest is aligned, the generated code can likely beat a general-purpose optimized memset, even if by a thin margin, such as the code that the general-purpose memset would have to run to detect the alignment and realize it doesn't need to be bumped, and to extend the byte value to be stored to wider modes. So I can envision cases in which -floop-distribute-patterns could turn highly inefficient stores into a memset with known-strict alignment and length multiplier, that could then be profitably expanded inline so as to take advantage of both for performance reasons. Indeed, when I started working on this, I thought the issue was performance, and this led me to pursue the store-by-multiple-pieces logic. It can indeed bring about performance improvements, both over generic-target and highly-optimized memset implementations. But it can also be put to use to avoid C runtime calls. So while I wouldn't suggest enabling it by default at any optimization level, I wouldn't tie it to the single purpose of freestanding environments either. >> My initial goal was to be able to show that inline expansion would NOT >> bring about performance improvements, but performance was not the >> concern that led to the request. >> >> If the approach seems generally acceptable, I may even end up extending >> it to other such builtins. I have a vague recollection that memcmp is >> also an issue for us. > The C/C++ runtime produce at least memmove, memcpy and memcmp as well. *nod*. The others are far more challenging to expand inline in a way that could potentially be more performant: - memcmp can only do by_pieces when testing for equality, presumably because grouping multiple bytes into larger words to speed things up won't always get you the expected result if you just subtract the larger words, endianness reversal prior to subtracting might be required, which would harm performance. I don't see that using similar power-of-two-sizes grouping strategies to minimize looping overheads would be so advantageous, if at all, given the need for testing and branching at every word. - memcpy seems doable, but all of the block moves other than cpymem assume non-overlapping memcpy. Even if we were to output a test for overlap that a naïve expansion would break, and an alternate block to go backwards, all of the block copying logic would have to be "oriented" to proceed explicitly forward, backward, or don't-care, where currently we only have don't-care. Though my initial plan, when posting this patch, was to see how well the general approach was received, before thinking much about how to apply it to the other builtins, now I am concerned that extending it to them is more than I can tackle. Would it make more sense to extend it, even constrained by the limitations mentioned above, or handle memset only? In the latter case, would it still make sense to adopt a command-line option that suggests a broader effect than it already has, even if it's only a hopeful future extension? -finline-all-stringops[={memset,memcpy,...}], that you suggested, seems to be a reasonable and extensible one to adopt. >> Is (optionally) tending to this (uncommon, I suppose) need (or >> preference?) not something GCC would like to do? > Sure, I think for the specific intended purpose that would be fine. Cool! > It should also only apply to __builtin_memset calls, not to memset > calls from user code? I suppose it could be argued both ways. The situations that I had in mind either already are or could be made __builtin_memset calls, but I can't think of reasons to prevent explicit memset calls from such expansion. Do you have any specific purpose in mind? In favor of allowing non-__builtin_ memset to be so expanded, I'll mention that I caught a number of corner cases while developing the following improved version of the earlier patch (now handling even large-constant lengths) by bootstrapping the compiler with the option enabled by default. Without that, testcases would have to be a *lot* more thorough. Introduce -finline-memset-loops From: Alexandre Oliva <ol...@adacore.com> try_store_by_multiple_pieces was added not long ago, enabling variable-sized memset to be expanded inline when the worst-case in-range constant length would, using conditional blocks with powers of two to cover all possibilities of length and alignment. This patch extends the memset expansion to start with a loop, so as to still take advantage of known alignment even with long lengths, but without necessarily adding store blocks for every power of two. This makes it possible for any memset call to be expanded, even if storing a single byte per iteration. Surely efficient implementations of memset can do better, with a pre-loop to increase alignment, but that would likely be excessive for inline expansions of memset. Still, in some cases, users prefer to inline memset, even if it's not as performant, or when it's known to be performant in ways the compiler can't tell, to avoid depending on a C runtime library. With this flag, global or per-function optimizations may enable inline expansion of memset to be selectively requested, while the infrastructure for that may enable us to introduce per-target tuning to enable such looping even advantageous, even if not explicitly requested. for gcc/ChangeLog * builtins.cc (try_store_by_multiple_pieces): Support starting with a loop. * common.opt (finline-memset-loops): New. * doc/invoke.texi (-finline-memset-loops): Add. for gcc/testsuite/ChangeLog * gcc.dg/torture/inline-mem-set-1.c: New. --- gcc/builtins.cc | 99 +++++++++++++++++++++-- gcc/common.opt | 4 + gcc/doc/invoke.texi | 13 +++ gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c | 84 ++++++++++++++++++++ 4 files changed, 192 insertions(+), 8 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c diff --git a/gcc/builtins.cc b/gcc/builtins.cc index af45829875e6c..733fe17ede622 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -4322,6 +4322,10 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len, int tst_bits = (max_bits != min_bits ? max_bits : floor_log2 (max_len ^ min_len)); + /* Save the pre-blksize values. */ + int orig_max_bits = max_bits; + int orig_tst_bits = tst_bits; + /* Check whether it's profitable to start by storing a fixed BLKSIZE bytes, to lower max_bits. In the unlikely case of a constant LEN (implied by identical MAX_LEN and MIN_LEN), we want to issue a @@ -4361,9 +4365,70 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len, if (max_bits >= 0) xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2 - (HOST_WIDE_INT_1U << ctz_len)); - if (!can_store_by_pieces (xlenest, builtin_memset_read_str, - &valc, align, true)) - return false; + bool max_loop = false; + /* Skip the test in case of overflow in xlenest. It shouldn't + happen because of the way max_bits and blksize are related, but + it doesn't hurt to test. */ + if (blksize > xlenest + || !can_store_by_pieces (xlenest, builtin_memset_read_str, + &valc, align, true)) + { + if (!flag_inline_memset_loops) + return false; + + for (max_bits = orig_max_bits; + max_bits >= sctz_len; + --max_bits) + { + xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2 + - (HOST_WIDE_INT_1U << ctz_len)); + /* Check that blksize plus the bits to be stored as blocks + sized at powers of two can be stored by pieces. This is + like the test above, but with smaller max_bits. Skip + orig_max_bits (it would be redundant). Also skip in case + of overflow. */ + if (max_bits < orig_max_bits + && xlenest + blksize >= xlenest + && can_store_by_pieces (xlenest + blksize, + builtin_memset_read_str, + &valc, align, true)) + { + max_loop = true; + break; + } + if (blksize + && can_store_by_pieces (xlenest, + builtin_memset_read_str, + &valc, align, true)) + { + max_len += blksize; + min_len += blksize; + tst_bits = orig_tst_bits; + blksize = 0; + max_loop = true; + break; + } + if (max_bits == sctz_len) + { + --sctz_len; + --ctz_len; + } + } + if (!max_loop) + return false; + /* If the boundaries are such that min and max may run a + different number of trips in the initial loop, the remainder + needs not be between the moduli, so set tst_bits to cover all + bits. Otherwise, if the trip counts are the same, max_len + has the common prefix, and the previously-computed tst_bits + is usable. */ + if (max_len >> max_bits > min_len >> max_bits) + tst_bits = max_bits; + } + /* ??? Do we have to check that all powers of two lengths from + max_bits down to ctz_len pass can_store_by_pieces? As in, could + it possibly be that xlenest passes while smaller power-of-two + sizes don't? */ by_pieces_constfn constfun; void *constfundata; @@ -4405,7 +4470,9 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len, the least significant bit possibly set in the length. */ for (int i = max_bits; i >= sctz_len; i--) { + rtx_code_label *loop_label = NULL; rtx_code_label *label = NULL; + blksize = HOST_WIDE_INT_1U << i; /* If we're past the bits shared between min_ and max_len, expand @@ -4419,18 +4486,31 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len, profile_probability::even ()); } /* If we are at a bit that is in the prefix shared by min_ and - max_len, skip this BLKSIZE if the bit is clear. */ - else if ((max_len & blksize) == 0) + max_len, skip the current BLKSIZE if the bit is clear, but do + not skip the loop, even if it doesn't require + prechecking. */ + else if ((max_len & blksize) == 0 + && !(max_loop && i == max_bits)) continue; + if (max_loop && i == max_bits) + { + loop_label = gen_label_rtx (); + emit_label (loop_label); + /* Since we may run this multiple times, don't assume we + know anything about the offset. */ + clear_mem_offset (to); + } + /* Issue a store of BLKSIZE bytes. */ + bool update_needed = i != sctz_len || loop_label; to = store_by_pieces (to, blksize, constfun, constfundata, align, true, - i != sctz_len ? RETURN_END : RETURN_BEGIN); + update_needed ? RETURN_END : RETURN_BEGIN); /* Adjust REM and PTR, unless this is the last iteration. */ - if (i != sctz_len) + if (update_needed) { emit_move_insn (ptr, force_operand (XEXP (to, 0), NULL_RTX)); to = replace_equiv_address (to, ptr); @@ -4438,6 +4518,11 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len, emit_move_insn (rem, force_operand (rem_minus_blksize, NULL_RTX)); } + if (loop_label) + emit_cmp_and_jump_insns (rem, GEN_INT (blksize), GE, NULL, + ptr_mode, 1, loop_label, + profile_probability::likely ()); + if (label) { emit_label (label); diff --git a/gcc/common.opt b/gcc/common.opt index d0371aec8db5f..5d28f054241ad 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -1874,6 +1874,10 @@ finline-atomics Common Var(flag_inline_atomics) Init(1) Optimization Inline __atomic operations when a lock free instruction sequence is available. +finline-memset-loops +Common Var(flag_inline_memset_loops) Init(0) Optimization +Inline memset even if it requires loops. + fcf-protection Common RejectNegative Alias(fcf-protection=,full) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 631c00582bf85..8f8d52bbeef68 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -548,7 +548,8 @@ Objective-C and Objective-C++ Dialects}. -fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol -fif-conversion2 -findirect-inlining @gol -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol --finline-small-functions -fipa-modref -fipa-cp -fipa-cp-clone @gol +-finline-memset-loops -finline-small-functions @gol +-fipa-modref -fipa-cp -fipa-cp-clone @gol -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const @gol -fipa-reference -fipa-reference-addressable @gol -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} @gol @@ -11961,6 +11962,16 @@ in its own right. Enabled at levels @option{-O1}, @option{-O2}, @option{-O3} and @option{-Os}, but not @option{-Og}. +@item -finline-memset-loops +@opindex finline-memset-loops +Expand @code{memset} calls inline, even when the length is variable or +big enough as to require looping. This may enable the compiler to take +advantage of known alignment and length multipliers, but it will often +generate code that is less efficient than performant implementations of +@code{memset}, and grow code size so much that even a less performant +@code{memset} may run faster due to better use of the code cache. This +option is disabled by default. + @item -fearly-inlining @opindex fearly-inlining Inline functions marked by @code{always_inline} and functions whose body seems diff --git a/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c new file mode 100644 index 0000000000000..4de51df006ede --- /dev/null +++ b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c @@ -0,0 +1,84 @@ +/* { dg-do compile } */ +/* { dg-options "-finline-memset-loops -gno-record-gcc-switches -fno-lto" } */ + +void *zero (unsigned long long (*p)[32], int n) +{ + return __builtin_memset (p, 0, n * sizeof (*p)); +} + +void *ones (char (*p)[128], int n) +{ + return __builtin_memset (p, -1, n * sizeof (*p)); +} + +void *opt2 (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 1024 : 2) * sizeof (*p)); +} + +void *opt8 (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 1024 : 8) * sizeof (*p)); +} + +void *opt32 (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 1024 : 32) * sizeof (*p)); +} + +void *opt128 (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 1024 : 128) * sizeof (*p)); +} + +void *opt512 (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 1024 : 512) * sizeof (*p)); +} + +void *opt_primes (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 509 : 7) * sizeof (*p)); +} + +void *opt_primes_blk (int *p, int i) +{ + return __builtin_memset (p, 0, (i ? 521 : 9) * sizeof (*p)); +} + +void *huge (long (*p)[16384]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +void *hugep1 (long (*p)[16384+1]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +void *hugep4 (long (*p)[16384+4]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +void *hugep16 (long (*p)[16384+16]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +void *hugep64 (long (*p)[16384+64]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +void *hugep256 (long (*p)[16384+256]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +void *hugep1024p256p64p16p4p1 (long (*p)[16384+1024+64+16+4+1]) +{ + return __builtin_memset (p, 0, sizeof (*p)); +} + +/* { dg-final { scan-assembler-not "memset" } } */ -- Alexandre Oliva, happy hacker https://FSFLA.org/blogs/lxo/ Free Software Activist GNU Toolchain Engineer Disinformation flourishes because many people care deeply about injustice but very few check the facts. Ask me about <https://stallmansupport.org>