On Sep 2, 2020, Segher Boessenkool <seg...@kernel.crashing.org> wrote:
> Hi!

> On Tue, Sep 01, 2020 at 07:22:57PM -0300, Alexandre Oliva wrote:
>> It's ugly, I know.

> Yeah, it's bloody disgusting :-)

:-)

> (And aarch, but not the other targets with default clobbers).

And arm.  I didn't look very hard for others yet; I wasn't sure it
would be worth pursuing any further.

>> even special-casing empty asm patterns (at a slight risk of breaking
>> code that expects implicit clobbers even for empty asm patterns, but
>> empty asm patterns won't clobber flags, so how could it break
>> anything?).

> People write empty asm statements not because they would like no insns
> emitted from it, but *because* they want the other effects an asm has
> (for example, an empty asm usually has no outputs, so it is volatile,
> and that makes sure it is executed in the real machine exactly as often
> as in the abstract machine).  So your expectation might be wrong,
> someone might want an empty asm to clobber cc on x86 (like any asm is
> documented as doing).

I just can't imagine a case in which flags set by an earlier
instruction, still usable by a subsequent instruction, could possibly
be clobbered by an *empty* sequence of insns.  Maybe I just lack
imagination about all the sorts of weird things that people use such
asm statements for...  Got any example of something that could be
broken by such a change, to educate me?  Not that I need it, I'm just
curious.

> But how about a "none" clobber?

I like that!

>> I take this might be useful for do-nothing asm
>> statements, often used to stop certain optimizations, e.g.:
>>
>>   __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
>>   asm ("" : "+m" (*p)); // Make sure we write to memory.
>>   *p += c; // This should compile into an add with carry.

> Wow, nasty.  That asm cannot be optimised away even if the rest is
> (unless GCC can somehow figure out nothing ever uses *p).  Is there no
> better way to do this?

I don't know.  There's a piece of hand-crafted x86 assembly that I'm
trying to find some way to replicate in compiler-optimized code.

I don't know whether two RMWs could possibly be faster, but there are
two concerns driving me to try to use the same insn sequence:

- compactness: this is for per-block instrumentation, and if we use 4
  larger insns instead of 2 smaller ones, there might be an even more
  significant impact on performance because of code cache issues;

- threads: though the instrumentation code is definitely not
  thread-safe, and failing to expose the intermediate state should not
  have a significant impact, I wanted to at least be able to compare
  the insn sequence used in the existing instrumentation code with
  compiler-generated instrumentation, and measure any performance
  differences and actual impact on multi-threaded programs.

>> Without the asm, we issue load;add;adc;store, which is not the ideal
>> sequence with add and adc to the same memory address (or two different
>> addresses, if the last statement uses say *q instead of *p).

> Is doing two RMWs on memory faster?  Huh.

I haven't really measured it (yet).  I just assumed that whoever wrote
the asm code that way, seeming both knowledgeable and concerned about
the performance impact on a very hot piece of instrumentation code,
knew what they were doing; but when I set out to check it myself, I
hit this wall.

At first I thought GCC was picking the sequence it did because it was
faster, but then, while I investigated, I came to the conclusion that
it picked it because it didn't stand a chance of doing otherwise, not
even if I pushed it really hard.
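(For reference, here's the snippet above as a stand-alone function,
just to make the discussion concrete; the function name, the types,
and the use of a separate word *q for the carry, as in the two-address
variant mentioned above, are only for illustration.)

  /* Illustrative only: increment a two-word counter, hoping for an
     add-to-memory followed by an adc-to-memory.  */
  void
  bump_counter (unsigned long *p, unsigned long *q)
  {
    /* Add 1 to the low word and capture the carry out.  */
    __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
    /* Empty asm with a "+m" output, to make sure *p is written to
       memory, so that the add above can go straight to memory.  */
    asm ("" : "+m" (*p));
    /* Propagate the carry; ideally this becomes an adc to memory.  */
    *q += c;
  }

The sequence I'm after is the two-RMW add/adc-to-memory form, rather
than the load;add;adc;store we currently get without the asm.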
>> Alas, getting the first add to go straight to memory is more
>> complicated.  Even with the asm that forces the output to memory, the
>> output flag makes it harder to get it optimized to an add-to-memory
>> form.  When the output flag is unused, we optimize it enough in gimple
>> that TER does its job and we issue a single add, but that's not possible
>> when the two outputs of ADD_OVERFLOW are used: the flag setting gets
>> optimized away, but only after stopping combine from turning the
>> load/add/store into an add-to-memory.
>>
>> If we retried the 3-insn substitution after substituting the flag store
>> into the add for the adc,

> combine should retry every combination if any of the input insns to it
> have changed (put another way, if any insn is changed all combinations
> with it are tried anew).

Alas, the input insns don't change; it's an intervening flag-store
that does.  When we try the 3 insns that make up the RMW, while the
flag store is still there before the W, it triggers one of the
cant_combine tests.  After it's gone, we don't reconsider.  Does that
count as a bug to be filed?

> But.  Dependencies through memory are never used for combine (it uses
> dependencies through registers only), maybe that is what you are seeing?

No, it's just that the use of the flags between M and W prevents
substituting the M into the W, because then the use of the parallel
set result would end up before the M.  If there weren't a use of the
parallel flags set, we'd have combined the 3 RMW insns into 1 just
fine.

>> we might succeed, but only if we had a pattern
>> that matched add<mode>3_cc_overflow_1's parallel with the flag-setter as
>> the second element of the parallel, because that's where combine adds it
>> to the new i3 pattern, after splitting it out of i2.

> That sounds like the backend pattern has it wrong then?  There is a
> canonical order for this?

As far as I can tell, there isn't one; it's just an arbitrary choice
made by each backend, some doing it one way and some the other, and
that causes combine to be able to perform some combinations but not
others.

>> I suppose adding such patterns manually isn't the way to go.  I wonder
>> if getting recog_for_combine to recognize and reorder PARALLELs
>> appearing out of order would get too expensive, even if genrecog were to
>> generate optimized code to try alternate orders in parallels.

> Very big parallels are used, and trying all orderings would take just a
> little too much time ;-)

Oh, I wasn't talking about trying all permutations in
recog_for_combine, but about generating recog code, for combine to
use, that could recognize out-of-order top-level parallels, so that
they could be reordered into something regular recog would recognize.
I suppose this might involve computing permutations within genrecog...

Here's a hack just to get a clue about the extra time.
On amd64, we go from:

Shared 39391 out of 75775 states by creating 10001 new states, saving 29390

to:

Shared 45474 out of 86045 states by creating 11586 new states, saving 33888

For comparison, here's what we get permuting up to 8 items in a
parallel (MAX_PERMUTE bumped up to 8), without involving trailing
clobbers in the permutations:

Shared 66206 out of 127097 states by creating 16663 new states, saving 49543

This one also enables permutations among trailing clobbers, again up
to the 8th item in the parallel (PERMUTE_CLOBBERS 8):

Shared 71133 out of 140924 states by creating 17047 new states, saving 54086

And this one adds partial removal of clobbers, leaving up to 8 of them
for matching:

Shared 86720 out of 193474 states by creating 17483 new states, saving 69237

What this patchlet does *not* do is encode the permutations into
e.g. num_clobbers, so that we could apply them afterwards.  With a
32-bit int, we could encode permutations of MAX_PERMUTE <= 8 assuming
the original num_clobbers is always < 52, or <= 7 if num_clobbers
could go up to 415.

Not really tested, other than gathering the state counts above.  Does
it look promising?

diff --git a/gcc/genrecog.c b/gcc/genrecog.c
index 5b5b72f..6daef51 100644
--- a/gcc/genrecog.c
+++ b/gcc/genrecog.c
@@ -5295,20 +5295,15 @@ get_peephole2_pattern (md_rtx_info *info)
   return pattern;
 }
 
-/* Return true if *PATTERN_PTR is a PARALLEL in which at least one trailing
-   rtx can be added automatically by add_clobbers.  If so, update
-   *ACCEPTANCE_PTR so that its num_clobbers field contains the number
-   of such trailing rtxes and update *PATTERN_PTR so that it contains
-   the pattern without those rtxes.  */
+/* Return the number of trailing rtx in a PARALLEL that
+   can be added automatically by add_clobbers.  */
 
-static bool
-remove_clobbers (acceptance_type *acceptance_ptr, rtx *pattern_ptr)
+static int
+remove_clobbers (rtx pattern)
 {
   int i;
-  rtx new_pattern;
 
   /* Find the last non-clobber in the parallel.  */
-  rtx pattern = *pattern_ptr;
   for (i = XVECLEN (pattern, 0); i > 0; i--)
     {
       rtx x = XVECEXP (pattern, 0, i - 1);
@@ -5318,24 +5313,69 @@ remove_clobbers (acceptance_type *acceptance_ptr, rtx *pattern_ptr)
         break;
     }
 
-  if (i == XVECLEN (pattern, 0))
-    return false;
+  return XVECLEN (pattern, 0) - i;
+}
 
-  /* Build a similar insn without the clobbers.  */
-  if (i == 1)
-    new_pattern = XVECEXP (pattern, 0, 0);
-  else
+/* Match a PATTERN with a top-level PARALLEL, with permutations
+   starting at POS and up to MAX_PERMUTE.  CLOB is the trailing
+   clobbers count (returned by remove_clobbers).  Only up to
+   PERMUTE_CLOBBERS undergo permutation.  PARTIAL_CLOBBERS, if
+   positive, enables matching of patterns with partial removal of
+   trailing clobbers, so that up to (possibly permuted)
+   PARTIAL_CLOBBERS remain.  */
+
+static void
+match_parallel (state *s, md_rtx_info *info, rtx pattern,
+                acceptance_type acceptance, int pos, int clob)
+{
+
+#define MAX_PERMUTE 3
+#define PERMUTE_CLOBBERS 0
+#define PARTIAL_CLOBBERS 0
+
+  int len = XVECLEN (pattern, 0);
+  int len_clob = len - clob;
+
+  if (pos == len)
     {
-      new_pattern = rtx_alloc (PARALLEL);
-      XVEC (new_pattern, 0) = rtvec_alloc (i);
-      for (int j = 0; j < i; ++j)
-        XVECEXP (new_pattern, 0, j) = XVECEXP (pattern, 0, j);
+      match_pattern (s, info, pattern, acceptance);
+      return;
     }
+
+  match_parallel (s, info, pattern, acceptance, pos + 1, clob);
 
-  /* Recognize it.  */
-  acceptance_ptr->u.full.u.num_clobbers = XVECLEN (pattern, 0) - i;
-  *pattern_ptr = new_pattern;
-  return true;
+  if (pos < len_clob)
+    {
+      for (int i = pos + 1; i < len_clob && i < MAX_PERMUTE; i++)
+        {
+          std::swap (XVECEXP (pattern, 0, pos), XVECEXP (pattern, 0, i));
+          match_parallel (s, info, pattern, acceptance, pos + 1, clob);
+          std::swap (XVECEXP (pattern, 0, pos), XVECEXP (pattern, 0, i));
+        }
+    }
+  else
+    {
+      for (int i = pos + 1; i < len && i < MAX_PERMUTE - 1
+             && i < len_clob + PERMUTE_CLOBBERS; i++)
+        {
+          std::swap (XVECEXP (pattern, 0, pos), XVECEXP (pattern, 0, i));
+          match_parallel (s, info, pattern, acceptance, pos + 1, clob);
+          std::swap (XVECEXP (pattern, 0, pos), XVECEXP (pattern, 0, i));
+        }
+
+      if (pos <= len_clob + PARTIAL_CLOBBERS)
+        {
+          acceptance.u.full.u.num_clobbers = len - pos;
+          if (pos == 1)
+            match_pattern (s, info, XVECEXP (pattern, 0, 0), acceptance);
+          else
+            {
+              XVECLEN (pattern, 0) = pos;
+              match_pattern (s, info, pattern, acceptance);
+              XVECLEN (pattern, 0) = len;
+            }
+        }
+    }
 }
 
 int
@@ -5371,13 +5411,15 @@ main (int argc, const char **argv)
             acceptance.u.full.u.num_clobbers = 0;
             pattern = add_implicit_parallel (XVEC (def, 1));
             validate_pattern (pattern, &info, NULL_RTX, 0);
-            match_pattern (&insn_root, &info, pattern, acceptance);
 
             /* If the pattern is a PARALLEL with trailing CLOBBERs,
                allow recog_for_combine to match without the clobbers.  */
-            if (GET_CODE (pattern) == PARALLEL
-                && remove_clobbers (&acceptance, &pattern))
+            if (GET_CODE (pattern) != PARALLEL)
               match_pattern (&insn_root, &info, pattern, acceptance);
+            else
+              match_parallel (&insn_root, &info, pattern, acceptance,
+                              0, remove_clobbers (pattern));
+
             break;
           }
 

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer