On Wed, Feb 3, 2021 at 4:11 PM Alexandre Oliva <ol...@adacore.com> wrote:
>
> On Feb 3, 2021, Richard Biener <richard.guent...@gmail.com> wrote:
>
> > So I think we should try to match what __builtin_memcpy/memset
> > expansion would do here, taking advantage of extra alignment
> > and size knowledge.  In particular,
>
> > a) if __builtin_memcpy/memset would use setmem/cpymem optabs
> > see if we can have variants of memcpy/memset transferring alignment
> > and size knowledge
>
> We could add more optional parameters to them to convey the length's
> known ctz.  However, the ctz can't be recovered reliably; we can't
> even recover it after gimplifying the length within ldist!
>
> That said, my other patch already enables ctz calls to recover it, at
> least in the libgcc RISC-V TFmode cases, and it's possible it's
> readily available in other cases.  I'd rather leave it to whoever
> deals with the machine-specific patterns to figure out whether a
> separate argument would be useful.  RISC-V, which is what I'm dealing
> with, doesn't have much to offer as far as these patterns are
> concerned.
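For concreteness, here is a minimal C sketch (not from the thread
itself) of the kind of loop under discussion; the function name and the
LP64 assumption are illustrative only:

  #include <stddef.h>

  /* Loop distribution turns this loop into
     memset (p, 0, n * sizeof (long)).  On an LP64 target the length
     is n * 8, so its ctz is known to be at least 3, matching the
     pointer's alignment; but once the multiplication is gimplified
     into a shift, or folded with other arithmetic, that trailing-zero
     count has to be rediscovered from the IL.  */
  void
  clear_words (long *p, size_t n)
  {
    for (size_t i = 0; i < n; i++)
      p[i] = 0;
  }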
> > b) if expansion would use BY_PIECES then expand to an unrolled loop
>
> Why would that be better than keeping the constant-length memset
> call, which would be turned into an unrolled loop during expand?

Well, because of the possibly lost ctz and alignment info.  There's
also the eventual benefit of eliding the source (for memcpy) when the
variable accesses in the original loop, or the address-taken for the
memcpy/memset, are the only reasons it is not elided.  Of course that's
then again a general GIMPLE memcpy/memset inlining deficit (we only
"inline" it if it's possible to express the operation with one read and
one write).

> > c) if expansion would emit a memset/memcpy call but we know
> > alignment and have a low bound on niters emit a loop (like your
> > patch does)
>
> > a) might be difficult but adding the builtin variants may pay off
> > in any case.
>
> *nod*
>
> > The patch itself could benefit from one or two helpers we already
> > have, first of all there's create_empty_loop_on_edge (so you don't
> > need the loop fixup)
>
> Uhh, thanks, but... you realize nearly all of the gimple-building
> code is one and the same for the loop and for trailing count
> misalignment?

Sorry, the code lacked comments, so I didn't actually try deciphering
the code you generate ;)

> There doesn't seem to be a lot of benefit to using this primitive,
> aside from its dealing with creating the loop data structure which,
> again, I'd only want to do in the loop case.
>
> (I guess I should add more comments as to the inline expansion
> strategy.  It's equivalent to:
>
>   i = len, ptr = base, blksz = 1 << alctz;
>   while (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>   blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>   blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>   ... until blksz gets down to zero or to 1 << szctz

Oh, I see ... fancy.  Maybe a little too fancy ;)  I'd have emitted a
loop storing 1<<szctz pieces, but then that would likely be simply the
original loop in most cases.

Which, for loop distribution, hints at an alternative implementation:
simply guess whether a generated memcpy/memset would be inlined by RTL
expansion, and in that case _not_ generate a builtin but mark the
partitioned loop for GIMPLE complete unrolling.  That might not be
optimal, and it relies on later store-merging to form larger writes,
of course.

> > Note that for memmove, if we know the dependence direction, we
> > can also emit a loop / unrolled code.
>
> *nod*, but the logic would have to be quite different, using bit
> tests, and odds are we won't know the direction and will have to
> output a test and code for both possibilities, which would be quite
> unlikely to be beneficial.  Though the original code would quite
> likely make the direction visible; perhaps if the size is small
> enough that we would expand a memcpy inline, we should refrain from
> transforming the loop into a memmove call.
>
> In the case at hand, there's no benefit at all to these
> transformations: we start from a loop with the known alignment and a
> small loop count (0 to 2 words copied), and if everything goes all
> right with the transformation, we may be lucky to get back to that.
> It's not like the transformation could even increase the known
> alignment, so why bother, and throw away debug info by rewriting the
> loop into the same code minus debug?

The original motivation was really that, especially for small trip
count loops, the target knows best how to implement them.  Now, that
completely fails of course when the target doesn't implement any of
this, or the generic code fails because we lost the ctz and alignment
info.
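To make the quoted expansion strategy concrete, here is a
self-contained C rendering of the equivalent code for memset, assuming
alctz is the ctz of the block alignment and szctz the known ctz of the
length; the function name and the buffer-plus-memcpy stand-in for the
wide typed stores are illustrative only, not what the patch emits:

  #include <assert.h>
  #include <stddef.h>
  #include <string.h>

  /* Sketch: store the largest aligned block in a loop, then emit one
     conditional store per halved block size for the trailing count.
     A real expansion would use typed stores of decreasing machine
     modes; memcpy from a pre-filled buffer stands in for that.  */
  void
  expand_memset_sketch (void *base, unsigned char val, size_t len,
                        unsigned alctz, unsigned szctz)
  {
    unsigned char blk[64];               /* assumes alctz <= 6 */
    assert (((size_t) 1 << alctz) <= sizeof blk && szctz <= alctz);
    memset (blk, val, sizeof blk);       /* the replicated byte */

    unsigned char *ptr = base;
    size_t i = len;
    size_t blksz = (size_t) 1 << alctz;

    /* Main loop: full aligned blocks.  */
    while (i >= blksz)
      {
        memcpy (ptr, blk, blksz);
        i -= blksz;
        ptr += blksz;
      }

    /* Trailing count: halve blksz down to 1 << szctz, or to zero
       when szctz is 0, exactly as in the pseudocode above.  */
    for (blksz >>= 1; blksz >= ((size_t) 1 << szctz); blksz >>= 1)
      if (i >= blksz)
        {
          memcpy (ptr, blk, blksz);
          i -= blksz;
          ptr += blksz;
        }
  }

E.g. with alctz == 3 and szctz == 0, a 13-byte set becomes one 8-byte
store from the loop plus conditional 4- and 1-byte stores for the
tail.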
> > I think the builtins with alignment and calloc-style element count
> > will be useful on their own.
>
> Oh, I see, you're suggesting actual separate builtin functions.
> Uhh... I'm not sure I want to go there.  I'd much rather recover the
> ctz of the length and use it in existing code.

Yeah, but when we generate memcpy there might not be a way to store
the ctz info until RTL expansion, where the magic should really
happen ...

> I'd also prefer if the generic memset (and memcpy and memmove?)
> builtin expanders dealt with non-constant lengths in the way I
> implemented.

*nod*

> That feels like the right spot for it.  That deprives us of gimple
> loop optimizations in the inlined loop generated by the current
> patch, but if we expand an unrolled loop with compares and offsets
> with small constants, loop optimizations might not even be relevant.
>
> FWIW, the patch I posted yesterday is broken; the regstrap test did
> not even build libstdc++-v3 successfully.  I'm not sure whether to
> pursue it further or to reimplement it in the expander.  Suggestions
> are welcome.

So I think we agree that ideally RTL expansion should be the place to
recover the loop.  It remains to be seen how to most easily transfer
the alignment and ctz info there.  I guess alternatively one could
indeed make the point that _not_ doing the memcpy/memset replacement
is likely less surprising than trying to do fancy inlining in loop
distribution as compared to RTL expansion.

So I'd say go for improving RTL expansion.  If that for some reason
fails, then anticipating the RTL-expansion failure cases in loop
distribution and not generating the builtin there (but emitting the
loop as it would be distributed) is probably the best and simplest
thing.

We might consider doing RTL-like *_BY_PIECES expansion of builtins on
GIMPLE as well, not at random places through fold_stmt, but at a
select point in the pass pipeline.

Thanks,
Richard.

> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
>    Free Software Activist                   GNU Toolchain Engineer
> Vim, Vi, Voltei pro Emacs -- GNUlius Caesar