On Thu, Feb 4, 2021 at 11:18 PM Alexandre Oliva <ol...@adacore.com> wrote:
>
> On Feb 4, 2021, Richard Biener <richard.guent...@gmail.com> wrote:
>
> >> > b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>
> >> Why would that be better than keeping the constant-length memset call,
> >> that would be turned into an unrolled loop during expand?
>
> > Well, because of the possibly lost ctz and alignment info.
>
> Funny you should mention that. I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available. Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.
>
> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces. We may get larger code
> this way (no loops), but only O(log(len)) compares.
>
> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.
>
> >> Uhh, thanks, but... you realize nearly all of the gimple-building code
> >> is one and the same for the loop and for trailing count misalignment?
>
> > Sorry, the code lacked comments and so I didn't actually try deciphering
> > the code you generate ;)
>
> Oh, come on, it was plainly obscure ;-D
>
> Sorry for posting an early draft before polishing it up.
>
> > The original motivation was really that esp. for small trip count loops
> > the target knows best how to implement them.
> > Now, that completely
> > fails of course in case the target doesn't implement any of this or
> > the generic code fails because we lost ctz and alignment info.
>
> In our case, generic code fails because it won't handle variable-sized
> clear-by-pieces. But then, I found out, when it's fixed-size, it also
> makes the code worse, because it seems to expand to byte stores even
> when the store-to object is known to have wider alignment:
>
> union u {
>   long long i;
>   char c[8];
> } x[8];
> int s(union u *p, int k) {
>   for (int i = k ? 0 : 3; i < 8; i++) {
>     for (int j = 0; j < 8; j++) {
>       p[i].c[j] = 0;
>     } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte
>       // stores
>   }
> }

On x86_64 I see two DImode stores generated, but that appears to be done
via setmem.

> >> > I think the builtins with alignment and calloc-style element count
> >> > will be useful on their own.
> >>
> >> Oh, I see, you're suggesting actual separate builtin functions. Uhh...
> >> I'm not sure I want to go there. I'd much rather recover the ctz of the
> >> length, and use it in existing code.
>
> > Yeah, but when we generate memcpy there might not be a way to
> > store the ctz info until RTL expansion where the magic should really happen
> > ...
>
> True. It can be recovered without much difficulty in the cases I've
> looked at, but it could be lost in others.
>
> > So I'd say go for improving RTL expansion.
>
> 'k, thanks
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
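
For readers following the thread, the expand-time strategy quoted above (one guarded, constant-length store per bit of the variable length) can be sketched in plain C. This is only an illustration of the idea, not the patch under discussion: `clear_var_len` and `MAX_LEN` are hypothetical names, `memset` with a power-of-two size stands in for a constant-length store-by-pieces, and the `for` loop stands in for what the expander would emit fully unrolled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on the variable length, assumed to be known
   to the compiler at expand time.  */
#define MAX_LEN 63u

/* Clear LEN bytes at P, assuming LEN <= MAX_LEN.  */
void clear_var_len(unsigned char *p, size_t len)
{
    assert(len <= MAX_LEN);

    /* Largest power of two not exceeding MAX_LEN.  */
    size_t blk = 1;
    while (blk <= MAX_LEN / 2)
        blk *= 2;

    /* One compare per bit, from the highest block size down to 1.  In
       the expander each iteration would be emitted as straight-line
       code, so BLK is a compile-time constant in every store.  */
    for (; blk != 0; blk /= 2) {
        if (len >= blk) {
            memset(p, 0, blk);  /* stand-in for store-by-pieces */
            p += blk;
            len -= blk;
        }
    }
}
```

With `MAX_LEN` at 63 this performs six compares (for 32, 16, 8, 4, 2 and 1) whatever the length, which is the O(log(len)) behaviour mentioned in the quoted text. The sketch says nothing about alignment: how wide each store can be still depends on the alignment and ctz information discussed in the thread.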