Hello, Paul,

On Jan 13, 2023, Paul Koning <paulkon...@comcast.net> wrote:

>> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:

>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment),

> I thought machinery like the memcpy patterns have as one of their
> benefits the ability to find the alignment of their operands and from
> that optimize things.  So I don't understand why you'd say
> "conservative".

Though memcpy implementations normally do that indeed, dynamically
increasing dest alignment has such an impact on code size that *inline*
memcpy doesn't normally do that.  try_store_by_multiple_pieces,
specifically, is potentially branch-heavy to begin with, and bumping
alignment up could double the inline expansion size.  So what it does is
to take the conservative dest alignment estimate from the compiler and
use it.

By adding leading loops to try_store_by_multiple_pieces (as does the
proposed patch, with its option enabled) we may expand an
unknown-length, unknown-alignment memset to something conceptually like
(cims is short for constant-sized inlined memset):

  while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

With dynamic alignment bumps under a trivial extension of the current
logic, it would become (cimsN is short for cims with dest known to be
aligned to an N-byte boundary):

  if (len >= 2 && (dest & 1)) { len -= 1; cims(dest, c, 1); dest += 1; }
  if (len >= 4 && (dest & 2)) { len -= 2; cims2(dest, c, 2); dest += 2; }
  if (len >= 8 && (dest & 4)) { len -= 4; cims4(dest, c, 4); dest += 4; }
  if (len >= 16 && (dest & 8)) { len -= 8; cims8(dest, c, 8); dest += 8; }
  if (len >= 32 && (dest & 16)) { len -= 16; cims16(dest, c, 16); dest += 16; }
  if (len >= 64 && (dest & 32)) { len -= 32; cims32(dest, c, 32); dest += 32; }
  while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims32(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims16(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims4(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims2(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

Now, by using more loops instead of going through every power of two,
we could shorten (for -Os) the former to e.g.:

  while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  while (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

and we could similarly add more compact logic for dynamic alignment:

  if (len >= 8) {
    while (dest & 7) { len -= 1; cims(dest, c, 1); dest += 1; }
    if (len >= 64)
      while (dest & 56) { len -= 8; cims8(dest, c, 8); dest += 8; }
    while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
    while (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
  }
  while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

Now, given that improving performance was never a goal of this change,
and the expansion it optionally offers is desirable even when it slows
things down, just making it a simple loop at the known alignment would
do.  The remainder sort of flowed out of the way
try_store_by_multiple_pieces was structured, and I found it sort of made
sense to start with the largest-reasonable block loop, and then end with
whatever try_store_by_multiple_pieces would have expanded a
known-shorter but variable-length memset to.  And this is how I got to
it.
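
For illustration only, two of the conceptual expansions above (the first one, and the compact dynamic-alignment one) can be modelled in plain C roughly as follows.  This is a sketch under stated assumptions, not what GCC emits: cims is modelled as a hypothetical macro that calls memset with a constant length, the cimsN variants are folded into it since the alignment knowledge cannot be expressed here, and model_plain, model_aligned and check_models are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for "cims": memset with a constant length,
   which compilers can typically expand inline.  */
#define cims(dest, c, n) memset ((dest), (c), (n))

/* Model of the first expansion: a 64-byte block loop, then one
   conditional store per smaller power of two.  */
static void
model_plain (unsigned char *dest, int c, size_t len)
{
  while (len >= 64) { len -= 64; cims (dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims (dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims (dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims (dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims (dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims (dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims (dest, c, 1); dest += 1; }
}

/* Model of the compact dynamic-alignment variant: byte stores up to
   an 8-byte boundary, 8-byte stores up to a 64-byte boundary, then
   the block loops and a trailing byte loop.  */
static void
model_aligned (unsigned char *dest, int c, size_t len)
{
  if (len >= 8)
    {
      while ((uintptr_t) dest & 7)
        { len -= 1; cims (dest, c, 1); dest += 1; }
      if (len >= 64)
        while ((uintptr_t) dest & 56)
          { len -= 8; cims (dest, c, 8); dest += 8; }
      while (len >= 64) { len -= 64; cims (dest, c, 64); dest += 64; }
      while (len >= 8) { len -= 8; cims (dest, c, 8); dest += 8; }
    }
  while (len >= 1) { len -= 1; cims (dest, c, 1); dest += 1; }
}

/* Compare both models against memset for every length up to MAXLEN
   at eight consecutive starting offsets; guard bytes on both sides
   catch out-of-range stores.  Returns the number of cases checked.  */
static int
check_models (void)
{
  enum { PAD = 8, MAXLEN = 200 };
  static unsigned char want[PAD + 8 + MAXLEN + PAD], got[sizeof want];
  int cases = 0;

  for (size_t off = 0; off < 8; off++)
    for (size_t len = 0; len <= MAXLEN; len++)
      {
        memset (want, 0x11, sizeof want);
        memset (want + PAD + off, 0xAB, len);

        memset (got, 0x11, sizeof got);
        model_plain (got + PAD + off, 0xAB, len);
        assert (memcmp (want, got, sizeof want) == 0);

        memset (got, 0x11, sizeof got);
        model_aligned (got + PAD + off, 0xAB, len);
        assert (memcmp (want, got, sizeof want) == 0);
        cases++;
      }
  return cases;
}
```

Calling check_models () from a test driver exercises both models over all tested lengths and offsets.  The model only checks that the control flow stores exactly len bytes; it says nothing about the code size or speed of a real inline expansion.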

I'm not sure it makes any sense to try to change things further to
satisfy other competing goals such as performance or code size.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>