On Wed, Nov 11, 2020 at 01:43:00PM -0800, Jim Wilson wrote: > > On Wed, Nov 11, 2020 at 11:43:34AM +0100, Philipp Tomsich wrote: > > > The patch addresses this by disallowing that rule, if an exact > > power-of-2 is > > > seen as C1. The reason why I would prefer to have this canonicalised the > > > same way the (X & C1) * C2 is canonicalised, is that cleaning this up > > during > > > combine is more difficult on some architectures that require multiple > > insns > > > to represent the shifted constant (i.e. C1 << C2). > > > > As I said, it is better to decide which one is better before or during > > expansion based on target costs, sure, combine can't catch everything. > > > > it could be fixed in combine if we allowed 4 instructions to split into 3. > We allow combinations of 4 insns, but we only allow splits into 2 insns as > far as I know.
If combiner can do it, good, but I think it would be still helpful for many targets to decide it during expansion, we have TER and so on expansion of BIT_AND_EXPR with a constant second operand we can check if the other one is a left shift and then based on the exact mask decide if it is useful to try different possibilities and ask for their costs (try (x & c1) << c2, (x << c3) & c4 and perhaps (x << c5) >> c6). We already have other cases where during expansion we try multiple things and compare their costs. E.g. for division and modulo, if we know that the most significant bit is clear on both of the operands, we can expand as both signed or unsigned division/modulo and so we try both and pick the one which is less costly on the target. Similarly, I think we could do it for right shifts, if we know the most significant bit is clear, we can expand it as both arithmetic and logical right shift, so we can ask the target what it prefers cost-wise. E.g. on x86_64 unsigned long long foo (unsigned long long x) { return (x << 47) >> 17; } unsigned long long bar (unsigned long long x) { return (x & 0x1ffff) << 30; } unsigned long long baz (unsigned long long x) { unsigned long long y; __asm ("" : "=r" (y) : "0" (x & 0x1ffff)); return y << 30; } unsigned long long qux (unsigned long long x) { return (x << 30) & (0x1ffffULL << 30); } foo is 4 instructions 12 bytes, baz is 4 instructions 13 bytes, bar/qux are 5 instructions 21 bytes, which of foo or baz is faster would need to be benchmarked and no idea what the rtx costs would say, but bar/qux certainly looks more costly and I'm quite sure the rtx costs would say that too. The reason for the match.pd canonicalization is I bet it wants to put possible multiple shifts adjacent and similarly with the masks, so that when one uses (((x & c1) << c2) & c3) << c4 etc. one can simplify that into just one shift and one masking; and having the canonicalization do it one way for some constants/shift pairs and another way for others wouldn't achieve that. Jakub