Hi Segher,
On 2023/5/25 19:22, Segher Boessenkool wrote:
> Hi!
>
> On Thu, May 25, 2023 at 07:05:55AM -0300, Alexandre Oliva wrote:
>> On May 25, 2023, "Kewen.Lin" <[email protected]> wrote:
>>> So both lp64 and ilp32 have the same count, could we merge it and
>>> remove the selectors?
>>
>> We could, but... I thought I wouldn't, since they were different
>> before, and they're likely to diverge again in the future. I thought
>> that combining them might suggest that they ought to be the same, when
>> we already know that this is not the case.
>>
>> I'll prepare an alternate patch that combines them.
>
> Fwiw, updating the insn counts blindly like this has very small value on
> the one hand, and negative value on the other. In total, negative
> value.
>
> If it is not possible to keep these tests up-to-date easily the test
> should be improved. If tests regressed otoh we should ***not*** paper
> over that with patches like this, but investigate what happened instead:
> such regressions are *real*.
>
> So which is it here? I am assuming it is a not-too-well written testcase
> without all the necessary noipa attrs, and/or putting more than one
> thing to test per function directly. Insn counts then shift easily if
> the compiler decides to factor (CSE etc.) your code differently, but
> that is a testcase artifact then, not something we want to adjust counts
> for all of the time.
>
> It is feasible to do these insn count things only for trivial tiny
> snippets. Everything bigger will regress all of the time, no one will
> look at it properly, and instead people will just do blind "update
> counts" patches like this :-/ *Good* insn count tests are quite
> valuable, but harder to write. But maintenance costs noticeably bigger
> than zero for a testcase are not good, how many testcases do we run in
> the testsuite?
>
> So, can we fix the underlying problem here please?
Thanks for all the comments, and good question.
I looked into this issue and found that the current counts for 32-bit
are mainly for AIX (no updates are needed there). There are some
differences in the generated assembly between 32-bit AIX and 32-bit
Linux, which seem to be related to whether or not the compiler saves
the frame pointer.
Take the function testbc_var from fold-vec-extract-char.p7.c as an example:
#include <altivec.h>
unsigned char
testbc_var (vector bool char vbc2, signed int si)
{
  return vec_extract (vbc2, si);
}
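For a variable index, vec_extract is open-coded: the vector is stored
into a stack temporary and the selected byte is loaded back, with the
index masked to the vector length. A rough C equivalent of what the
assembly below does (ignoring endianness details; the name and the
union are only for illustration):

unsigned char
testbc_var_equiv (vector bool char vbc2, signed int si)
{
  /* Store the vector into a 16-byte stack temporary, then load the
     selected byte; "si & 15" matches the rlwinm/rldicl masking seen
     in the assembly below.  */
  union { vector bool char v; unsigned char b[16]; } u;
  u.v = vbc2;
  return u.b[si & 15];
}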
1) On 32-bit AIX, with trunk:
.testbc_var:
li 9,32
addi 10,1,-64
stxvw4x 34,10,9
rlwinm 3,3,0,28,31
addi 9,3,-64 // these two lines
add 3,9,1 // can be combined, see below.
lbz 3,32(3)
blr
With the old GCC (without r11-6615):
.testbc_var:
addi 10,1,-64
li 9,32
stxvw4x 34,10,9
rlwinm 3,3,0,28,31
add 3,10,3 // better
lbz 3,32(3)
blr
Apparently an extra unnecessary addi is created. The test case expects
one addi to adjust the stack for a temporary slot and one add to apply
the extracted byte's index.
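To make the redundancy concrete: with r10 = r1 - 64 already set up for
the stxvw4x, trunk's two instructions compute

  addi 9,3,-64   // r9 = r3 - 64
  add 3,9,1      // r3 = r9 + r1 = r3 + (r1 - 64) = r3 + r10

which is exactly what the old code's single

  add 3,10,3     // r3 = r10 + r3

already produced.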
2) The same thing happens on 64-bit AIX and 64-bit Linux:
trunk:
.testbc_var:
li 9,48
addi 10,1,-64
stxvw4x 34,10,9
rldicl 5,5,0,60
addi 9,5,-64 // similar to aix 32-bit
add 5,9,1 // ....
lbz 3,48(5)
blr
vs. the old GCC (without r11-6615):
.testbc_var:
addi 10,1,-64
li 9,48
stxvw4x 34,10,9
rldicl 5,5,0,60
add 5,10,5 // better
lbz 3,48(5)
blr
3) But on 32-bit Linux, trunk and the old GCC (without r11-6615)
generate the same code:
testbc_var:
stwu 1,-48(1)
li 9,16
rlwinm 3,3,0,28,31
stxvw4x 34,1,9
add 3,1,3
lbz 3,16(3)
addi 1,1,48
blr
Here the frame is allocated directly with stwu and the temporary is
addressed from r1, so one fewer addi is needed. So the expected count,
adjusted for 32-bit AIX, broke 32-bit Linux.
As shown above, the behavior change (one more addi) on 64-bit and on
32-bit AIX results in code that is less optimal than before, but we
updated the counts previously, so I changed PR101169's component to
rtl-optimization for further investigation. From 3), what Alexandre
proposed for 32-bit Linux actually just restores the expected count
to what it was before. :)
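
As an aside, a trivially small insn-count test of the kind you
describe might have the following shape (just a sketch: the function,
mnemonic and count are illustrative, and noipa keeps IPA transforms
from perturbing the count):

/* { dg-do compile } */
/* { dg-options "-O2" } */

__attribute__ ((noipa)) long
add_const (long x)
{
  return x + 16;
}

/* { dg-final { scan-assembler-times {\maddi\M} 1 } } */

And once the lp64 and ilp32 counts really agree, the
{ target lp64 } / { target ilp32 } variants of such a line can be
merged into a single unconditional one, as in Alexandre's alternate
patch.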
BR,
Kewen