Kewen:

On Fri, 2023-06-30 at 11:37 +0800, Kewen.Lin wrote:
> Hi Carl,
>
> on 2023/6/30 05:36, Carl Love wrote:
> > Kewen:
> >
> > On Wed, 2023-06-28 at 16:35 +0800, Kewen.Lin wrote:
> > > > Yea, I was going with a runnable test and didn't include the
> > > > instruction counts.  Added back in.  Rather than doing by processor
> > > > version (P8, P9, P10) I was able to do it by BE/LE.  The instruction
> > > > counts were the same for LE across processor versions but there are
> > > > a few instruction counts that vary with BE and LE.
> > >
> > > But the original test case only checks for cpu-types (processor
> > > version) but not for endianness, it means for the bif usages, there
> > > should not be differences for endianness.  Why does this change with
> > > your new test case?
> > > Could you have a further look and make it consistent with some
> > > adjustment if possible?  As we know, checking insn counts is
> > > sometimes fragile, so I think we should try our best to make it as
> > > robust as possible in the first place.
> > >
> > > Besides, the original case also has some differences between p7/p8
> > > and p9.
> > >
> >
> > There are differences on P8 LE versus BE.
> > I did a diff between the P8 and P9 tests:
> >
> >    diff vsx-vector-6.p8.c vsx-vector-6.p9.c
> >    3,4c3,4
> >    < /* { dg-require-effective-target powerpc_p8vector_ok } */
> >    < /* { dg-options "-O2 -mdejagnu-cpu=power8" } */
> >    ---
> >    > /* { dg-require-effective-target powerpc_p9vector_ok } */
> >    > /* { dg-options "-O2 -mdejagnu-cpu=power9" } */
> >    12c12
> >    < /* { dg-final { scan-assembler-times {\mvperm\M} 1 } } */
> >    ---
> >    > /* { dg-final { scan-assembler-times {\m(?:v|xx)permr?\M} 1 } } */
> >    23d22
> >    < /* { dg-final { scan-assembler-times {\mxvmsub[am]dp\M} 1 } } */
> >    37c36
> >    < /* { dg-final { scan-assembler-times {\mxvsubdp\M} 1 } } */
> >    ---
> >    > /* { dg-final { scan-assembler-times {\mxvmsub[am]dp\M} 1 } } */
> >
> > So we can see the vperm, vpermr, xxpermr, xvmsubadp, xvmsubmdp,
> > xvsubdp, xvmsubadp, xvmsubmdp instruction count checks are different
> > between the two architectures.  I then wrote a script to compile the
> > CPU specific test on Power 8, Power 9 and Power 10 architectures and
> > then grep for the above list of instructions.  If I run the script on
> > P8 BE and LE I get:
> >
> >              Power 8 BE    Power 8 LE  Power 9 LE  Power 9 BE  Power 10 LE*
> >              (makalu-lp1)  (genoa)     (marlin)    (nilram)    (ltcd97-lp3)
> > instruction  count         count       count       count       count
> > vperm        1             1           0           0           0
> > vpermr       0             0           0           0           0
> > xxpermr      0             0           1           0           1
> > xvmsubadp    1             0           1           1           1
> > xvmsubmdp    0             1           0           0           0
> > xvsubdp      1             1           1           1           1
> >
>
> Thanks for looking into this and making these statistics.
>
> Is there a typo for column nilram?  Otherwise, the below insn check
>
> /* { dg-final { scan-assembler-times {\m(?:v|xx)permr?\M} 1 } } */
>
> would fail there.

Yes, there is a typo in the nilram column.  The test generates a vperm
instruction.

   #if defined (__BIG_ENDIAN__) || defined (_ARCH_PWR9)
     dst[8].d = vec_perm (src0[8].d, src1[8].d, src2[8].uc);
        f74:   e9 3f 00 78     ld      r9,120(r31)
        f78:   39 29 07 00     addi    r9,r9,1792
        f7c:   f5 89 00 01     lxv     vs12,0(r9)
        f80:   e9 3f 00 80     ld      r9,128(r31)
        f84:   39 29 07 00     addi    r9,r9,1792
        f88:   f4 09 00 01     lxv     vs0,0(r9)
        f8c:   e9 3f 00 88     ld      r9,136(r31)
        f90:   39 29 07 00     addi    r9,r9,1792
        f94:   f4 09 00 89     lxv     vs32,128(r9)
        f98:   e9 3f 00 70     ld      r9,112(r31)
        f9c:   39 29 07 00     addi    r9,r9,1792
        fa0:   f0 2c 64 91     xxmr    vs33,vs12
        fa4:   f1 a0 04 91     xxmr    vs45,vs0
        fa8:   10 01 68 2b     vperm   v0,v1,v13,v0
   ...

> <snip>
> >
> > I had played with putting -Wno-inline on the command line but that
> > didn't seem to make any difference.  However, your suggestion of
> > __attribute__ ((noipa)) does prevent the inlining and we don't get the
> > second copy of the instructions showing up.  The inlining eliminated
> > the LE/BE differences for xvmaxsp, xvminsp and xvmaxdp.
>
> -Winline is an option for warning: "Warn if a function that is declared
> as inline cannot be inlined.", I think what you wanted is -fno-inline,
> and it's good to know noipa helps here.

Yea, my bad.  Didn't read the manual very carefully.

> > The instruction count test for xxlor in vsx-vector-6-func-2lop.c
> > differs on LE and BE vsx-vector-6-func-2op.c.  I believe the
> > instruction is used with loads to reorder the data.  I don't see any
> > way to get around the extra xxlor instructions and verify the vec_or
> > builtin test generates the instruction.
> >
>
> OK, I'm still curious how the loads cause the difference.

Yea, looks like there is something screwy going on.

So, I started by running the test:

   make -j 1 && make check-gcc RUNTESTFLAGS="-v -v powerpc.exp=vsx-vector-6-func-2lop.c" > out

on Makalu, P8 BE, and verified the test gives 7 passes and no failures.

On genoa, P8 LE, I also verified the test gives 7 passes and no failures.

Then I went in and intentionally changed the expected counts down by one
for each platform.  The idea was to verify that the dg-final
scan-assembler-times {\mxxlor\M} check was actually being run.

On Makalu, I now get an error, as expected:

   check_cached_effective_target be: returning 1 for unix
   is-effective-target: be 1          <<<< NOTE BE
   gcc.target/powerpc/vsx-vector-6-func-2lop.c: \\mxxlor\\M found 32 times
   FAIL: gcc.target/powerpc/vsx-vector-6-func-2lop.c scan-assembler-times \\mxxlor\\M 31

On Genoa, I now get the error, as expected:

   check_cached_effective_target le: returning 1 for unix
   is-effective-target: le 1
   gcc.target/powerpc/vsx-vector-6-func-2lop.c: \\mxxlor\\M found 22 times
   FAIL: gcc.target/powerpc/vsx-vector-6-func-2lop.c scan-assembler-times \\mxxlor\\M 21

So, running the tests, gcc definitely thinks there should be 22 xxlor on
LE and 32 on BE.

So, I went to look at the assembly to verify my comment on the difference
being related to the loads.  I decided to actually count the instructions
just to verify the number in the assembly files.  Before, I just looked
at the assembly briefly but didn't dig in very deep.

If I compile the tests and dump the assembly with:

   gcc -g -mcpu=power8 -o vsx-vector-6-func-2lop vsx-vector-6-func-2lop.c
   objdump -S -d vsx-vector-6-func-2lop > vsx-vector-6-func-2lop.dump

   grep xxlor vsx-vector-6-func-2lop.dump | wc
         4      28     192

So we see 4 xxlor instructions, not 32 as expected for BE or 22 as
expected for LE as the test claims.  I get the same count of 4 on both
makalu and on genoa.  I like this approach because you can easily see
the relationship of the source and assembly.  So, there seems to be
something screwy here as that is not even close to what the make
script/scan-assembler thinks the counts should be.

Segher never liked the above way of looking at the assembly.
He prefers:

   gcc -S -g -mcpu=power8 -o vsx-vector-6-func-2lop.s vsx-vector-6-func-2lop.c

   grep xxlor vsx-vector-6-func-2lop.s | wc
        34      68     516

So, again, I get the same count of 34 on both makalu and genoa.  But
again, that doesn't agree with what the make script/scan-assembler
thinks the counts should be.

When I looked at the vsx-vector-6-func-2lop.s I see on BE:

   ....
   lxvd2x 0,10,9
   xxlor 0,12,0
   xxlnor 0,0,0
   ...

I was guessing that it was adjusting the data layout from the load.
But looking again more carefully versus LE:

   ....
   lxvd2x 0,31,9
   xxpermdi 0,0,0,2
   xxlor 0,12,0
   xxlnor 0,0,0
   xxpermdi 0,0,0,2
   ....

the xxpermdi is probably what is really doing the data layout change.

So, we have the issue that looking at the assembly gives different
instruction counts than what

   dg-final { scan-assembler-times {\mxxlor\M} }

comes up with???  Now I am really confused.  I don't know how
scan-assembler-times works but I will go see if I can find it and see if
I can figure out what the issue is.  I would expect that scan-assembler
is working off the -save-temps files, which get deleted as part of the
run.  I would guess that scan-assembler does a grep to find the
instructions and then maybe uses wc to count them???  I will go see if I
can figure out how scan-assembler-times works.

                     Carl
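
For comparing hand counts against the harness on equal footing, here is a minimal sketch in Python. It assumes, without having confirmed it in the testsuite sources, that scan-assembler-times counts every match of the Tcl regular expression over the whole assembler output rather than counting matching lines. The helper names (tcl_to_python, count_matches, count_lines) and the sample text are hypothetical, and only the \m and \M word anchors are translated.

```python
import re


def tcl_to_python(pattern: str) -> str:
    """Translate the Tcl word anchors \\m and \\M into Python's \\b.

    Only these two escapes are handled; this is not a general
    Tcl-to-Python regex converter.
    """
    return pattern.replace(r"\m", r"\b").replace(r"\M", r"\b")


def count_matches(pattern: str, asm_text: str) -> int:
    """Count regex matches over the whole text (assumed harness behaviour)."""
    return len(re.findall(tcl_to_python(pattern), asm_text))


def count_lines(needle: str, asm_text: str) -> int:
    """Rough equivalent of `grep needle | wc -l`: lines containing a substring."""
    return sum(1 for line in asm_text.splitlines() if needle in line)


# Hypothetical assembler fragment: the instruction lines follow the LE
# snippet above, the .string line is made up to show a substring hit.
sample = """\
\tlxvd2x 0,31,9
\txxpermdi 0,0,0,2
\txxlor 0,12,0
\txxlnor 0,0,0
\txxpermdi 0,0,0,2
\t.string "xxlor_helper"
"""

if __name__ == "__main__":
    print(count_matches(r"\mxxlor\M", sample))  # word-anchored matches: 1
    print(count_lines("xxlor", sample))         # substring lines: 2
```

To try it on real output, read a .s file generated with the same options as the test's dg-options line and pass its contents as asm_text. A plain grep for a substring can count lines that the word-anchored pattern would skip, so the two numbers need not agree.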