Re: [PATCH] rs6000: Update the vsx-vector-6.* tests.

Carl Love via Gcc-patches Fri, 30 Jun 2023 15:21:02 -0700

Kewen:

On Fri, 2023-06-30 at 11:37 +0800, Kewen.Lin wrote:
> Hi Carl,
> 
> on 2023/6/30 05:36, Carl Love wrote:
> > Kewen:
> > 
> > On Wed, 2023-06-28 at 16:35 +0800, Kewen.Lin wrote:
> > > > Yea, I was going with a runnable test and didn't include the
> > > > instruction counts.  Added back in.  Rather than doing it by
> > > > processor version (P8, P9, P10) I was able to do it by BE/LE.  The
> > > > instruction counts were the same for LE across processor versions
> > > > but there are a few instruction counts that vary with BE and LE.
> > > 
> > > But the original test case only checks for cpu-types (processor
> > > version) but not for endianness, which means that for the bif usages
> > > there should not be differences for endianness.  Why does this change
> > > with your new test case?
> > > Could you have a further look and make it consistent with some
> > > adjustment if possible?  As we know, checking insn counts is sometimes
> > > fragile, so I think we should try our best to make it as robust as
> > > possible in the first place.
> > > 
> > > Besides, the original case also has some differences between p7/p8
> > > and p9.
> > >   
> > 
> > There are differences on P8 LE versus BE.  I did a diff between the
> > P8 and P9 tests:
> > 
> >  diff vsx-vector-6.p8.c vsx-vector-6.p9.c
> > 3,4c3,4
> > < /* { dg-require-effective-target powerpc_p8vector_ok } */
> > < /* { dg-options "-O2 -mdejagnu-cpu=power8" } */
> > ---
> > > /* { dg-require-effective-target powerpc_p9vector_ok } */
> > > /* { dg-options "-O2 -mdejagnu-cpu=power9" } */
> > 12c12
> > < /* { dg-final { scan-assembler-times {\mvperm\M} 1 } } */
> > ---
> > > /* { dg-final { scan-assembler-times {\m(?:v|xx)permr?\M} 1 } } */
> > 23d22
> > < /* { dg-final { scan-assembler-times {\mxvmsub[am]dp\M} 1 } } */
> > 37c36
> > < /* { dg-final { scan-assembler-times {\mxvsubdp\M} 1 } } */
> > ---
> > > /* { dg-final { scan-assembler-times {\mxvmsub[am]dp\M} 1 } } */
> > 
> > So we can see the vperm, vpermr, xxpermr, xvmsubadp, xvmsubmdp and
> > xvsubdp instruction count checks are different between the two
> > architectures.  I then wrote a script to compile the CPU specific test
> > on Power 8, Power 9 and Power 10 architectures and then grep for the
> > above list of instructions.  If I run the script on P8 BE and LE I get:
> > 
> >              Power 8 BE     Power 8 LE   Power 9 LE   Power 9 BE   Power 10 LE*
> >              (makalu-lp1)   (genoa)      (marlin)     (nilram)     (ltcd97-lp3)
> > instruction  count          count        count        count        count
> > vperm        1              1            0            0            0
> > vpermr       0              0            0            0            0
> > xxpermr      0              0            1            0            1
> > xvmsubadp    1              0            1            1            1
> > xvmsubmdp    0              1            0            0            0
> > xvsubdp      1              1            1            1            1
> > 
> 
> Thanks for looking into this and gathering these statistics.
> 
> Is there a typo for column nilram?   Otherwise, the below insn check
> 
> /* { dg-final { scan-assembler-times {\m(?:v|xx)permr?\M} 1 } } */
> 
> would fail there.


Yes, there is a typo in the nilram column.  The test generates a vperm
instruction.

#if defined (__BIG_ENDIAN__) || defined (_ARCH_PWR9)
  dst[8].d = vec_perm (src0[8].d, src1[8].d, src2[8].uc);
     f74:       e9 3f 00 78     ld      r9,120(r31)
     f78:       39 29 07 00     addi    r9,r9,1792
     f7c:       f5 89 00 01     lxv     vs12,0(r9)
     f80:       e9 3f 00 80     ld      r9,128(r31)
     f84:       39 29 07 00     addi    r9,r9,1792
     f88:       f4 09 00 01     lxv     vs0,0(r9)
     f8c:       e9 3f 00 88     ld      r9,136(r31)
     f90:       39 29 07 00     addi    r9,r9,1792
     f94:       f4 09 00 89     lxv     vs32,128(r9)
     f98:       e9 3f 00 70     ld      r9,112(r31)
     f9c:       39 29 07 00     addi    r9,r9,1792
     fa0:       f0 2c 64 91     xxmr    vs33,vs12
     fa4:       f1 a0 04 91     xxmr    vs45,vs0
     fa8:       10 01 68 2b     vperm   v0,v1,v13,v0
     ...

> <snip>

> > 
> > I had played with putting -Wno-inline on the command line but that
> > didn't seem to make any difference.  However, your suggestion of
> > __attribute__ ((noipa)) does prevent the inlining and we don't get the
> > second copy of the instructions showing up.  Preventing the inlining
> > eliminated the LE/BE differences for xvmaxsp, xvminsp and xvmaxdp.
> 
> -Winline is an option for a warning: "Warn if a function that is
> declared as inline cannot be inlined."  I think what you wanted is
> -fno-inline, and it's good to know noipa helps here.

Yea, my bad.  Didn't read the manual very carefully.  
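
For reference, here is a minimal sketch of the noipa approach discussed
above.  The function names are made up for illustration and are not taken
from the vsx-vector-6 tests; only the attribute itself comes from this
thread.  It assumes a GCC recent enough to know the noipa attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test helper.  noipa asks GCC not to inline or clone this
   function or apply other interprocedural optimizations to it, so its
   body is expected to show up only once in the assembly output.  */
__attribute__ ((noipa))
void
add_doubles (double *dst, const double *a, const double *b, size_t n)
{
  assert (dst != NULL && a != NULL && b != NULL);
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}

/* Hypothetical caller.  Without noipa, GCC could inline add_doubles
   here and emit a second copy of the add instructions, which would
   change what an instruction count check sees.  */
double
first_sum (const double *a, const double *b)
{
  double tmp[2];
  add_doubles (tmp, a, b, 2);
  return tmp[0];
}
```

By contrast, -fno-inline is a command line option that applies to the
whole translation unit, while the attribute only marks the functions
whose instruction counts are being checked.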
> 
> > The instruction count test for xxlor in vsx-vector-6-func-2lop.c
> > differs on LE and BE.  I believe the instruction is used with loads to
> > reorder the data.  I don't see any way to get around the extra xxlor
> > instructions and verify the vec_or builtin test generates the
> > instruction.
> > 
> 
> OK, I'm still curious how the loads cause the difference.

Yea, looks like there is something screwy going on.  So, I started by
running the test:

 make -j 1 && make check-gcc RUNTESTFLAGS="-v -v powerpc.exp=vsx-vector-6-func-2lop.c " > out

on Makalu, P8 BE and verified the test gives 7 passes and no failures.

on genoa, P8 LE, I also verified the test gives 7 passes and no
failures.

Then I went in and intentionally lowered the expected counts by one
for each platform.  The idea was to verify that the
dg-final { scan-assembler-times {\mxxlor\M} } check was being run.

on Makalu, I now get an error, as expected:

check_cached_effective_target be: returning 1 for unix
is-effective-target: be 1                                <<<< NOTE BE
gcc.target/powerpc/vsx-vector-6-func-2lop.c: \\mxxlor\\M found 32 times
FAIL: gcc.target/powerpc/vsx-vector-6-func-2lop.c scan-assembler-times \\mxxlor\\M 31

on Genoa, I now get the error, as expected:

check_cached_effective_target le: returning 1 for unix
is-effective-target: le 1
gcc.target/powerpc/vsx-vector-6-func-2lop.c: \\mxxlor\\M found 22 times
FAIL: gcc.target/powerpc/vsx-vector-6-func-2lop.c scan-assembler-times \\mxxlor\\M 21

So, running the tests, gcc definitely thinks there should be 22 xxlor
on LE and 32 on BE.  

So, I went to look at the assembly to verify my comment that the
difference is related to the loads.  I decided to actually count the
instructions just to verify the number in the assembly files.  Before,
I just looked at the assembly briefly but didn't dig in very deep.

If I compile the tests and dump the assembly with:
  gcc -g -mcpu=power8 -o vsx-vector-6-func-2lop vsx-vector-6-func-2lop.c

  objdump -S -d vsx-vector-6-func-2lop > vsx-vector-6-func-2lop.dump
  
  grep xxlor vsx-vector-6-func-2lop.dump | wc
      4      28     192

So we see 4 xxlor instructions, not 32 as expected for BE or 22 as
expected for LE as the test claims.  I get the same count of 4 on both
makalu and genoa.  I like this approach because you can easily see
the relationship between the source and the assembly.  So, there seems
to be something screwy here as that is not even close to what the make
script/scan-assembler thinks the counts should be.

Segher never liked the above way of looking at the assembly.  He
prefers:
  gcc -S -g -mcpu=power8 -o vsx-vector-6-func-2lop.s vsx-vector-6-func-2lop.c

  grep xxlor vsx-vector-6-func-2lop.s | wc
     34      68     516

So, again, I get the same count of 34 on both makalu and genoa.  But
again, that doesn't agree with what make script/scan-assembler thinks
the counts should be.

When I looked at the vsx-vector-6-func-2lop.s I see on BE:

     ....
    lxvd2x 0,10,9
    xxlor 0,12,0
    xxlnor 0,0,0
     ...

I was guessing that it was adjusting the data layout from the load. 
But looking again more carefully versus LE:

    ....
    lxvd2x 0,31,9
    xxpermdi 0,0,0,2
    xxlor 0,12,0
    xxlnor 0,0,0
    xxpermdi 0,0,0,2
    ....

the xxpermdi is probably what is really doing the data layout change.
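
To make the doubleword swap concrete, here is a small C model of what
xxpermdi with the same register for both sources and DM = 2 does, as I
understand the ISA description.  The type and function names are made up
for illustration; this is a sketch of the instruction semantics, not
anything taken from GCC or from the test.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit VSX register modelled as two 64-bit doublewords.
   dw[0] is doubleword 0 in ISA numbering.  */
typedef struct
{
  uint64_t dw[2];
} vsx_reg;

/* Model of xxpermdi XT,XA,XB,DM: the high bit of DM picks which
   doubleword of XA goes to doubleword 0 of the result, and the low
   bit picks which doubleword of XB goes to doubleword 1.  */
vsx_reg
model_xxpermdi (vsx_reg a, vsx_reg b, unsigned dm)
{
  assert (dm <= 3);
  vsx_reg t;
  t.dw[0] = (dm & 2) ? a.dw[1] : a.dw[0];
  t.dw[1] = (dm & 1) ? b.dw[1] : b.dw[0];
  return t;
}
```

With a == b and dm == 2 the result has the two doublewords exchanged, and
applying it twice gives back the original value.  That would be
consistent with one xxpermdi after the lxvd2x and one more after the
logical operations in the LE sequence above.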

So, we have the issue that looking at the assembly gives different
instruction counts than what

   dg-final { scan-assembler-times {\mxxlor\M} }

comes up with???  Now I am really confused.  I don't know how
scan-assembler-times works but I will go see if I can find it and
figure out what the issue is.  I would expect that scan-assembler is
working off the -save-temps files, which get deleted as part of the
run.  I would guess that scan-assembler does a grep to find the
instructions and then maybe uses wc to count them???

                          Carl
