Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency

H.J. Lu Sun, 30 Dec 2018 08:51:23 -0800

On Wed, Nov 28, 2018 at 12:21 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> > >>
> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>>>>
> > >>>>> Did you mean "the nearest common dominator"?
> > >>>>
> > >>>> If the nearest common dominator appears in the loop while all uses are
> > >>>> out of loops, this will result in suboptimal xor placement.
> > >>>> In this case you want to split edges out of the loop.
> > >>>>
> > >>>> In general this is what the LCM framework will do for you if the
> > >>>> problem is modelled in a similar way as in mode_switching.  At
> > >>>> function entry the mode is "no zero register needed" and all
> > >>>> conversions need mode "zero register needed".  Mode switching should
> > >>>> then make the correct placement decisions (reaching the minimal
> > >>>> number of executions of xor).
> > >>>>
> > >>>> Jeff, what is your opinion on the approach taken by the patch?
> > >>>> It seems like a special case of a more general issue, but I do not see
> > >>>> a very elegant way to solve it, at least in the GCC 9 horizon, so if
> > >>>> the placement is correct we can probably go either with a new pass or
> > >>>> making this part of mode switching (which is anyway run by the x86 backend).
> > >>> So I haven't followed this discussion at all, but did touch on this
> > >>> issue with some patch a month or two ago with a target patch that was
> > >>> trying to avoid the partial stalls.
> > >>>
> > >>> My assumption is that we're trying to find one or more places to
> > >>> initialize the upper half of an avx register so as to avoid partial
> > >>> register stall at existing sites that set the upper half.
> > >>>
> > >>> This sounds like a classic PRE/LCM style problem (of which mode
> > >>> switching is just another variant).   A common-dominator approach is
> > >>> closer to a classic GCSE and is going to result in more initializations
> > >>> at sub-optimal points than a PRE/LCM style.
> > >>
> > >> Yes, it is the usual code placement problem.  It is a special case because
> > >> the zero register is not modified by the conversion (we just need to have
> > >> zero somewhere).  So basically we do not have kills of the zero except
> > >> for the entry block.
> > >>
> > >
> > > Do you have a testcase to show that the nearest common dominator
> > > in the loop, while all uses are out of loops, leads to suboptimal xor
> > > placement?
> > I don't have a testcase, but it's all but certain nearest common
> > dominator is going to be a suboptimal placement.  That's going to create
> > paths where you're going to emit the xor when it's not used.
> >
> > The whole point of the LCM algorithms is they are optimal in terms of
> > expression evaluations.
>
> I think the testcase should be something like
>
> test()
> {
>   while (true)
>   {
>      if (cond1)
>        {
>          do_one_conversion;
>          return;
>        }
>      if (cond2)
>        {
>          do_other_conversion;
>          return;
>        }
>   }
> }
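
For reference, a compilable C rendering of the sketch above might look like
the following.  This is only a minimal sketch: the globals out1 and out2, the
input pointer v, and the concrete conditions standing in for cond1 and cond2
are assumptions made for illustration and do not come from the quoted message.
The point it preserves is that both int-to-float conversions sit on paths
that leave the loop, while their nearest common dominator is inside the loop.

```c
/* Hypothetical sinks for the converted values (not from the thread). */
float out1, out2;

/* Scan v until one of two exit conditions fires.  Each exit performs a
   single int -> float conversion and then leaves the loop, so neither
   conversion is executed more than once per call.  */
void
test (const int *v)
{
  for (;;)
    {
      int x = *v++;

      if (x > 100)		/* stands in for cond1 */
	{
	  out1 = x;		/* do_one_conversion */
	  return;
	}
      if (x < 0)		/* stands in for cond2 */
	{
	  out2 = x;		/* do_other_conversion */
	  return;
	}
    }
}
```

Compiling such a file with -O2 -mavx -S and checking where the vxorps lands
relative to the loop would show whether the zeroing is emitted on the loop
exits or inside the loop body.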


We got

[hjl@gnu-cfl-1 pr87007]$ cat test2.i
extern float f1[],f2[];
extern int i1[],i2[];
float
foo (int k, int n[])
{
  if (k == 1)
    return 1;

  if (k == 4)
    return 5;

  for(int i = 0; i != k; i++){
    if(n[i] > 100)
      f1[i] = i1[i];
    else
      f2[i] = i2[i];
  }

  return k;
}
[hjl@gnu-cfl-1 pr87007]$ make test2.s
/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/ -O2
-mavx -S test2.i
[hjl@gnu-cfl-1 pr87007]$ cat test2.s
	.file	"test2.i"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	vmovss	.LC0(%rip), %xmm0
	cmpl	$1, %edi
	je	.L15
	vmovss	.LC1(%rip), %xmm0
	cmpl	$4, %edi
	je	.L15
	vxorps	%xmm0, %xmm0, %xmm0
	testl	%edi, %edi
	je	.L3
	leal	-1(%rdi), %ecx
	xorl	%eax, %eax
	jmp	.L6
	.p2align 4,,10
	.p2align 3
.L17:
	vcvtsi2ss	i1(,%rax,4), %xmm0, %xmm1
	leaq	1(%rax), %rdx
	vmovss	%xmm1, f1(,%rax,4)
	cmpq	%rcx, %rax
	je	.L3
.L9:
	movq	%rdx, %rax
.L6:
	cmpl	$100, (%rsi,%rax,4)
	jg	.L17
	vcvtsi2ss	i2(,%rax,4), %xmm0, %xmm1
	leaq	1(%rax), %rdx
	vmovss	%xmm1, f2(,%rax,4)
	cmpq	%rcx, %rax
	jne	.L9
.L3:
	vcvtsi2ss	%edi, %xmm0, %xmm0
.L15:
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.section	.rodata.cst4,"aM",@progbits,4
	.align 4
.LC0:
	.long	1065353216
	.align 4
.LC1:
	.long	1084227584
	.ident	"GCC: (GNU) 9.0.0 20181230 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 pr87007]$

The placement is optimal.

-- 
H.J.
