On Wed, Nov 28, 2018 at 12:21 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> > >>
> > >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > >>>>>
> > >>>>> Did you mean "the nearest common dominator"?
> > >>>>
> > >>>> If the nearest common dominator appears in the loop while all uses are
> > >>>> out of loops, this will result in suboptimal xor placement.
> > >>>> In this case you want to split edges out of the loop.
> > >>>>
> > >>>> In general this is what the LCM framework will do for you if the
> > >>>> problem is modelled in a similar way as in mode_switching.  At function
> > >>>> entry the mode is "no zero register needed" and all conversions need
> > >>>> mode "zero register needed".  Mode switching should then make the
> > >>>> correct placement decisions (reaching the minimal number of executions
> > >>>> of xor).
> > >>>>
> > >>>> Jeff, what is your opinion on the approach taken by the patch?
> > >>>> It seems like a special case of a more general issue, but I do not see
> > >>>> a very elegant way to solve it, at least on the GCC 9 horizon, so if
> > >>>> the placement is correct we can probably go either with a new pass or
> > >>>> making this part of mode switching (which is run by the x86 backend
> > >>>> anyway)
> > >>> So I haven't followed this discussion at all, but did touch on this
> > >>> issue with some patch a month or two ago with a target patch that was
> > >>> trying to avoid the partial stalls.
> > >>>
> > >>> My assumption is that we're trying to find one or more places to
> > >>> initialize the upper half of an avx register so as to avoid partial
> > >>> register stall at existing sites that set the upper half.
> > >>>
> > >>> This sounds like a classic PRE/LCM style problem (of which mode
> > >>> switching is just another variant).  A common-dominator approach is
> > >>> closer to a classic GCSE and is going to result in more initializations
> > >>> at sub-optimal points than a PRE/LCM style.
> > >>
> > >> Yes, it is the usual code placement problem.  It is a special case
> > >> because the zero register is not modified by the conversion (we just
> > >> need to have zero somewhere).  So basically we do not have kills to the
> > >> zero except for the entry block.
> > >>
> > >
> > > Do you have a testcase to show that the nearest common dominator
> > > in the loop, while all uses are out of loops, leads to suboptimal xor
> > > placement?
> > I don't have a testcase, but it's all but certain nearest common
> > dominator is going to be a suboptimal placement.  That's going to create
> > paths where you're going to emit the xor when it's not used.
> >
> > The whole point of the LCM algorithms is they are optimal in terms of
> > expression evaluations.
>
> I think the testcase should be something like
>
> test()
> {
>   while (true)
>     {
>       if (cond1)
>         {
>           do_one_conversion;
>           return;
>         }
>       if (cond2)
>         {
>           do_other_conversion;
>           return;
>         }
>     }
> }

We got

[hjl@gnu-cfl-1 pr87007]$ cat test2.i
extern float f1[],f2[];
extern int i1[],i2[];

float
foo (int k, int n[])
{
  if (k == 1)
    return 1;
  if (k == 4)
    return 5;
  for(int i = 0; i != k; i++){
    if(n[i] > 100)
      f1[i] = i1[i];
    else
      f2[i] = i2[i];
  }
  return k;
}
[hjl@gnu-cfl-1 pr87007]$ make test2.s
/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/ -O2 -mavx -S test2.i
[hjl@gnu-cfl-1 pr87007]$ cat test2.s
	.file	"test2.i"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	vmovss	.LC0(%rip), %xmm0
	cmpl	$1, %edi
	je	.L15
	vmovss	.LC1(%rip), %xmm0
	cmpl	$4, %edi
	je	.L15
	vxorps	%xmm0, %xmm0, %xmm0
	testl	%edi, %edi
	je	.L3
	leal	-1(%rdi), %ecx
	xorl	%eax, %eax
	jmp	.L6
	.p2align 4,,10
	.p2align 3
.L17:
	vcvtsi2ss	i1(,%rax,4), %xmm0, %xmm1
	leaq	1(%rax), %rdx
	vmovss	%xmm1, f1(,%rax,4)
	cmpq	%rcx, %rax
	je	.L3
.L9:
	movq	%rdx, %rax
.L6:
	cmpl	$100, (%rsi,%rax,4)
	jg	.L17
	vcvtsi2ss	i2(,%rax,4), %xmm0, %xmm1
	leaq	1(%rax), %rdx
	vmovss	%xmm1, f2(,%rax,4)
	cmpq	%rcx, %rax
	jne	.L9
.L3:
	vcvtsi2ss	%edi, %xmm0, %xmm0
.L15:
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.section	.rodata.cst4,"aM",@progbits,4
	.align 4
.LC0:
	.long	1065353216
	.align 4
.LC1:
	.long	1084227584
	.ident	"GCC: (GNU) 9.0.0 20181230 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 pr87007]$

The placement is optimal.

-- 
H.J.
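
For the loop shape sketched in the quoted testcase (two exits, each with
one conversion), the difference between the two placements can be
illustrated with a small counting model.  This is only a sketch under
stated assumptions, not GCC code and not the patch: the names `run`,
`struct counts`, `IN_LOOP_HEADER` and `ON_EXIT_EDGES` are hypothetical,
the two exit conditions are stand-ins for cond1/cond2, and the loop is
bounded by `len` instead of `while (true)` so that it terminates.  It
counts how often the zeroing xor would execute if it sat at the nearest
common dominator of both conversions (the loop header in this shape)
versus on the split exit edges.

```c
#include <assert.h>

/* How often the zeroing xor and the conversions execute in one run.  */
struct counts { int xors; int conversions; };

/* Hypothetical labels for the two placements being compared.  */
enum placement { IN_LOOP_HEADER, ON_EXIT_EDGES };

/* Model of the sketched loop: iterate until n[i] triggers one of the
   two exits, each of which performs exactly one conversion.  */
struct counts
run (const int *n, int len, enum placement where)
{
  struct counts c = { 0, 0 };
  assert (len >= 0);
  for (int i = 0; i < len; i++)
    {
      if (where == IN_LOOP_HEADER)
        c.xors++;               /* dominator in the loop: every iteration */
      if (n[i] > 100 || n[i] < 0)   /* stand-ins for cond1 / cond2 */
        {
          if (where == ON_EXIT_EDGES)
            c.xors++;           /* split exit edge: once, just before the use */
          c.conversions++;
          return c;
        }
    }
  return c;                     /* no exit taken: no conversion ran */
}
```

With an input whose fourth element takes an exit, the header placement
counts one xor per iteration while the exit-edge placement counts one in
total; with an input that never takes an exit, the header placement
still counts xors although no conversion runs.  That is the "xor emitted
when it's not used" path mentioned in the quoted text.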