2015-05-06 17:18 GMT+03:00 Ilya Enkovich <enkovich....@gmail.com>:
> 2015-04-25 4:32 GMT+03:00 Jan Hubicka <hubi...@ucw.cz>:
>> Hi,
>> I am adding Vladimir and Richard into CC. I tried to solve a similar problem
>> with FP math years ago by having -mfpmath=sse,i387. The idea was to allow
>> use of i387 registers when SSE ones run out and possibly also model the fact
>> that Pentium4 had faster i387 additions than SSE additions. I also had some
>> plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
>> got to that.
>>
>> This did not really fly because of the regalloc not really being able to
>> understand it (I made a patch to regclass to propagate the classes and
>> figure out what operations need to stay in i387 and what in SSE to avoid
>> reloading, but that never got in).
>>
>> I believe Vladimir did some work on this with IRA (he is able to spill GPR
>> regs into SSE and do a bit of other tricks).
>>
>> Also I believe it was kind of Richard's design decision to avoid use of
>> (paradoxical) subregs for vector conversions because these have funny
>> implications.
>>
>> The code for handling upper parts of paradoxical subregs is controlled by
>> macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
>> V1DI->V2DI conversions fluently without some middle-end hacking. (it will
>> probably try to produce zero extensions)
>>
>> When we are on SSE instructions, it would be great to finally teach
>> copy_by_pieces/store_by_pieces to use vector instructions (these are more
>> compact and either equally fast or faster on some CPUs). I hope to get into
>> this, but it would be great if someone beat me.
>>
>> Honza
>>
>
> I'm trying to implement it as a separate RTL pass which chooses a
> scalar/vector mode for each 64bit computation chain and performs
> transformation if we choose to use vectors. I also want to split DI
> instructions which are going to be implemented on GPRs before RA
> (currently it is done on the second split).
> Good metrics for such transformation is a big question but currently I
> can't even make it generate correct code when paradoxical subregs are
> used. It works in simple cases but I get troubles when spills appear.
>
> Trying to beat the following testcase:
>
> test (long long *arr)
> {
>   register unsigned long long tmp;
>   tmp = arr[0] | arr[1] & arr[2];
>   while (tmp)
>     {
>       counter (tmp);
>       tmp = *(arr++) & tmp;
>     }
> }
>
> RTL I generate seems OK to me (ignoring the fact that it is not optimal):
>
> (insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                 (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
>      (nil))
> (insn 50 6 7 2 (set (reg:DI 104)
>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                 (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
>      (nil))
> (insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>         (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0)
>             (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
>      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                             (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])
>                     (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                             (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64]))
>                 (nil)))))
> (insn 51 7 8 2 (set (reg:DI 105)
>         (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64])) pr65105-1.c:22 -1
>      (nil))
> (insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>         (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>             (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
>      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> (insn 46 8 47 2 (set (reg:V2DI 103)
>         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
>      (nil))
> (insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
>         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
>      (nil))
> (insn 48 47 49 2 (set (reg:V2DI 103)
>         (lshiftrt:V2DI (reg:V2DI 103)
>             (const_int 32 [0x20]))) pr65105-1.c:22 -1
>      (nil))
> (insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
>         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
>      (nil))
> (note 9 49 10 2 NOTE_INSN_DELETED)
> (insn 10 9 11 2 (parallel [
>             (set (reg:CCZ 17 flags)
>                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
>                         (subreg:SI (reg:DI 101) 0))
>                     (const_int 0 [0])))
>             (clobber (scratch:SI))
>         ]) pr65105-1.c:23 447 {*iorsi_3}
>      (nil))
> (jump_insn 11 10 37 2 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref:SI 37)
>             (pc))) pr65105-1.c:23 619 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (int_list:REG_BR_PROB 9100 (nil)))
>  -> 37)
> (code_label 37 11 36 3 11 "" [2 uses])
> (note 36 37 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 18 36 19 3 (set (mem:DI (reg/f:SI 7 sp) [0 S8 A32])
>         (reg/v:DI 87 [ tmp ])) pr65105-1.c:25 89 {*movdi_internal}
>      (nil))
> (call_insn 19 18 20 3 (call (mem:QI (symbol_ref:SI ("counter") [flags 0x3] <function_decl 0x7f94046ea798 counter>) [0 counter S1 A8])
>         (const_int 8 [0x8])) pr65105-1.c:25 666 {*call}
>      (expr_list:REG_CALL_DECL (symbol_ref:SI ("counter") [flags 0x3] <function_decl 0x7f94046ea798 counter>)
>         (expr_list:REG_EH_REGION (const_int 0 [0])
>             (nil)))
>     (expr_list:DI (use (mem:DI (reg/f:SI 7 sp) [0 S8 A32]))
>         (nil)))
> (insn 20 19 52 3 (parallel [
>             (set (reg/v/f:SI 96 [ arr ])
>                 (plus:SI (reg/v/f:SI 96 [ arr ])
>                     (const_int 8 [0x8])))
>             (clobber (reg:CC 17 flags))
>         ]) pr65105-1.c:26 220 {*addsi_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 52 20 21 3 (set (reg:DI 106)
>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                 (const_int -8 [0xfffffffffffffff8])) [2 MEM[base: arr_14, offset: 4294967288B]+0 S8 A64])) pr65105-1.c:26 -1
>      (nil))
> (insn 21 52 42 3 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>         (and:V2DI (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>             (subreg:V2DI (reg:DI 106) 0))) pr65105-1.c:26 3487 {*andv2di3}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 42 21 43 3 (set (reg:V2DI 102)
>         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:26 -1
>      (nil))
> (insn 43 42 44 3 (set (subreg:SI (reg:DI 101) 0)
>         (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
>      (nil))
> (insn 44 43 45 3 (set (reg:V2DI 102)
>         (lshiftrt:V2DI (reg:V2DI 102)
>             (const_int 32 [0x20]))) pr65105-1.c:26 -1
>      (nil))
> (insn 45 44 23 3 (set (subreg:SI (reg:DI 101) 4)
>         (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
>      (nil))
> (note 23 45 24 3 NOTE_INSN_DELETED)
> (insn 24 23 25 3 (parallel [
>             (set (reg:CCZ 17 flags)
>                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
>                         (subreg:SI (reg:DI 101) 0))
>                     (const_int 0 [0])))
>             (clobber (scratch:SI))
>         ]) pr65105-1.c:23 447 {*iorsi_3}
>      (nil))
> (jump_insn 25 24 30 3 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref:SI 37)
>             (pc))) pr65105-1.c:23 619 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (int_list:REG_BR_PROB 9100 (nil)))
>  -> 37)
>
>
> r87 [tmp] has one definition before the loop (insn 8) and one
> definition in the loop (insn 21). But after reload I see that insn 8
> result is stored into stack and this stored value is used in the loop.
> But value produced in insn 21 is not stored into stack and therefore
> wrong value is used starting from the second loop iteration. Here is
> the resulting assembler:
>
> test:
> .LFB10:
>         .cfi_startproc
>         pushl %ebx
>         .cfi_def_cfa_offset 8
>         .cfi_offset 3, -8
>         leal -40(%esp), %esp
>         .cfi_def_cfa_offset 48
>         movl 48(%esp), %ebx
>         movq 8(%ebx), %xmm1
>         movq 16(%ebx), %xmm0
>         pand %xmm1, %xmm0
>         movq (%ebx), %xmm1
>         movdqa %xmm0, %xmm4
>         por %xmm1, %xmm4
>         movdqa %xmm4, %xmm0
>         movd %xmm4, %edx
>         **movq %xmm4, 16(%esp)**
>         psrlq $32, %xmm0
>         movd %xmm0, %eax
>         orl %edx, %eax
>         je .L7
>         .p2align 4,,15
> .L11:
>         **movl 16(%esp), %eax**
>         addl $8, %ebx
>         **movl 20(%esp), %edx**
>         movl %eax, (%esp)
>         movl %edx, 4(%esp)
>         call counter
>         movq -8(%ebx), %xmm0
>         **movdqa 16(%esp), %xmm2**
>         pand %xmm0, %xmm2
>         movdqa %xmm2, %xmm0
>         movd %xmm2, %edx
>         psrlq $32, %xmm0
>         movd %xmm0, %eax
>         orl %edx, %eax
>         jne .L11
> .L7:
>         leal 40(%esp), %esp
>         .cfi_def_cfa_offset 8
>         popl %ebx
>         .cfi_restore 3
>         .cfi_def_cfa_offset 4
>         ret
>
> Do I misuse paradoxical subregs? Is there any other way to mix scalar
> and vector code and perform vector casts?
>
> BTW this test works OK on another optset when r87 is not spilled into
> a memory but is preserved on GPRs through the call instead.
>
> Thanks,
> Ilya

Hi Vladimir,

Could you please comment on this?

Thanks,
Ilya
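
P.S. In case it helps with reproducing, here is a minimal self-contained sketch of the quoted testcase. The body of counter, the seen/calls globals and MAX_SEEN are assumptions added only so that the values passed on each iteration can be observed; they are not taken from pr65105-1.c. The explicit parentheses only spell out the precedence the original expression already has.

```c
#include <stddef.h>

/* Hypothetical recorder: remembers what each call to counter() received. */
#define MAX_SEEN 16
unsigned long long seen[MAX_SEEN];
size_t calls;

void counter(unsigned long long v)
{
    if (calls < MAX_SEEN)
        seen[calls] = v;
    calls++;
}

/* Same loop as the quoted testcase: tmp is defined once before the loop
   and redefined on every iteration, so each call must see the new value. */
void test(long long *arr)
{
    register unsigned long long tmp;

    tmp = arr[0] | (arr[1] & arr[2]);
    while (tmp) {
        counter(tmp);
        tmp = *(arr++) & tmp;
    }
}
```

With an input such as {3, 1, 1, 0}, a correct build should pass 3, 3, 1, 1 to counter. If the stale stack slot is reloaded as in the assembler above, I would expect counter to see the first value on every call instead.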