Hi,

I am trying to improve register spilling for fp64 programs, specifically for the varying-packing-simple piglit test with double types. Because this test uses all available varying slots, register pressure is significant and spilling is necessary for it to pass, even for non-fp64 types.

The main obstacle for this to work with fp64 types is that the current register spilling process discards registers that are not contiguous (stride != 1) and fp64 needs to use strides of 2 all the time.

According to the comment in brw_fs_reg_allocate.cpp, this restriction is to avoid generating bad assembly for smeared registers (stride = 0), so in theory we should be able to just change the condition to only disallow spilling of registers with stride 0.

That works, mostly. Specifically, we no longer fail to compile and the test passes for a number of varyings up to ~120. Up to this point, it seems the test only needs to spill registers that write with a stride of 2 and read with a stride of 1, which seems to work fine. However, beyond that point, we need to spill registers that are read with a stride of 2, and that fails consistently. Here is a minimal sample that fails:

0: add(8) vgrf1:D, g2<0>:D, 1d
1: mov(8) vgrf5:DF, vgrf1:D
2: gen4_scratch_write(8) (mlen: 2) null:F, vgrf5+0.0:F (offset = 0)
3: gen4_scratch_write(8) (mlen: 2) null:F, vgrf5+1.0:F (offset = 32)
4: mov(8) vgrf3+0.0:UD, g1:UD NoMask WE_all
5: mov(8) vgrf3+1.0:F, g3:F
6: mov(8) vgrf3+2.0:F, g4:F
7: mov(8) vgrf3+3.0:F, g5:F
8: mov(8) vgrf3+4.0:F, g6:F
9: gen7_scratch_read(8) vgrf6+0.0:F, (offset = 0)
10: gen7_scratch_read(8) vgrf6+1.0:F, (offset = 32)
11: mov(8) vgrf3+5.0:F, vgrf6<2>:F
12: gen7_scratch_read(8) vgrf7+0.0:F, (offset = 0)
13: gen7_scratch_read(8) vgrf7+1.0:F, (offset = 32)
14: mov(8) vgrf3+6.0:F, vgrf7+0.4<2>:F
15: gen8_urb_write_simd8(8) (mlen: 9) (null):UD, vgrf3:F

Since DF registers take twice as much space, the spilling code needs 2 writes and 2 reads for each spill/unspill.

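To make the dependency concrete, here is a minimal standalone sketch (not Mesa code) that computes which registers of a VGRF a strided 8-channel read touches. The helper name regs_touched and its parameters are hypothetical, and it assumes 32-byte registers and that the ".4" in vgrf7+0.4 is a byte offset within the register. Under those assumptions, both strided MOVs in the sample (instructions 11 and 14) read from both halves of the unspilled register, so each one needs both of its scratch reads:

```cpp
#include <set>

// Hypothetical helper for illustration only; not part of Mesa.
// Returns the register indices (relative to the VGRF base) touched by a
// read of `channels` elements of `type_size` bytes each, starting at
// `byte_offset` and advancing `stride` elements per channel.
// Assumption: registers are 32 bytes wide.
std::set<int> regs_touched(int byte_offset, int stride, int channels,
                           int type_size, int reg_size = 32)
{
    std::set<int> regs;
    for (int ch = 0; ch < channels; ch++) {
        // First and last byte read by this channel.
        int first = byte_offset + ch * stride * type_size;
        int last = first + type_size - 1;
        regs.insert(first / reg_size);
        regs.insert(last / reg_size);
    }
    return regs;
}
```

For example, regs_touched(0, 2, 8, 4) models vgrf6<2>:F and regs_touched(4, 2, 8, 4) models vgrf7+0.4<2>:F; both return registers 0 and 1, whereas a contiguous read, regs_touched(0, 1, 8, 4), stays within register 0.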
The final assembly for the reads looks like this:

send(8) g8<1>UW g0<8,8,1>F data ( DC OWORD block read, 0, 0) mlen 1 rlen 1 { align1 1Q };
send(8) g9<1>UW g0<8,8,1>F data ( DC OWORD block read, 1, 0) mlen 1 rlen 1 { align1 1Q };
mov(8) g122<1>F g8<8,4,2>F { align1 1Q };
send(8) g9<1>UW g0<8,8,1>F data ( DC OWORD block read, 0, 0) mlen 1 rlen 1 { align1 1Q };
send(8) g10<1>UW g0<8,8,1>F data ( DC OWORD block read, 1, 0) mlen 1 rlen 1 { align1 1Q };
mov(8) g123<1>F g9.1<8,4,2>F { align1 1Q };
send(8) null<1>F g117<8,8,1>F urb 1 SIMD8 write mlen 9 rlen 0 { align1 1Q EOT };

All this looks correct to me, so I was wondering if the issue here could be related to the hardware not seeing that it needs both scratch reads to complete before it emits the MOV that reads vgrf6<2>:F.

I have tried various strategies to force a dependency between the strided read and the result of both scratch reads (mostly adding code that uses both scratch reads results before we do the strided MOV) but nothing seems to make any difference, so the problem might have nothing to do with this in the end.

If this is not the problem, then I don't see what could be causing something like this to fail... Any ideas as to what could be going on here?

Thanks,
Iago