Matt Turner <matts...@gmail.com> writes:

> On Mon, Oct 14, 2013 at 4:14 PM, Eric Anholt <e...@anholt.net> wrote:
>> Previously, the best thing we had was to schedule the things unblocked
>> by the current instruction, in the hope that it would be consuming two
>> values at the end of their live intervals while only producing one new
>> value.  Sometimes that wasn't the case.
>>
>> Now, when an instruction is the first user of a GRF we schedule (i.e.
>> it will probably be the virtual_grf_def[] instruction after computing
>> live intervals again), penalize it by how many regs it would take up.
>> When an instruction is the last user of a GRF we have to schedule
>> (when it will probably be the virtual_grf_end[] instruction), give it
>> a boost by how many regs it would free.
>>
>> The new functions are made virtual (only 1 of 2 really needs to be
>> virtual) because I expect we'll soon lift the pre-regalloc scheduling
>> heuristic over to the vec4 backend.
>>
>> shader-db:
>> total instructions in shared programs: 1512756 -> 1511604 (-0.08%)
>> instructions in affected programs:     10292 -> 9140 (-11.19%)
>> GAINED:                                121
>> LOST:                                  38
>>
>> Improves Tropics performance at my current settings by 4.50602% +/-
>> 2.60694% (n=5).  No difference on Lightsmark (n=5).  No difference on
>> GLB2.7 (n=11).
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70445
>> ---
>
> I think we're on the right track by considering register pressure when
> scheduling, but one aspect we're not considering is simply how many
> registers we think we're using.
>
> If I understand correctly, the pre-register-allocation scheduler wants
> to shorten live intervals as much as possible, which reduces register
> pressure but at the cost of larger stalls and less instruction-level
> parallelism.
> We end up scheduling things like
>
>   produce result 4
>   produce result 3
>   produce result 2
>   produce result 1
>   use result 1
>   use result 2
>   use result 3
>   use result 4
>
> (This is why the MRF writes for the FB write are always done in
> reverse order.)
>
> Take the main shader from FillTestC24Z16 in GLB2.5 or 2.7 as an
> example.  Before texture-from-grf we serialized the eight texture
> sends.  After that branch landed, we scheduled the code much better,
> leading to a performance improvement.
>
> This patch causes us again to serialize the 8 texture ops in
> GLB25_FillTestC24Z16, like we did before texture-from-grf.  It reduces
> performance from 7.0 billion texels/sec to ~6.5 on IVB.
This is mostly a problem, as far as I can see, of unfortunate GRF choices
between the send sources and dests.  I haven't seen an easy way out of
that beyond what we're doing with the round_robin flag in the register
allocator already, so let's play with scheduling some more for the
moment...

> Can we accurately track the number of registers in use and decide what
> to do based on that?

An attempt to do this is on betterthanlifo-3 of my tree.  The quick
results:

total instructions in shared programs: 1599565 -> 1599757 (0.01%)
instructions in affected programs:     2014 -> 2206 (9.53%)
GAINED:                                22
LOST:                                  110

That's not at all what I hoped for.  But maybe the problem is that we
end up faced with a ton of multiplies of components of texture results
and we don't know which one we should pick next once we've picked one of
them?  Maybe if we give a higher weight to things that will help finish
off a VGRF's use?  I present betterthanlifo-6:

anholt@eliezer:anholt/src/shader-db% ./report.py sched-lifo3 sched-lifo6
total instructions in shared programs: 1606060 -> 1606060 (0.00%)
instructions in affected programs:     0 -> 0
GAINED:                                0
LOST:                                  0

Well, that wasn't the result I was expecting.  But it kinda makes sense:
once we've scheduled processing of .x, the next thing we'll probably
choose even in the absence of weighting is .y, not some *other* texture
result which had been inserted into the list at a totally separate time.

Looking at performance going from betterthanlifo-2 to betterthanlifo-3:

GLB2.7:    1.39845% +/- 0.797931% (n=15/16)
lm:        No difference (n=3)
minecraft: No difference (n=10)
tropics:   -4.12118% +/- 2.48834% (n=4)
nexuiz:    No difference (n=8)
openarena: -1.46747% +/- 1.08201% (n=110)

At this point I think I want to go forward with -2 (this patch) as
opposed to -3.

(Note: results presented in this thread, after the original patch
posting, are on top of glsl-cse, trying to reduce the significance of
that one crazy Tropics shader that spawned all this flailing about in
register allocation.)
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev