On Mon, Oct 14, 2013 at 4:14 PM, Eric Anholt <e...@anholt.net> wrote:
> Previously, the best thing we had was to schedule the things unblocked by
> the current instruction, on the hope that it would be consuming two values
> at the end of their live intervals while only producing one new value.
> Sometimes that wasn't the case.
>
> Now, when an instruction is the first user of a GRF we schedule (i.e. it
> will probably be the virtual_grf_def[] instruction after computing live
> intervals again), penalize it by how many regs it would take up. When an
> instruction is the last user of a GRF we have to schedule (when it will
> probably be the virtual_grf_end[] instruction), give it a boost by how
> many regs it would free.
>
> The new functions are made virtual (only 1 of 2 really needs to be
> virtual) because I expect we'll soon lift the pre-regalloc scheduling
> heuristic over to the vec4 backend.
>
> shader-db:
> total instructions in shared programs: 1512756 -> 1511604 (-0.08%)
> instructions in affected programs:     10292 -> 9140 (-11.19%)
> GAINED: 121
> LOST:   38
>
> Improves tropics performance at my current settings by 4.50602% +/-
> 2.60694% (n=5). No difference on Lightsmark (n=5). No difference on
> GLB2.7 (n=11).
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70445
> ---

I think we're on the right track by considering register pressure when
scheduling, but one aspect we're not considering is simply how many
registers we think we're using.

If I understand correctly, the pre-register-allocation scheduling pass
wants to shorten live intervals as much as possible, which reduces
register pressure but at the cost of larger stalls and less
instruction-level parallelism. We end up scheduling things like

   produce result 4
   produce result 3
   produce result 2
   produce result 1
   use result 1
   use result 2
   use result 3
   use result 4

(this is why the MRF writes for the FB write are always done in the
reverse order)

Take the main shader from FillTestC24Z16 in GLB2.5 or 2.7 as an example.
Before texture-from-grf we serialized the eight texture sends. After that
branch landed, we scheduled the code much better, leading to a
performance improvement.

This patch causes us to serialize the 8 texture ops in
GLB25_FillTestC24Z16 again, like we did before texture-from-grf. It
reduces performance from 7.0 billion texels/sec to ~6.5 on IVB.

The shader in question is structured, prior to scheduling, as

   16 PLNs to interpolate the texture coordinates
      - 10 registers consumed, 16 results produced
   8 TEX
      - 16 registers consumed, 32 results produced
   28 ADDs to sum the texture results into gl_FragColor
      - 32 registers consumed, 4 results produced
   FB write
      - 4 registers consumed

Even doubling these numbers for SIMD16, we don't spill. There's no need
to reduce live ranges, and therefore ILP, for this shader.

Can we accurately track the number of registers in use and decide what
to do based on that?
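
To make that last question a bit more concrete, here is a rough sketch in
Python (not Mesa code) of what "track the number of registers in use"
could mean: walk a candidate schedule, keep a set of live virtual
registers, and record the peak. Everything here is hypothetical and only
for illustration: the Inst class, peak_pressure(),
needs_pressure_scheduling(), and the budget parameter are made-up names,
and the liveness model is simplified to straight-line code where each
register is written once and freed after its last reader.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    """Hypothetical stand-in for a scheduled instruction."""
    name: str
    srcs: list = field(default_factory=list)  # virtual registers read
    dsts: list = field(default_factory=list)  # virtual registers written


def peak_pressure(schedule, sizes=None):
    """Return the largest number of registers live at any one instruction.

    `sizes` optionally maps a virtual register to how many hardware
    registers it occupies; anything not listed counts as 1.
    """
    sizes = sizes or {}

    # Position of the last instruction that reads each register.
    last_use = {}
    for i, inst in enumerate(schedule):
        for reg in inst.srcs:
            last_use[reg] = i

    live = set()
    peak = 0
    for i, inst in enumerate(schedule):
        # Results become live at the instruction that produces them.
        live.update(inst.dsts)
        peak = max(peak, sum(sizes.get(reg, 1) for reg in live))
        # Free anything whose last reader is this instruction, plus
        # results that are never read at all.
        for reg in list(live):
            if last_use.get(reg, i) <= i:
                live.discard(reg)
    return peak


def needs_pressure_scheduling(schedule, budget, sizes=None):
    """Only worth shortening live ranges if the peak exceeds the budget."""
    return peak_pressure(schedule, sizes) > budget


# The "produce 4..1, then use 1..4" pattern from the mail above.
produce = [Inst(f"produce {n}", dsts=[f"r{n}"]) for n in (4, 3, 2, 1)]
use = [Inst(f"use {n}", srcs=[f"r{n}"]) for n in (1, 2, 3, 4)]
grouped = produce + use

# The fully serialized alternative: each result is used right away.
interleaved = []
for n in (1, 2, 3, 4):
    interleaved.append(Inst(f"produce {n}", dsts=[f"r{n}"]))
    interleaved.append(Inst(f"use {n}", srcs=[f"r{n}"]))

print(peak_pressure(grouped))      # 4
print(peak_pressure(interleaved))  # 1
```

With a check like needs_pressure_scheduling(), the grouped order would
be left alone whenever its peak fits in the budget, and the
pressure-reducing heuristic would only kick in when it does not. What
the budget should be for a given shader and dispatch width is exactly
the part this sketch does not answer.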