On Wednesday, December 14, 2016 2:18:16 PM PST Francisco Jerez wrote:
> Francisco Jerez <curroje...@riseup.net> writes:
>
> > Kenneth Graunke <kenn...@whitecape.org> writes:
> >
> >> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
> >>> Asking the DC for less than one cacheline (4 owords) of data for
> >>> uniform pull constants is suboptimal because the DC cannot request
> >>> less than that from L3, resulting in wasted bandwidth and unnecessary
> >>> message dispatch overhead, and exacerbating the IVB L3 serialization
> >>> bug.  The following table summarizes the overall framerate improvement
> >>> (with statistical significance of 5% and sample size ~10) from the
> >>> whole series up to this patch for several benchmarks and hardware
> >>> generations:
> >>>
> >>>                          |      SKL      |     BDW      |      HSW
> >>> SynMark2 OglShMapPcf     | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
> >>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
> >>> GfxBench4 gl_4           |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
> >>> Unigine Valley           |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
> >>
> >> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
> >> pressure, from the lower message lengths on pull loads.  (At least, it
> >> did when I had a series to fix that.)  That's probably a large portion
> >> of the performance improvement here, and why it's so much larger for
> >> that workload on Skylake specifically.  It might be worth mentioning it
> >> in your commit message here.
> >>
> >
> > Yeah, that matches my understanding too.  I'll add some shader-db stats
> > in order to illustrate the effect of this on register pressure, as you
> > asked me to do in your previous reply.
> >
>
> FTR, here is a summary of the effect of this series on several shader-db
> stats.  As you can see the register pressure benefit on SKL+ is
> substantial:
>
>      Lost->Gained  Total instructions           Total cycles                     Total spills           Total fills
> BWR: 5 -> 5        4571248 -> 4568342 (-0.06%)  123375740 -> 123373296 (-0.00%)  1488 -> 1488 (0.00%)   1957 -> 1957 (0.00%)
> ELK: 5 -> 5        3989020 -> 3985402 (-0.09%)  98757068 -> 98754058 (-0.00%)    1489 -> 1489 (0.00%)   1958 -> 1958 (0.00%)
> ILK: 1 -> 4        6383591 -> 6376787 (-0.11%)  143649910 -> 143648914 (-0.00%)  1449 -> 1449 (0.00%)   1921 -> 1921 (0.00%)
> SNB: 0 -> 0        7528395 -> 7501446 (-0.36%)  103503796 -> 102460370 (-1.01%)  549 -> 549 (0.00%)     52 -> 52 (0.00%)
> IVB: 13 -> 3       6949221 -> 6943317 (-0.08%)  60592262 -> 60584422 (-0.01%)    1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
> HSW: 11 -> 0       6409753 -> 6403702 (-0.09%)  60609070 -> 60604414 (-0.01%)    1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
> BDW: 12 -> 0       8043467 -> 7976364 (-0.83%)  68427730 -> 68483042 (0.08%)     1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
> CHV: 12 -> 0       8045019 -> 7977916 (-0.83%)  68297426 -> 68352756 (0.08%)     1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
> SKL: 0 -> 120      8204037 -> 7939086 (-3.23%)  66583900 -> 65624378 (-1.44%)    1269 -> 375 (-70.45%)  1563 -> 690 (-55.85%)
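
For anyone cross-checking the numbers, the percentages in the shader-db summary above appear to be plain relative changes, (after - before) / before. A minimal Python sketch that recomputes the SKL row from the quoted figures; the helper name `pct_change` is made up for illustration and is not part of shader-db:

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# SKL figures copied from the quoted shader-db summary.
skl = {
    "instructions": (8204037, 7939086),
    "cycles": (66583900, 65624378),
    "spills": (1269, 375),
    "fills": (1563, 690),
}

for name, (before, after) in skl.items():
    # Two decimals, matching the precision used in the summary.
    print(f"{name}: {pct_change(before, after):.2f}%")
```

The spill and fill counts are where the SKL register pressure benefit shows up most clearly.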

I'm a bit surprised that Gen7-8 lost SIMD16 programs.  Presumably there
are some cases where we don't need the whole cacheline worth of pulled
data, and this increased register pressure.

I suppose that could be fixed by demoting pull message return length
when the last channels aren't used.  We might want to do that later on.

--Ken
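
To make the demotion idea concrete, here is a rough Python sketch of the size calculation only. It is not Mesa code: the function name, the constants, and the assumption that a pull load may return 1, 2, or 4 owords are all hypothetical, chosen just to illustrate "shrink the return length when the trailing channels are unused". It takes the dword offsets a shader actually reads within one cacheline and picks the smallest block that still covers the last one.

```python
DWORDS_PER_OWORD = 4   # one oword is 16 bytes, i.e. four 32-bit dwords
CACHELINE_OWORDS = 4   # "one cacheline (4 owords)" per the patch description

# Assumption for this sketch: only these return lengths can be requested.
ALLOWED_OWORDS = (1, 2, 4)


def demoted_return_owords(used_dwords):
    """Smallest allowed return length, in owords, covering every used dword.

    used_dwords: dword indices (0..15) within the cacheline that are read.
    Returns 0 when nothing is read, meaning the load could be dropped.
    """
    used = sorted(set(used_dwords))
    if not used:
        return 0
    if used[0] < 0 or used[-1] >= CACHELINE_OWORDS * DWORDS_PER_OWORD:
        raise ValueError("dword index outside the cacheline")
    # The block starts at the cacheline base, so only the last used
    # dword determines how many owords have to come back.
    needed = used[-1] // DWORDS_PER_OWORD + 1
    return next(n for n in ALLOWED_OWORDS if n >= needed)


print(demoted_return_owords([0, 1]))   # only the first oword is touched
print(demoted_return_owords([0, 5]))   # spills into the second oword
print(demoted_return_owords([9]))      # third oword, rounded up to 4
```

A shader that only touches the first oword would then tie up a quarter of the registers that a full-cacheline return needs, which is the kind of case that could win back SIMD16 on Gen7-8.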
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev