On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
> Asking the DC for less than one cacheline (4 owords) of data for
> uniform pull constants is suboptimal because the DC cannot request
> less than that from L3, resulting in wasted bandwidth and unnecessary
> message dispatch overhead, and exacerbating the IVB L3 serialization
> bug.  The following table summarizes the overall framerate improvement
> (with statistical significance of 5% and sample size ~10) from the
> whole series up to this patch for several benchmarks and hardware
> generations:
>
>                          |      SKL      |      BDW      |      HSW
> SynMark2 OglShMapPcf     | 24.63% ±0.45% |  4.01% ±0.70% | 10.31% ±0.38%
> GfxBench4 gl_manhattan31 |  5.93% ±0.35% |  3.92% ±0.31% |  6.62% ±0.22%
> GfxBench4 gl_4           |  2.52% ±0.44% |  1.23% ±0.10% |      N/A
> Unigine Valley           |  0.83% ±0.17% |  0.23% ±0.05% |  0.74% ±0.45%
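
To make the granularity argument in the quoted description concrete, here is a rough sketch of the arithmetic, assuming a 16-byte oword and therefore a 64-byte cacheline (4 owords). The helper names cacheline_base and wasted_bytes are hypothetical and are not taken from the driver; this only illustrates the sizes involved, not how the patch implements anything.

```c
/* Illustrative only: these names are hypothetical, not Mesa code. */

#define OWORD_BYTES      16u  /* one oword = 16 bytes (assumed) */
#define CACHELINE_OWORDS 4u   /* "one cacheline (4 owords)" per the quote */
#define CACHELINE_BYTES  (OWORD_BYTES * CACHELINE_OWORDS)

/* Byte offset of the start of the cacheline that contains `offset`. */
static inline unsigned cacheline_base(unsigned offset)
{
    return offset & ~(CACHELINE_BYTES - 1u);
}

/*
 * Bytes the DC still has to fetch from L3 but does not return to the
 * shader when a message asks for `owords_requested` owords (1 to 4)
 * that all lie within a single cacheline.
 */
static inline unsigned wasted_bytes(unsigned owords_requested)
{
    return CACHELINE_BYTES - owords_requested * OWORD_BYTES;
}
```

Under these assumptions a one-oword request leaves 48 of the 64 fetched bytes unused, while a four-oword request leaves none, which is the waste the quoted text refers to.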

I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
pressure, from the lower message lengths on pull loads.  (At least, it
did when I had a series to fix that.)  That's probably a large portion
of the performance improvement here, and why it's so much larger for
that workload on Skylake specifically.  It might be worth mentioning it
in your commit message here.

Thanks for all your work on this.  I was originally concerned about the
Ivybridge bug, but given that we're loading one cacheline at a time, it
seems very unlikely that we'd ever load the same cacheline twice within
16 cycles.  We could if we have

   IF (non-uniform)
      load[foo]
   ELSE
      load[bar]

where foo and bar are indirect expressions that happen to be equal.
But that seems quite uncommon.

Series is:
Reviewed-by: Kenneth Graunke <kenn...@whitecape.org>
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev