On Thu, Aug 11, 2016 at 11:29 AM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
> On 10.08.2016 23:36, Marek Olšák wrote:
>>
>> On Wed, Aug 10, 2016 at 9:23 PM, Nicolai Hähnle <nhaeh...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> this is a respin of the series which scans the shader's TGSI to determine
>>> which channels of an array are actually written to. Most of the st/mesa
>>> changes have become unnecessary. Most of the radeon-specific part stays
>>> the same.
>>>
>>> For one F1 2015 shader, it reduces the scratch size from 132096 to 26624
>>> bytes, which is bound to be much nicer on the texture cache.
>>
>>
>> This has been bugging me... is there something we can do to move
>> temporary arrays to registers?
>>
>> F1 2015 is the only game that doesn't "spill VGPRs", yet has the
>> highest scratch usage per shader. (without this series)
>>
>> If a shader uses 32 VGPRs and a *ton* of scratch space, you know
>> something is wrong.
>
>
> We actually already do that partially: in emit_declaration, we check the
> size of the array, and if it's below a certain threshold (<= 16 currently)
> it is lowered to LLVM IR that becomes registers. In particular, that one
> shader has:
>
> Before: Shader Stats: SGPRS: 40 VGPRS: 32 Code Size: 3316 LDS: 0 Scratch:
> 132096 Max Waves: 8 Spilled SGPRs: 0 Spilled VGPRs: 0
> After: Shader Stats: SGPRS: 32 VGPRS: 60 Code Size: 3068 LDS: 0 Scratch:
> 26624 Max Waves: 4 Spilled SGPRs: 0 Spilled VGPRs: 0
>
> Looks like some of the arrays now land in VGPRs since they have become
> smaller with that series.
>
> There are still a _lot_ of weaknesses in all of this, and they mostly have
> to do with limitations that are rather deeply baked into assumptions of
> LLVM's codegen architecture.
>
> The biggest problem is that an array in VGPRs needs to be represented
> somehow in the codegen, and it is currently being represented as one of the
> VGPR vector register classes, which go up to VReg_512, i.e. 16 registers.
> Two problems with that:
>
> 1. The granularity sucks. If you have an array of 10 entries, it'll end up
> effectively using 16 registers anyway.
>
> 2. You can't go above arrays of size 16. (Though to be fair, once you reach
> that size, you should probably start worrying about VGPR pressure.)
>
> Some other issues are that
>
> 3. It should really be LLVM that decides how to lower an array, not Mesa.
> Ideally, LLVM should be able to make an intelligent decision based on the
> overall register pressure.
>
> 4. We currently don't use LDS for shaders. This was disabled because LLVM
> needs to be taught about interactions with other LDS uses, especially in
> tessellation.
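
To make items 1 and 2 of the quoted text concrete, here is a minimal sketch of the rounding effect. It is not Mesa or LLVM code. Only VReg_512 (16 registers) is named in the thread; the smaller class sizes in `VGPR_CLASS_SIZES` and the helper name `vgprs_for_array` are assumptions for illustration.

```python
# Register counts of the VGPR vector register classes.
# ASSUMPTION: only the 16-register class (VReg_512) is stated in the thread;
# the smaller sizes listed here are illustrative.
VGPR_CLASS_SIZES = (1, 2, 3, 4, 8, 16)


def vgprs_for_array(num_entries):
    """Return how many VGPRs an array of num_entries dwords would occupy
    when it is rounded up to the next register class, or None if no class
    is large enough (the array would have to live elsewhere, e.g. scratch)."""
    for size in VGPR_CLASS_SIZES:
        if num_entries <= size:
            return size
    return None


if __name__ == "__main__":
    for n in (4, 10, 16, 17):
        print(n, "entries ->", vgprs_for_array(n), "registers")
```

Under these assumptions a 10-entry array is charged 16 registers, and a 17-entry array does not fit any class at all.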
I think we should first focus on PS and CS.

Sadly, LDS is pretty small. We can spill at most 128 dwords (256 on CIK/VI) per thread, but at that point all of LDS is used and the wave count drops to 1 per CU (0.25 per SIMD), which is worse than scratch.

A more conservative approach is to allow at most 16 dwords (32 on CIK/VI) of LDS per thread. That should give us 2 waves per SIMD with zero PS inputs and no DDX/DDY, or 1-2 waves per SIMD depending on the number of PS inputs (but never 2 on all SIMDs).

Marek
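
The LDS arithmetic in the reply can be checked with a rough sketch. It assumes 64 threads per wave, 4 SIMDs per CU, and a usable LDS size of 32 KiB (64 KiB on CIK/VI); those sizes are inferred from the statement that 128 (256) dwords per thread uses all of LDS, not taken from hardware documentation. The function name and the 512-byte PS input figure are hypothetical, and other occupancy limits (VGPRs, SGPRs) are ignored.

```python
WAVE_SIZE = 64        # threads per wave (assumption)
SIMDS_PER_CU = 4      # SIMDs per compute unit (assumption)
BYTES_PER_DWORD = 4

# ASSUMPTION: usable LDS per CU, inferred from the numbers in the email.
LDS_BYTES_SI = 32 * 1024
LDS_BYTES_CIK_VI = 64 * 1024


def waves_per_cu(lds_bytes, dwords_per_thread, ps_input_bytes_per_wave=0):
    """Waves that fit in LDS when each thread reserves dwords_per_thread
    dwords and each wave additionally holds ps_input_bytes_per_wave bytes
    of PS input data in LDS."""
    per_wave = dwords_per_thread * BYTES_PER_DWORD * WAVE_SIZE
    per_wave += ps_input_bytes_per_wave
    return lds_bytes // per_wave


if __name__ == "__main__":
    # Maximum spill size: a single wave per CU.
    print(waves_per_cu(LDS_BYTES_SI, 128))      # 1
    print(waves_per_cu(LDS_BYTES_CIK_VI, 256))  # 1
    # Conservative budget, no PS inputs: 8 waves per CU, i.e. 2 per SIMD.
    print(waves_per_cu(LDS_BYTES_SI, 16) / SIMDS_PER_CU)      # 2.0
    print(waves_per_cu(LDS_BYTES_CIK_VI, 32) / SIMDS_PER_CU)  # 2.0
    # Hypothetical 512 bytes of PS inputs per wave: 7 waves per CU,
    # so not every SIMD can hold 2 waves.
    print(waves_per_cu(LDS_BYTES_SI, 16, 512))  # 7
```

With any nonzero PS input footprint the per-CU count falls below 8, which matches the "1-2 waves per SIMD, but never 2 on all SIMDs" estimate.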