On 11.08.2016 12:57, Marek Olšák wrote:
On Thu, Aug 11, 2016 at 11:29 AM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
On 10.08.2016 23:36, Marek Olšák wrote:

On Wed, Aug 10, 2016 at 9:23 PM, Nicolai Hähnle <nhaeh...@gmail.com>
wrote:

Hi,

this is a respin of the series which scans the shader's TGSI to determine
which channels of an array are actually written to. Most of the st/mesa
changes have become unnecessary. Most of the radeon-specific part stays
the same.

For one F1 2015 shader, it reduces the scratch size from 132096 to 26624
bytes, which is bound to be much nicer on the texture cache.


This has been bugging me... is there something we can do to move
temporary arrays to registers?

F1 2015 is the only game that doesn't "spill VGPRs", yet has the
highest scratch usage per shader. (without this series)

If a shader uses 32 VGPRs and a *ton* of scratch space, you know
something is wrong.


We actually already do that partially: in emit_declaration, we check the
size of the array, and if it's at or below a certain threshold (currently 16),
it is lowered to LLVM IR that becomes registers. In particular, that one
shader has:

Before: Shader Stats: SGPRS: 40 VGPRS: 32 Code Size: 3316 LDS: 0 Scratch: 132096 Max Waves: 8 Spilled SGPRs: 0 Spilled VGPRs: 0
After:  Shader Stats: SGPRS: 32 VGPRS: 60 Code Size: 3068 LDS: 0 Scratch: 26624 Max Waves: 4 Spilled SGPRs: 0 Spilled VGPRs: 0

Looks like some of the arrays now land in VGPRs since they have become
smaller with that series.
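The decision described above can be sketched roughly like this (a schematic illustration only, not the actual emit_declaration code; the function name and return values are made up, and only the threshold of 16 comes from the text):

```python
MAX_VGPR_ARRAY_SIZE = 16  # threshold mentioned above

def lower_array(num_elements):
    """Schematic version of the choice: small arrays are lowered to
    LLVM IR values (which become registers), large ones go to memory."""
    if num_elements <= MAX_VGPR_ARRAY_SIZE:
        return "registers"
    return "scratch"

print(lower_array(10))  # registers
print(lower_array(33))  # scratch
```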

There are still a _lot_ of weaknesses in all of this, and they mostly stem
from limitations that are baked rather deeply into the assumptions of
LLVM's codegen architecture.

The biggest problem is that an array in VGPRs needs to be represented
somehow in the codegen, and it is currently being represented as one of the
VGPR vector register classes, which go up to VReg_512, i.e. 16 registers.
Two problems with that:

1. The granularity sucks. If you have an array of 10 entries, it'll end up
effectively using 16 registers anyway.

2. You can't go above arrays of size 16. (Though to be fair, once you reach
that size, you should probably start worrying about VGPR pressure.)
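To put a number on the granularity problem, here is a quick sketch. It assumes the available VGPR classes are the ones in LLVM's AMDGPU backend up to VReg_512, i.e. 1, 2, 3, 4, 8, and 16 dwords; an array gets rounded up to the next class size:

```python
# Dword sizes of the VGPR register classes up to VReg_512
# (assumed: VGPR_32, VReg_64, VReg_96, VReg_128, VReg_256, VReg_512).
VGPR_CLASSES = [1, 2, 3, 4, 8, 16]

def vgprs_for_array(num_dwords):
    """Smallest register class that can hold the array, in VGPRs."""
    for size in VGPR_CLASSES:
        if size >= num_dwords:
            return size
    raise ValueError("array too large for any VGPR register class")

# The 10-entry array from the example lands in a VReg_512,
# wasting 6 of the 16 registers.
print(vgprs_for_array(10))  # 16
```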

Some other issues are that

3. It should really be LLVM that decides how to lower an array, not Mesa.
Ideally, LLVM should be able to make an intelligent decision based on the
overall register pressure.

4. We currently don't use LDS for shaders. This was disabled because LLVM
needs to be taught about interactions with other LDS uses, especially in
tessellation.

I think we should first focus on PS and CS. Sadly, LDS is pretty
small. We can spill at most 128 dwords (256 on CIK/VI) per thread, but
at that point all LDS is used and the wave count is 1 per CU (0.25 per
SIMD), which is worse than scratch. A more conservative approach is to
have a maximum of 16 (32 on CIK/VI) dwords of LDS per thread, which
should give us 2 waves per SIMD (with zero PS inputs and no DDX/DDY),
or 1-2 waves per SIMD depending on the number of PS inputs (but never
2 on all SIMDs).
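These numbers can be reproduced with a back-of-the-envelope calculation. It assumes 64 threads per wave, 4 SIMDs per CU, and an effective LDS budget of 32 KiB per CU on SI and 64 KiB on CIK/VI (the budgets implied by the figures above; treat them as assumptions):

```python
WAVE_SIZE = 64     # threads per wave
SIMDS_PER_CU = 4

# Effective LDS budget per CU implied by the numbers in the mail
# (assumption: 32 KiB on SI, 64 KiB on CIK/VI).
LDS_BUDGET = {"SI": 32 * 1024, "CIK/VI": 64 * 1024}

def waves_per_simd(chip, dwords_per_thread):
    """LDS-limited occupancy for a given per-thread spill size."""
    lds_per_wave = dwords_per_thread * 4 * WAVE_SIZE  # bytes
    waves_per_cu = LDS_BUDGET[chip] // lds_per_wave
    return waves_per_cu / SIMDS_PER_CU

print(waves_per_simd("SI", 128))     # 0.25: one wave uses all LDS
print(waves_per_simd("SI", 16))      # 2.0
print(waves_per_simd("CIK/VI", 32))  # 2.0
```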

You're right, I got confused and did my calculation assuming the LDS was per-SIMD. You basically want an LDS use per thread (in dwords) that corresponds to a bit less than 1/8 (SI) or 1/4 (CIK/VI) of the number of VGPRs. BTW, I think VI can do DDX/DDY without actually using LDS memory (via the ds_permute instructions). Anyway, that makes LDS much less effective.
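The 1/8 and 1/4 ratios fall out of matching the LDS-limited wave count against the VGPR-limited one. A sketch, assuming 256 VGPRs per SIMD, 64 threads per wave, 4 SIMDs per CU, and an LDS budget of 32 KiB (SI) or 64 KiB (CIK/VI) per CU:

```python
VGPRS_PER_SIMD = 256
WAVE_SIZE = 64
SIMDS_PER_CU = 4
LDS_BUDGET = {"SI": 32 * 1024, "CIK/VI": 64 * 1024}  # assumed, per CU

def lds_dwords_matching_vgpr_occupancy(chip, vgprs):
    """Per-thread LDS dwords at which LDS limits occupancy exactly
    as much as the given VGPR count does."""
    vgpr_waves_per_simd = VGPRS_PER_SIMD // vgprs
    # LDS bytes per wave such that the per-CU budget supports exactly
    # vgpr_waves_per_simd waves on each of the 4 SIMDs.
    bytes_per_wave = LDS_BUDGET[chip] / (vgpr_waves_per_simd * SIMDS_PER_CU)
    return bytes_per_wave / (4 * WAVE_SIZE)

# For a 64-VGPR shader: 64/8 = 8 dwords on SI, 64/4 = 16 on CIK/VI.
print(lds_dwords_matching_vgpr_occupancy("SI", 64))      # 8.0
print(lds_dwords_matching_vgpr_occupancy("CIK/VI", 64))  # 16.0
```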

Nicolai

Marek

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
