Hi,

This patch series implements a new compilation mode that compiles shaders to hw
bytecode only once, on the assumption that any state-dependent code will be
attached at the beginning or end of the bytecode to implement emulated features
such as vertex buffer addressing, two-side color selection and interpolation,
colorbuffer format conversions, alpha-test, etc. The attachable bytecode will
be called the "prolog" and "epilog" shader parts, while the TGSI shader will be
called the "main" part.

At the end, it adds a simple TGSI->bytecode shader cache that lives in memory.


1) Design points and differences from my XDC talk

Support for the old-style shaders compiled on demand (called "monolithic",
because there is only one monolithic piece of bytecode) is kept. It can be
enabled with an environment variable, and it is enabled automatically if LLVM
is older than 3.8.
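
A minimal sketch of that selection rule. The variable name
HYPOTHETICAL_MONOLITHIC is a placeholder (this message does not name the real
one), and the LLVM version is assumed to be available as a (major, minor) pair.

```python
import os


def use_monolithic(llvm_version, env=None):
    """Return True if the old-style monolithic compilation should be used."""
    env = os.environ if env is None else env
    # Hypothetical variable name; the real one is not given in this message.
    forced = env.get("HYPOTHETICAL_MONOLITHIC", "0") == "1"
    # Tuples compare element by element, so (3, 7) < (3, 8) is True.
    too_old = tuple(llvm_version) < (3, 8)
    return forced or too_old
```

For example, use_monolithic((3, 7), env={}) selects the monolithic path, while
use_monolithic((3, 8), env={}) does not.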

Shaders keep their shader key, but now the shader key is used to generate the 
prolog and epilog parts.

The main part is compiled first. At draw time, the prolog and epilog, if they 
are needed, are compiled and all pieces of bytecode are combined. Ideally, we 
would only be doing the combining at draw time, because everything should be 
compiled already.
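
The draw-time step can be sketched roughly as below. This is a simplification
under stated assumptions: the key fields, build_variant and the compile_part
callback are hypothetical names, and the real combining step may involve more
than plain concatenation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderKey:
    # Hypothetical state bits; the real shader key has many more fields.
    two_side_color: bool = False   # would be handled by a PS prolog
    alpha_test: bool = False       # would be handled by a PS epilog


def build_variant(main_part, key, compile_part):
    """Combine an already-compiled main part with optional prolog/epilog.

    compile_part(kind, key) is a stand-in for the part compiler and is
    only called when the key says the part is needed.
    """
    prolog = compile_part("prolog", key) if key.two_side_color else b""
    epilog = compile_part("epilog", key) if key.alpha_test else b""
    # State-dependent code is attached before and after the main bytecode.
    return prolog + main_part + epilog


def fake_compile_part(kind, key):
    # Placeholder "bytecode" so the sketch can run without a compiler.
    return kind.encode()
```

With a default key nothing is attached and the main part is returned as is.
With both bits set, the result is prolog, main part and epilog in that order.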

Prologs and epilogs don't use the LLVM assembler as was planned initially. They 
share most of the code with monolithic shaders, meaning that each is compiled 
as an LLVM IR module.

The driver keeps a global per-screen list of all compiled prologs and epilogs, 
because they are all reusable.
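
A rough sketch of that per-screen reuse, assuming parts are looked up by their
kind and key. The Screen class and get_part are illustrative names, and a dict
is used here for brevity where the driver keeps a list.

```python
class Screen:
    """Stand-in for the per-screen object that owns the shared shader parts."""

    def __init__(self, compile_part):
        self._compile_part = compile_part   # hypothetical part compiler
        self._parts = {}                    # (kind, key) -> compiled part
        self.compile_count = 0              # how often we really compiled

    def get_part(self, kind, key):
        """Return a compiled prolog/epilog, compiling it on first use only."""
        part = self._parts.get((kind, key))
        if part is None:
            part = self._compile_part(kind, key)
            self.compile_count += 1
            self._parts[(kind, key)] = part
        return part
```

Two shaders that ask for the same epilog key get the same compiled part, and
the compiler runs only once for it.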

If prolog and epilog compilation turns out to be too slow, we can precompile 
some of them with llc at Mesa compile time. I don't think this will be needed 
though.

VS and TES main parts are always compiled as hardware VS at shader creation.
Hardware LS and ES stages are always compiled as monolithic shaders on demand
later, because of the lack of games using them.


2) Shader parts

VS prolog:
- vertex buffer address calculations based on instance divisors

VS epilog (hw VS only: VS & TES):
- primitive ID export if PS needs it
- in the future: ignore ClipVertex and ClipDistance outputs if clipping is 
disabled

TCS epilog:
- pack tessellation factors based on the TES primitive type

PS prolog:
- two-side color selection and interpolation
- forcing per-sample interpolation
- polygon stippling
- in the future: support BC_OPTIMIZE better, use interp_mov for flatshaded 
colors

PS epilog:
- alpha-test, alpha-to-one, smoothing, clamping, gl_FragColor broadcast
- color format conversions


3) Performance implications

VGPR usage increases: pixel shaders that used to use 4-12 VGPRs now always use
16 or even 20. This is not enough to affect the wave count though.
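
A quick check of why 16 or 20 VGPRs leave the wave count alone. The constants
are assumptions not stated in this message: 256 VGPRs per SIMD lane, at most 10
waves per SIMD, and VGPRs allocated in units of 4, which are the usual GCN
figures.

```python
def max_waves_per_simd(vgprs, total_vgprs=256, wave_cap=10, granule=4):
    """Estimate how many waves fit on one SIMD for a given VGPR count."""
    # Round the request up to the allocation granularity.
    allocated = max(granule, -(-vgprs // granule) * granule)
    # The register file limits the count, but never above the hardware cap.
    return min(wave_cap, total_vgprs // allocated)


for vgprs in (4, 12, 16, 20, 28):
    print(vgprs, "VGPRs ->", max_waves_per_simd(vgprs), "waves")
```

Under these assumptions every count up to 24 VGPRs still reaches the cap of 10
waves, and the count only starts to drop above that.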

There is slightly higher register usage because some SGPRs and VGPRs have to be 
passed from the prolog through the main part to the epilog, so the main part 
has fewer of them. This results in higher SGPR spilling, although that should 
be entirely fixable in the LLVM backend.

Relevant shader-db stats for the default scheduler:

Code Size: 11091656 -> 11219948 (1.16 %) bytes
Scratch: 1732608 -> 2246656 (29.67 %) (SGPR spilling)
Max Waves: 78063 -> 77352 (-0.91 %)

Relevant shader-db stats for the SI scheduler:

Code Size: 11433068 -> 11535452 (0.90 %) bytes
Scratch: 509952 -> 522240 (2.41 %) (SGPR spilling)
Max Waves: 79456 -> 78217 (-1.56 %)

Neither the code size nor the wave count changed much. It looks like
compiling optimized monolithic shaders in another thread won't make much
difference.
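
For reference, the percentages in the shader-db stats follow from the raw
numbers as sketched here. pct_change is just an illustrative helper, not part
of shader-db.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the stats above.
stats = {
    "default scheduler": {
        "Code Size": (11091656, 11219948),
        "Scratch": (1732608, 2246656),
        "Max Waves": (78063, 77352),
    },
    "SI scheduler": {
        "Code Size": (11433068, 11535452),
        "Scratch": (509952, 522240),
        "Max Waves": (79456, 78217),
    },
}

for scheduler, rows in stats.items():
    print(scheduler)
    for name, (before, after) in rows.items():
        print("  %s: %d -> %d (%.2f %%)"
              % (name, before, after, pct_change(before, after)))
```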

No benchmarks have been run.


4) RadeonSI shader cache in memory

The goal is to skip shader compilation for TGSI shaders that have already been
compiled by the same process. This is not a real shader cache of the kind that
proprietary drivers implement: the binaries are not stored on the disk. The
motivations are:
- Apps mix and match their vertex and pixel shaders to produce many 
combinations of linked GLSL shader programs. E.g. if one VS is matched with 20 
pixel shaders, we don't want to compile that VS 20 times. This does appear to 
happen a lot with UE3.
- If apps unload and reload shaders, this effectively makes the reload free for 
the radeonsi driver. (not so much for st/mesa)
- Gallium likes to use the same blit & pass-through shaders in several places.

This only caches the main shader parts (VS as VS, TCS, TES as VS, PS). 
Monolithic shaders including LS & ES and also GS are not cached.
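
A minimal sketch of such an in-memory cache, assuming the lookup key is the
hardware stage plus the raw TGSI tokens. MainPartCache and compile_main are
hypothetical names and do not reflect the actual data structure in the series.

```python
class MainPartCache:
    """In-memory map from (stage, TGSI tokens) to compiled main-part binary."""

    def __init__(self, compile_main):
        self._compile_main = compile_main   # stand-in for the real compiler
        self._binaries = {}
        self.hits = 0
        self.misses = 0

    def get(self, stage, tgsi_tokens):
        key = (stage, bytes(tgsi_tokens))
        binary = self._binaries.get(key)
        if binary is None:
            # First time this process sees these tokens: compile and remember.
            self.misses += 1
            binary = self._compile_main(stage, tgsi_tokens)
            self._binaries[key] = binary
        else:
            self.hits += 1
        return binary


# One VS linked with 20 different pixel shaders, as in the UE3 case above.
compiled = []
cache = MainPartCache(lambda stage, tokens: compiled.append(stage) or b"bin")
for i in range(20):
    cache.get("VS", b"vs-tokens")
    cache.get("PS", ("ps-%d" % i).encode())
print(compiled.count("VS"), "VS compile,", cache.hits, "cache hits")
```

In this run the VS is compiled once and the other 19 lookups are hits, while
each of the 20 pixel shaders is compiled once.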


5) Performance of the shader cache

The test is a short apitrace of Borderlands 2.

Without the cache:
GLSL link time = 18361 ms
Driver compile time = 14510 ms

With the cache:
GLSL link time = 12576 ms
Driver compile time = 8552 ms

This leaves a lot to be desired, but it was expected. The TGSI compilation
takes 41% less time, which suggests that roughly 41% of all TGSI shader
compilations are duplicates. On average, linking GLSL shader programs
(including the TGSI compilation) takes 31.5% less time.

The compile times are still unacceptable and caching shaders on the disk 
appears to be a necessity. A radeonsi-only cache on the disk should be 
relatively easy with the current cache in memory, but 33% of the compilation 
time is not spent in radeonsi.
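
The percentages can be checked against the timings above as follows. The last
figure assumes that "not spent in radeonsi" means link time minus driver
compile time in the cached run, which is my reading rather than something
stated here; it comes out at about 32%, close to the 33% quoted.

```python
# Timings in milliseconds, copied from the measurements above.
without_cache = {"link": 18361, "driver": 14510}
with_cache = {"link": 12576, "driver": 8552}


def saving_pct(before, after):
    """How much less time 'after' takes compared to 'before', in percent."""
    return (before - after) / before * 100.0


driver_saving = saving_pct(without_cache["driver"], with_cache["driver"])
link_saving = saving_pct(without_cache["link"], with_cache["link"])
# Share of the cached link time that is spent outside the driver compile.
outside_driver = (with_cache["link"] - with_cache["driver"]) \
    / with_cache["link"] * 100.0

print("driver compile time saved: %.1f %%" % driver_saving)
print("GLSL link time saved:      %.1f %%" % link_saving)
print("link time outside driver:  %.1f %%" % outside_driver)
```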


6) Piglit regressions

Shaders are now always compiled all the way to the bytecode by glLinkProgram.
This uncovered a few glsl_to_tgsi bugs that create invalid TGSI shaders and
trigger assertion failures in the driver.

Please review.

Marek
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
