Re: [Mesa-dev] r600g: status of my work on the shader optimization

Vadim Girlin Sat, 16 Feb 2013 10:24:00 -0800

On 02/16/2013 11:10 AM, Mathias Fröhlich wrote:


Hi,

On Friday, February 15, 2013 15:00:24 Vadim Girlin wrote:

"LLVM backend is the future" is a pretty abstract argument. I prefer to
operate with real facts. After a year of LLVM backend development, what
are the real benefits for the users? What are the real use cases where
the users might prefer the LLVM backend? To me this situation looks like
the use of LLVM requires a lot more time and development effort than the
custom solution, despite the initial expectations. Maybe you are right
and the LLVM backend will become the best alternative for users sometime
in the future, but I only have today's results:


I am curious how this compares for shaders like used in

git clone git://anarchy.freedesktop.org/~frohlich/PrecomputedAtmosphericScattering.git

which is one of the bigger programs I know of. That one is taken from an
INRIA paper on atmospheric scattering. The git archive is just some
stripped-down variants of that to make it at least display with our
radeon-type drivers. You need float textures enabled in the configure step.
The only variants that have a chance to run on the OSS drivers are
Main.noprecompute and Main.nogeometry, once you have compiled these dirty
proof-of-concept files.

That said, the shaders there are far from optimally programmed IMO. But they
render fast on the binary blobs. And I think they are a good example of what
you will find in the wild. People out there expect to find a decent compiler
even in an OpenGL driver. I mean a compiler that does not just translate a
tight pre-optimized shader into the appropriate backend language, but a
compiler that knows about all the tricks that are required to optimize more
complex and less well written programs. Writing good optimizers - not only the
backend ones - is a hard and lengthy business which is easily underestimated.

(@Mathias, sorry, I accidentally sent the unfinished draft of this reply as a private mail; see the added results with your program at the end.)

The problem is that LLVM knows nothing about optimization for the r600 architecture. Yes, there are tricks, but these tricks are often hardware-specific. Many tricks that are useful for other architectures are useless for r600, or even make the result worse. Many tricks that are useful for r600 are not implemented in LLVM. You have to implement them as a custom pass anyway, working with a not very convenient code representation. Another way is to simply implement what you need in any way you like without dealing with LLVM.

Also, even the most common tricks often don't work as expected without additional work.


Let me show a small example. Here is a short test shader:

uniform int a;
uniform float b;
uniform float c;
void main()
{
        vec4 t = vec4(1.0);
        float q = 0.0;
        
        for (int k = 0; k < a; ++k) {
                q = q + 0.1;
                t.x += q * sin(b * 3.0 * c + 10.0);
                t.y *= sin(b * 3.0) * cos(b * 4.0);
        }
        gl_FragColor = t;
}

It's not very optimal, because most of the expressions inside the loop do not depend on the loop variable, so we can expect some smart trick from LLVM, e.g. that it will move all loop-invariant expressions outside of the loop.
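To make the expectation concrete, here is a minimal Python sketch of the same loop, once transcribed directly and once with the loop-invariant expressions hoisted by hand. The function names are made up for illustration, and q is taken to start at 0.0, matching the "MOV R0.z, 0" in the dumps below. It only models the arithmetic, not the actual shader pipeline.

```python
import math

def shade_naive(a, b, c):
    """Direct transcription of the test shader loop; returns (t.x, t.y)."""
    tx, ty = 1.0, 1.0
    q = 0.0  # assumption: q starts at 0.0
    for _ in range(a):
        q = q + 0.1
        tx += q * math.sin(b * 3.0 * c + 10.0)
        ty *= math.sin(b * 3.0) * math.cos(b * 4.0)
    return tx, ty

def shade_hoisted(a, b, c):
    """Same computation with the loop-invariant expressions computed once."""
    s = math.sin(b * 3.0 * c + 10.0)           # depends only on uniforms
    m = math.sin(b * 3.0) * math.cos(b * 4.0)  # depends only on uniforms
    tx, ty = 1.0, 1.0
    q = 0.0
    for _ in range(a):
        q = q + 0.1  # carried between iterations, must stay in the loop
        tx += q * s
        ty *= m
    return tx, ty
```

Both functions should return the same values for any inputs; the hoisted one evaluates sin/cos three times in total instead of three times per iteration.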


Here is the code produced by LLVM backend:

--------------------------------------------------------------
bytecode 80 dw -- 3 gprs -- 4 stack entries -------
shader 11 -- E
0000 4000000A A0140000  ALU 6 @20 KC0[CB0:0-16]
 0020 000000F9 00000C90     1      MOV                      R0.x,  1.0
 0022 000000F9 20000C90            MOV                      R0.y,  1.0
 0024 000000F8 40000C90            MOV                      R0.z,  0
 0026 000000F8 60000C90            MOV                      R0.w,  0
 0028 801FA081 40400110            MUL_IEEE                 R2.z,  KC0[1].x, [0x40800000 4.000000]
 0030 40800000                                               4.000000 (1082130432)
0002 00000008 81800000  LOOP_START_DX10 @16
0004 40000010 A4040000  ALU_PUSH_BEFORE 2 @32 KC0[CB0:0-16]
 0032 81800082 00201D90     2      SETGT_INT                R1.x,  KC0[2].x, R0.w
 0034 801F00FE 00002104     3 M    PRED_SETE_INT            __.x,  PV.x, 0
0006 00000006 82800001  JUMP @12 POP:1
0008 00000007 82400000  LOOP_BREAK @14
0010 00000006 83800001  POP @12 POP:1
0012 40000012 A0400000  ALU 17 @36 KC0[CB0:0-16]
 0036 001FA081 00200110     4      MUL_IEEE                 R1.x,  KC0[1].x, [0x40400000 3.000000]
 0038 000004FD 20200C90            MOV                      R1.y,  [0x3E22F983 0.159155]
 0040 011FA800 40000010            ADD                      R0.z,  R0.z, [0x3DCCCCCD 0.100000]
 0042 801F4C00 60001A10            ADD_INT                  R0.w,  R0.w, 1
 0044 40400000                                               3.000000 (1077936128)
 0045 3E22F983                                               0.159155 (1042479491)
 0046 3DCCCCCD                                               0.100000 (1036831949)
 0048 001000FE 204300FD     5      MULADD_IEEE              R2.y,  PV.x, KC0[0].x, [0x41200000 10.000000]
 0050 001FC4FE 40200090            MUL                      R1.z,  PV.y, PV.x
 0052 810044FE 60200090            MUL                      R1.w,  PV.y, R2.z
 0054 41200000                                               10.000000 (1092616192)
 0056 009FC401 00200090     6      MUL                      R1.x,  R1.y, PV.y
 0058 800008FE 20204690            SIN                      R1.y,  PV.z
 0060 80000C01 40204710     7      COS                      R1.z,  R1.w
 0062 001FE401 20200110     8      MUL_IEEE                 R1.y,  R1.y, PS
 0064 80000001 00204690            SIN                      R1.x,  R1.x
 0066 009FC000 00000110     9      MUL_IEEE                 R0.x,  R0.x, PV.y
 0068 801FE800 20030400            MULADD_IEEE              R0.y,  R0.z, PS, R0.y
0014 00000002 81400000  LOOP_END @4
0016 00000023 A0100000  ALU 5 @70
 0070 800000F9 00200C90    10      MOV                      R1.x,  1.0
 0072 00000400 80200C90    11      MOV_sat                  R1.x,  R0.y
 0074 00000000 A0200C90            MOV_sat                  R1.y,  R0.x
 0076 800000FE C0200C90            MOV_sat                  R1.z,  PV.x
 0078 800008FE 60200C90    12      MOV                      R1.w,  PV.z
0018 C0008000 95200688  EXPORT_DONE        PIXEL 0     R1.xyzw      ES:3 EOP
--------------------------------------

I'm not sure if you are familiar with the r600 ISA, but I can explain: LLVM moved the computation of the expression "b * 4.0" outside of the loop; everything else is left inside the loop. It's obviously possible to move "sin(b * 3.0) * cos(b * 4.0)" and "sin(b * 3.0 * c + 10.0)" as well, but LLVM missed this opportunity.
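How a compiler can decide which operations are safe to move may be easier to see on a toy example. The sketch below is a simplified fixed-point invariance check over a made-up three-address form of the loop body. It is not the algorithm used by LLVM or by any r600 backend, and a real pass also has to consider dominance, control flow and side effects. All names (find_loop_invariants, t0..t7) are hypothetical.

```python
def find_loop_invariants(loop_body):
    """Return the instructions of loop_body that are loop-invariant.

    loop_body is a list of (dest, op, sources) tuples. A name that is never
    written inside the loop (a uniform or a literal) counts as invariant.
    """
    # Count how many times each name is written inside the loop.
    writes = {}
    for dest, _, _ in loop_body:
        writes[dest] = writes.get(dest, 0) + 1

    invariant_names = set()
    hoisted = []
    changed = True
    while changed:  # repeat until no new invariant instruction is found
        changed = False
        for inst in loop_body:
            dest, _, sources = inst
            if inst in hoisted or writes[dest] != 1:
                continue
            # Every source must come from outside the loop or from an
            # instruction already proven invariant. An instruction that
            # reads its own destination (q = q + 0.1) fails this test.
            if all(s not in writes or s in invariant_names for s in sources):
                invariant_names.add(dest)
                hoisted.append(inst)
                changed = True
    return hoisted

# Hypothetical three-address form of the test shader's loop body.
body = [
    ("q",  "add", ("q", "0.1")),
    ("t0", "mul", ("b", "3.0")),
    ("t1", "mad", ("t0", "c", "10.0")),
    ("t2", "sin", ("t1",)),
    ("t3", "mul", ("q", "t2")),
    ("tx", "add", ("tx", "t3")),
    ("t4", "sin", ("t0",)),
    ("t5", "mul", ("b", "4.0")),
    ("t6", "cos", ("t5",)),
    ("t7", "mul", ("t4", "t6")),
    ("ty", "mul", ("ty", "t7")),
    ("k",  "add", ("k", "1")),
]
hoisted = find_loop_invariants(body)
remaining = [inst for inst in body if inst not in hoisted]
```

On this input the check marks t0, t1, t2, t4, t5, t6 and t7 as invariant, which leaves only the updates of q, t3, tx, ty and k inside the loop.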


Here is what my branch does with the code above:

===== SHADER_START ================================== PS/JUNIPER/EVERGREEN =====
===== 72 dw ===== 2 gprs ===== 2 stack =========================================
0000  4000000a a0400000 ALU 17 @20 KC0[CB0:0-16]
 0020  001fa081 0fa00110     1      MUL_IEEE              T1.x,  KC0[1].x, [0x40400000 3].x
 0022  809fa081 2f800110            MUL_IEEE              T0.y,  KC0[1].x, [0x40800000 4].y
 0024  40400000
 0025  40800000
 0026  0010007d 0f8300fd     2      MULADD_IEEE           T0.x,  T1.x, KC0[0].x, [0x41200000 10].x
 0028  808f84fd 2f800090            MUL                   T0.y,  [0x3e22f983 0.159155].y, T0.y
 0030  41200000
 0031  3e22f983
 0032  000f80fd 0f800090     3      MUL                   T0.x,  [0x3e22f983 0.159155].x, T0.x
 0034  000fa0fd 4f840090            MUL                   T0.z,  [0x3e22f983 0.159155].x, T1.x  BS:1   VEC_021
 0036  8000047c 2f804710            COS                   T0.y,  T0.y
 0038  3e22f983
 0040  000000f9 00000c90     4      MOV                   R0.x,  1.0
 0042  8000087c 4f804690            SIN                   T0.z,  T0.z
 0044  008f887c 00200110     5      MUL_IEEE              R1.x,  T0.z, T0.y
 0046  000000f9 20000c90            MOV                   R0.y,  1.0
 0048  000000f8 40000c90            MOV                   R0.z,  0
 0050  000000f8 60000c90            MOV                   R0.w,  0
 0052  8000007c 20204690            SIN                   R1.y,  T0.x
0002  00000008 81800000 LOOP_START_DX10 @16
0004  4000001b a4040000 ALU_PUSH_BEFORE 2 @54 KC0[CB0:0-16]
 0054  81800082 4f801d90     6      SETGT_INT             T0.z,  KC0[2].x, R0.w
 0056  801f087c 00002104     7 M    PRED_SETE_INT         __.x,  T0.z, 0
0006  00000006 82800001 JUMP @12 POP:1
0008  00000007 82400000 LOOP_BREAK @14
0010  00000000 83800001 POP @0 POP:1
0012  0000001d a0100000 ALU 5 @58
 0058  801fa800 40000010     8      ADD                   R0.z,  R0.z, [0x3dcccccd 0.1].x
 0060  3dcccccd
 0062  00802800 00030000     9      MULADD_IEEE           R0.x,  R0.z, R1.y, R0.x
 0064  00002400 20000110            MUL_IEEE              R0.y,  R0.y, R1.x
 0066  801f4c00 60001a10            ADD_INT               R0.w,  R0.w, 1
0014  00000002 81400000 LOOP_END @4
0016  00000022 a0040000 ALU 2 @68
 0068  00000000 80000c90    10      MOV_sat               R0.x,  R0.x
 0070  80000400 a0000c90            MOV_sat               R0.y,  R0.y
0018  c0000000 95200b48 EXPORT_DONE        PIXEL 0    R0.xy11  EOP
===== SHADER_END ===============================================================

Most computations are now done before the loop, including the expensive SIN and COS operations. The main loop body now consists of 2 simple VLIW instructions (additions and multiplications), instead of the 6 generated by LLVM.

As you can see, LLVM is not a magic tool that does all the tricks without additional effort.

Why wasn't LLVM able to do the same? I don't know. Of course, I can spend some time investigating this, read the LLVM code, figure out the reason, and probably fix it somehow, maybe by adding a custom implementation. But I prefer to implement some simple algorithm that works as I expect from the beginning and forget about it, instead of investigating every case like the example above. If there is some special trick in LLVM that is missing in my branch, it's easier for me to implement the same trick using a simpler algorithm than to try to make it work as I want with LLVM. If you have to spend your time on this anyway, then what's the benefit of LLVM?

There is another problem in the example above: the sequence of MOV instructions after the loop.


E.g. if you have the following code:

MOV R1.x, 5
MOV R1.y, 5

LLVM "optimizes" it to:

MOV R1.x, 5
MOV R1.y, R1.x

The problem is that the original variant can be executed on r600 as a single VLIW instruction in a single cycle, while the second variant introduces a data dependency between the instructions, so they now have to be executed sequentially, requiring two cycles. Why does LLVM do that? Just because it's good for some other architecture. LLVM doesn't know that in the original code the instructions can be executed in parallel. So now you have to spend your time finding a way to disable this "optimization" that only makes the code worse for r600. To me it looks like a waste of time.
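The effect of that data dependency on instruction count can be illustrated with a toy in-order packer. This is a simplified model, not the real r600 scheduler: it ignores slot restrictions, constant-read limits and everything else except dependencies inside a bundle. The bundle width of 5 and the function name are assumptions for illustration.

```python
def count_bundles(instructions, width=5):
    """Greedily pack in-order instructions into VLIW bundles; return the count.

    instructions is a list of (dest, sources) tuples. An instruction cannot
    share a bundle with an earlier instruction whose result it reads, because
    all reads within one bundle see the register values from before it.
    """
    bundles = 0
    written = set()  # registers written by the bundle being filled
    used = 0         # slots already used in the bundle being filled
    for dest, sources in instructions:
        depends = dest in written or any(s in written for s in sources)
        if used == 0 or used == width or depends:
            # Start a new bundle.
            bundles += 1
            written = set()
            used = 0
        written.add(dest)
        used += 1
    return bundles

# The two variants of the MOV example.
original = [("R1.x", ("5",)), ("R1.y", ("5",))]
rewritten = [("R1.x", ("5",)), ("R1.y", ("R1.x",))]

original_bundles = count_bundles(original)
rewritten_bundles = count_bundles(rewritten)
```

In this model the two independent MOVs fit into one bundle, while the rewritten pair needs two because the second MOV reads the result of the first.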

I'm not saying that LLVM is completely useless. I'm sure it's very useful for other (more conventional) architectures, or when you also need some additional standalone tools and can reuse a lot of existing code. It's just not very helpful in this particular case: compilation of GL shaders for the r600 architecture.

Regarding your program, using the LIBGL_SHOW_FPS variable to compare FPS with Main.noprecompute gives me the following results on my HD5750:


R600_LLVM=0 R600_SB=0 : FPS = 144.1
R600_LLVM=1 R600_SB=0 : FPS = 288.1
R600_LLVM=1 R600_SB=1 : FPS = 518.4
R600_LLVM=0 R600_SB=1 : FPS = 527.2
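For easier comparison, the same numbers can be expressed relative to the default backend (R600_LLVM=0 R600_SB=0). A small sketch of that arithmetic, using only the four FPS values above; the variable names are arbitrary:

```python
# FPS results keyed by (R600_LLVM, R600_SB), copied from the table above.
fps = {
    (0, 0): 144.1,
    (1, 0): 288.1,
    (1, 1): 518.4,
    (0, 1): 527.2,
}

baseline = fps[(0, 0)]

# Speedup of each configuration relative to the default backend.
speedup = {cfg: value / baseline for cfg, value in fps.items()}

for (llvm, sb), factor in sorted(speedup.items()):
    print(f"R600_LLVM={llvm} R600_SB={sb} : {factor:.2f}x")
```

That is roughly 2.0x for LLVM alone, 3.6x for LLVM with SB, and about 3.7x for SB alone.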

Vadim

In the long term I would vote for your knowledge about these machines
available in llvm to get the best of both worlds.

my 2 cents...

Mathias
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

