Hi,
On 21-04-16 17:09, Samuel Pitoiset wrote:
On 04/21/2016 04:46 PM, Hans de Goede wrote:
Hi,
On 21-04-16 16:28, Ilia Mirkin wrote:
On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdego...@redhat.com>
wrote:
combineLd/St will combine, e.g.:
st u32 # g[$r2+0x0] $r2
st u32 # g[$r2+0x4] $r3
into:
st u64 # g[$r2+0x0] $r2d
But this is only valid if r2 contains an 8-byte-aligned address, which is unknown.
This commit checks that src0 dim 0 is not indirect when combining loads/stores, since combining indirect loads/stores may break alignment rules.
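To spell out the condition being discussed, here is a minimal sketch (a hypothetical helper of mine, not the actual nv50_ir pass): two adjacent 4-byte stores may only become one 8-byte store when the merged access is 8-byte aligned, and with an indirect base the only alignment we can prove is the worst case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the merge condition, not the actual mesa code.
 * Two 4-byte stores at base+off0 and base+off1 may merge into one 8-byte
 * store only when the offsets are adjacent and the merged access is
 * 8-byte aligned. base_align is what we can prove about the base address;
 * for an indirect base of unknown alignment that is only 4. */
static bool can_merge_st32(uint32_t base_align, uint32_t off0, uint32_t off1)
{
    if (off1 != off0 + 4)              /* stores must be adjacent */
        return false;
    if (off0 % 8 != 0)                 /* merged store starts at off0 */
        return false;
    return base_align % 8 == 0;        /* base must be 8-byte aligned too */
}
```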
I believe the assumption is that all indirect addresses are 16-byte
aligned. This works out for GL, I think. Although hm... I wonder what
happens if you have a
layout (std430) buffer foo {
int x[16];
}
And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
of size-based validation which might avoid the problem.
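To make that std430 case concrete (the arithmetic sketch is mine): with int x[16] tightly packed, &x[i] is base + 4*i, so a merged 8-byte access covering x[i] and x[i+1] is misaligned whenever i is odd, regardless of the 16-byte alignment of the buffer base.

```c
#include <stdint.h>

/* Sketch: byte alignment (mod 8) of a merged 8-byte access covering
 * x[i] and x[i+1] in a tightly packed std430 int array, assuming the
 * buffer base itself is 16-byte aligned. */
static uint32_t merged_access_mod8(uint32_t i)
{
    const uint32_t base = 0;       /* stands in for any 16-byte aligned base */
    return (base + 4 * i) % 8;     /* 0 = aligned, 4 = unaligned trap */
}
```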
My concern is that now constbufs will get the same treatment, and for
constbufs the alignment is always 16 :(
What do you think? Just drop those, or add extra conditionals to allow
it for constbufs?
I'm not sure we have the alignment guarantee for constbufs. IIRC we lower
const buf accesses to be indirect because we want to provide more than 8
UBOs, right? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
with e.g.:
Right. This is because the launch descriptor used for compute shaders on Kepler
only allows setting up 8 CBs. But OpenGL requires at least 14 UBOs, so the logic
is to stick the UBOs' information into the driver constant buffer.
As you can see, we do this dance for all UBOs because it's simpler than testing
whether a UBO has been described in the launch descriptor or not (i.e. whether
it's mapped as c1[], c2[], etc).
The lowering pass should properly handle indirect UBO accesses (I did write a
piglit test for that and looked at the blob). But I'm not sure whether we can
break alignment here.
Do you have a simple shader that might hit the issue?
I'm definitely hitting the issue with opencl programs,
specifically with:
piglit/tests/cl/program/execute/get-num-groups.cl
Which contains:
kernel void fill3d(global int* out) {
unsigned int id = get_global_id(0) + get_global_size(0)*get_global_id(1);
out[3*id] = get_num_groups(0);
out[3*id+1] = get_num_groups(1);
out[3*id+2] = get_num_groups(2);
}
Notice the 3 * id: we end up combining
get_num_groups(0) and get_num_groups(1)
into a single 64-bit store, which for
(id % 2 == 1) results in an unaligned trap
on the GPU.
Interestingly enough this is the only piglit cl
test which triggers this, but it is still a real
problem AFAICT.
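The arithmetic behind that trap (my sketch): out[3*id] lives at byte offset 12*id, and 12*id mod 8 is 4 for every odd id, so the merged 8-byte store covering out[3*id] and out[3*id+1] is only aligned for even ids.

```c
#include <stdint.h>

/* fill3d writes out[3*id], out[3*id+1], out[3*id+2]. The first store's
 * byte offset is 12*id; merging the first two int stores into one 8-byte
 * store is only safe when that offset is a multiple of 8. */
static uint32_t fill3d_first_store_mod8(uint32_t id)
{
    return (12 * id) % 8;          /* 0 when id is even, 4 when id is odd */
}
```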
Note this gets translated into:
COMP
DCL SV[0], BLOCK_ID
DCL SV[1], BLOCK_SIZE
DCL SV[2], GRID_SIZE
DCL SV[3], THREAD_ID
DCL MEMORY[0], GLOBAL
DCL MEMORY[1], SHARED
DCL MEMORY[2], PRIVATE
DCL MEMORY[3], INPUT
IMM[0] UINT32 {2, 0, 0, 0}
IMM[1] UINT32 {0, 0, 0, 0}
0: BGNSUB :0
1: UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
2: UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
3: SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
4: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
5: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
6: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
7: RET
8: ENDSUB
IMM[2] UINT32 {3, 0, 0, 0}
IMM[3] UINT32 {4, 0, 0, 0}
9: BGNSUB :0
10: UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
11: UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
12: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
13: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
14: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
15: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
16: SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
17: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
18: UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
19: STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
20: OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
21: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
22: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
23: RET
24: ENDSUB
IMM[4] UINT32 {12, 0, 0, 0}
IMM[5] UINT32 {8, 0, 0, 0}
25: BGNSUB :0
26: UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
27: UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
28: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
29: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
30: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
31: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
32: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
33: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
34: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
35: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
36: UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
37: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
38: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
39: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
40: UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
41: STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
42: UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
43: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
44: RET
45: ENDSUB
The SUB beginning at 25: corresponds to
the troublesome fill3d function.
With my fix the generated code for this is:
0: rdsv u32 $r0 sv[CTAID:2] (8)
1: rdsv u32 $r1 sv[TID:2] (8)
2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
3: ld u64 $r0d c7[0xf0] (8)
4: rdsv u32 $r3 sv[CTAID:1] (8)
5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
6: rdsv u32 $r3 sv[TID:1] (8)
7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
8: ld u32 $r3 c7[0xec] (8)
9: rdsv u32 $r4 sv[CTAID:0] (8)
10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
11: rdsv u32 $r4 sv[TID:0] (8)
12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
13: mov u32 $r4 0x0000000c (8)
14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
15: st u32 # g[$r2+0x0] $r3 (8)
16: add u32 $r3 $r2 0x00000004 (8)
17: st u32 # g[$r2+0x4] $r0 (8)
18: add u32 $r0 $r2 0x00000008 (8)
19: st u32 # g[$r2+0x8] $r1 (8)
20: ret (8)
Notice that this code also seems to hit
another bug: instructions 16 and 18
got folded into the "st" instructions as offsets,
but they did not get deleted.
Any clues on where to start looking for
the root cause of that are welcome.
###
As for Ilia's suggestion to not disallow
combining of indirect loads for constbufs:
given the discussion that seems sensible,
at least for UBOs. For OpenCL, the
input parameters may end up being indirectly
accessed in an unaligned manner too.
Regards,
Hans
ld u64 r2d c7[r1+0x0]
Where r1 contains the offset of the user-buf. But what if the user is
somehow indirectly accessing the userbuf? Then we will have added that
indirect offset to r1, and we can no longer assume that we can safely
merge the loads without breaking alignment rules.
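The underlying rule here (my sketch, not actual driver code): once an offset with weaker alignment is added into r1, the provable alignment of the sum drops to the weaker of the two, so the 16-byte alignment of the UBO base no longer helps.

```c
#include <stdint.h>

/* Sketch: the alignment we can prove for (a + b), when a is known to be a
 * multiple of align_a and b a multiple of align_b (both powers of two),
 * is only the smaller of the two. Folding a 4-byte-aligned indirect user
 * offset into a 16-byte-aligned base leaves a provable alignment of 4,
 * which is not enough for a merged u64 access. */
static uint32_t provable_align(uint32_t align_a, uint32_t align_b)
{
    return align_a < align_b ? align_a : align_b;
}
```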
I hope I'm making sense here; I'm still a bit unsure about the details of
how this all works.
Regards,
Hans
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev