Hi,
On 21-04-16 17:09, Samuel Pitoiset wrote:
On 04/21/2016 04:46 PM, Hans de Goede wrote:
Hi,
On 21-04-16 16:28, Ilia Mirkin wrote:
On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdego...@redhat.com>
wrote:
combineLd/St will combine, e.g.:
st u32 # g[$r2+0x0] $r2
st u32 # g[$r2+0x4] $r3
into:
st u64 # g[$r2+0x0] $r2d
But this is only valid if r2 contains an 8-byte-aligned address, which is unknown.
This commit checks that src0 dim 0 is not indirect when combining loads/stores, since combining indirect loads/stores may break alignment rules.
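To spell out the condition being discussed, here is a minimal sketch (a hypothetical helper of mine, not the actual nv50_ir pass): two adjacent 4-byte stores may only become one 8-byte store when the merged access is 8-byte aligned, and with an indirect base the only alignment we can prove is the worst case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the merge condition, not the actual mesa code.
 * Two 4-byte stores at base+off0 and base+off1 may merge into one 8-byte
 * store only when the offsets are adjacent and the merged access is
 * 8-byte aligned. base_align is what we can prove about the base address;
 * for an indirect base of unknown alignment that is only 4. */
static bool can_merge_st32(uint32_t base_align, uint32_t off0, uint32_t off1)
{
    if (off1 != off0 + 4)              /* stores must be adjacent */
        return false;
    if (off0 % 8 != 0)                 /* merged store starts at off0 */
        return false;
    return base_align % 8 == 0;        /* base must be 8-byte aligned too */
}
```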
I believe the assumption is that all indirect addresses are 16-byte
aligned. This works out for GL, I think. Although hm... I wonder what
happens if you have a
layout (std430) buffer foo {
int x[16];
}
And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
of size-based validation which might avoid the problem.
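To make that std430 case concrete (the arithmetic sketch is mine): with int x[16] tightly packed, &x[i] is base + 4*i, so a merged 8-byte access covering x[i] and x[i+1] is misaligned whenever i is odd, regardless of the 16-byte alignment of the buffer base.

```c
#include <stdint.h>

/* Sketch: byte alignment (mod 8) of a merged 8-byte access covering
 * x[i] and x[i+1] in a tightly packed std430 int array, assuming the
 * buffer base itself is 16-byte aligned. */
static uint32_t merged_access_mod8(uint32_t i)
{
    const uint32_t base = 0;       /* stands in for any 16-byte aligned base */
    return (base + 4 * i) % 8;     /* 0 = aligned, 4 = unaligned trap */
}
```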
My concern is that now constbufs will get the same treatment, and for
constbufs the alignment is always 16 :(
What do you think? Just drop those, or add extra conditionals to allow
it for constbufs?
I'm not sure we have the alignment guarantee for constbufs. IIRC we lower
const buf accesses to be indirect because we want to provide more than 8
UBOs, right? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
with e.g.:
Right. This is because the launch descriptor used for compute shaders on Kepler
only allows setting up 8 CBs. But OpenGL requires at least 14 UBOs, so the logic
is to stick the UBOs' information into the driver constant buffer.
As you can see, we do this dance for all UBOs because it's simpler than testing
whether a UBO has been described in the launch descriptor or not (i.e. whether
it's mapped as c1[], c2[], etc).
The lowering pass should properly handle indirect UBO accesses (I did write a
piglit test for that and looked at the blob). But I'm not sure whether we can
break alignment here.
Do you have a simple shader that might hit the issue?
I'm definitely hitting the issue with opencl programs,
specifically with:
piglit/tests/cl/program/execute/get-num-groups.cl
Which contains:
kernel void fill3d(global int* out) {
unsigned int id = get_global_id(0) + get_global_size(0)*get_global_id(1);
out[3*id] = get_num_groups(0);
out[3*id+1] = get_num_groups(1);
out[3*id+2] = get_num_groups(2);
}
Notice the 3 * id: we end up combining
get_num_groups(0) and get_num_groups(1)
into a single 64-bit store, which for
(id % 2 == 1) results in an unaligned trap
on the GPU.
Interestingly enough this is the only piglit cl
test which triggers this, but it is still a real
problem AFAICT.
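The arithmetic behind that trap (my sketch): out[3*id] lives at byte offset 12*id, and 12*id mod 8 is 4 for every odd id, so the merged 8-byte store covering out[3*id] and out[3*id+1] is only aligned for even ids.

```c
#include <stdint.h>

/* fill3d writes out[3*id], out[3*id+1], out[3*id+2]. The first store's
 * byte offset is 12*id; merging the first two int stores into one 8-byte
 * store is only safe when that offset is a multiple of 8. */
static uint32_t fill3d_first_store_mod8(uint32_t id)
{
    return (12 * id) % 8;          /* 0 when id is even, 4 when id is odd */
}
```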
Note this gets translated into:
COMP
DCL SV[0], BLOCK_ID
DCL SV[1], BLOCK_SIZE
DCL SV[2], GRID_SIZE
DCL SV[3], THREAD_ID
DCL MEMORY[0], GLOBAL
DCL MEMORY[1], SHARED
DCL MEMORY[2], PRIVATE
DCL MEMORY[3], INPUT
IMM[0] UINT32 {2, 0, 0, 0}
IMM[1] UINT32 {0, 0, 0, 0}
0: BGNSUB :0
1: UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
2: UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
3: SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
4: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
5: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
6: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
7: RET
8: ENDSUB
IMM[2] UINT32 {3, 0, 0, 0}
IMM[3] UINT32 {4, 0, 0, 0}
9: BGNSUB :0
10: UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
11: UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
12: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
13: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
14: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
15: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
16: SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
17: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
18: UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
19: STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
20: OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
21: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
22: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
23: RET
24: ENDSUB
IMM[4] UINT32 {12, 0, 0, 0}
IMM[5] UINT32 {8, 0, 0, 0}
25: BGNSUB :0
26: UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
27: UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
28: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
29: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
30: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
31: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
32: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
33: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
34: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
35: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
36: UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
37: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
38: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
39: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
40: UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
41: STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
42: UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
43: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
44: RET
45: ENDSUB
The SUB beginning at 25: corresponds to
the troublesome fill3d function.
With my fix the generated code for this is:
0: rdsv u32 $r0 sv[CTAID:2] (8)
1: rdsv u32 $r1 sv[TID:2] (8)
2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
3: ld u64 $r0d c7[0xf0] (8)
4: rdsv u32 $r3 sv[CTAID:1] (8)
5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
6: rdsv u32 $r3 sv[TID:1] (8)
7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
8: ld u32 $r3 c7[0xec] (8)
9: rdsv u32 $r4 sv[CTAID:0] (8)
10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
11: rdsv u32 $r4 sv[TID:0] (8)
12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
13: mov u32 $r4 0x0000000c (8)
14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
15: st u32 # g[$r2+0x0] $r3 (8)
16: add u32 $r3 $r2 0x00000004 (8)
17: st u32 # g[$r2+0x4] $r0 (8)
18: add u32 $r0 $r2 0x00000008 (8)
19: st u32 # g[$r2+0x8] $r1 (8)
20: ret (8)
Notice that this code also seems to hit
another bug: instructions 16 and 18
got folded into the "st" instructions as offsets,
but they did not get deleted.
Any clues on where to start looking for
the root cause of that are welcome.
###
As for Ilia's suggestion to not disallow
combining of indirect loads for constbufs:
given the discussion that seems sensible,
at least for UBOs. For OpenCL, the
input parameters may end up being indirectly
accessed in an unaligned manner too.
Regards,
Hans
ld u64 r2d c7[r1+0x0]
Where r1 contains the offset of the user-buf. But what if the user is
somehow indirectly accessing the userbuf? Then we will have added that
indirect offset to r1, and we can no longer assume that we can safely
merge the loads without breaking alignment rules.
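The underlying rule here (my sketch, not actual driver code): once an offset with weaker alignment is added into r1, the provable alignment of the sum drops to the weaker of the two, so the 16-byte alignment of the UBO base no longer helps.

```c
#include <stdint.h>

/* Sketch: the alignment we can prove for (a + b), when a is known to be a
 * multiple of align_a and b a multiple of align_b (both powers of two),
 * is only the smaller of the two. Folding a 4-byte-aligned indirect user
 * offset into a 16-byte-aligned base leaves a provable alignment of 4,
 * which is not enough for a merged u64 access. */
static uint32_t provable_align(uint32_t align_a, uint32_t align_b)
{
    return align_a < align_b ? align_a : align_b;
}
```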
I hope I'm making sense here; I'm still a bit unsure about the details of
how this all works.
Regards,
Hans
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev