On Thu, Apr 21, 2016 at 11:40 AM, Hans de Goede <hdego...@redhat.com> wrote: > Hi, > > > On 21-04-16 17:09, Samuel Pitoiset wrote: >> >> >> >> On 04/21/2016 04:46 PM, Hans de Goede wrote: >>> >>> Hi, >>> >>> On 21-04-16 16:28, Ilia Mirkin wrote: >>>> >>>> On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdego...@redhat.com> >>>> wrote: >>>>> >>>>> combineLd/St would combine, i.e. : >>>>> >>>>> st u32 # g[$r2+0x0] $r2 >>>>> st u32 # g[$r2+0x4] $r3 >>>>> >>>>> into: >>>>> >>>>> st u64 # g[$r2+0x0] $r2d >>>>> >>>>> But this is only valid if r2 contains an 8 byte aligned address, >>>>> which is unknown. >>>>> >>>>> This commit checks for src0 dim 0 not being indirect when combining >>>>> loads / stores as combining indirect loads / stores may break alignment >>>>> rules. >>>> >>>> >>>> I believe the assumption is that all indirect addresses are 16-byte >>>> aligned. This works out for GL, I think. Although hm... I wonder what >>>> happens if you have a >>>> >>>> layout (std430) buffer foo { >>>> int x[16]; >>>> } >>>> >>>> And you access x[i], x[i+1], and i == 1. I think we end up doing a ton >>>> of size-based validation which might avoid the problem. >>>> >>>> My concern is that now constbufs will get the same treatment, and for >>>> constbufs the alignment is always 16 :( >>>> >>>> What do you think? Just drop those, or add extra conditionals to allow >>>> it for constbufs? >>> >>> >>> I'm not sure we've the alignment guarantee for constbufs, IIRC we lower >>> const buf accesses to be indirect because we want to provide more then 8 >>> UBO-s, >>> right ? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up >>> with e.g.: >> >> >> Right. This is because the launch descriptor used for compute shaders on >> kepler only allows to set up 8 CBs. But OpenGL requires at least 14 UBOs, so >> the logic is to stick UBOs' information into the driver constant buffer. >> >> As you can, we do this dance for all UBOs because it's simpler that >> testing if an UBO has been described in the launch descriptor or not (so if >> it's mapped as c1[], c2[], etc). >> >> The lowering pass should properly handle indirect UBO accesses (I did >> write a piglit test for that and looked at blob). But I'm not sure if we can >> break alignment here. >> >> Do you have a simple shader that might hit the issue? > > > I'm definitely hitting the issue with opencl programs, > specifically with: > > piglit/tests/cl/program/execute/get-num-groups.cl > > Which contains: > > kernel void fill3d(global int* out) { > unsigned int id = get_global_id(0) + > get_global_size(0)*get_global_id(1 > out[3*id] = get_num_groups(0); > out[3*id+1] = get_num_groups(1); > out[3*id+2] = get_num_groups(2); > } > > Notice the 3 * id, we end up combining > get_num_groups(0) and get_num_groups(1) > into a single 64 bit store, which for > (id % 2 == 1) results in an unaligned trap > on the gpu. > > Interestingly enough this is the only piglet cl > test which triggers this, but still this is a real > problem AFAICT. > > Note this gets translated into: > > COMP > DCL SV[0], BLOCK_ID > DCL SV[1], BLOCK_SIZE > DCL SV[2], GRID_SIZE > DCL SV[3], THREAD_ID > DCL MEMORY[0], GLOBAL > DCL MEMORY[1], SHARED > DCL MEMORY[2], PRIVATE > DCL MEMORY[3], INPUT > > IMM[0] UINT32 {2, 0, 0, 0} > IMM[1] UINT32 {0, 0, 0, 0} > 0: BGNSUB :0 > 1: UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx > 2: UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx > 3: SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx > 4: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1] > 5: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx > 6: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx > 7: RET > 8: ENDSUB > IMM[2] UINT32 {3, 0, 0, 0} > IMM[3] UINT32 {4, 0, 0, 0} > 9: BGNSUB :0 > 10: UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy > 11: UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx > 12: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx > 13: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx > 14: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx > 15: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx > 16: SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx > 17: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1] > 18: UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx > 19: STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx > 20: OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx > 21: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx > 22: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy > 23: RET > 24: ENDSUB > IMM[4] UINT32 {12, 0, 0, 0} > IMM[5] UINT32 {8, 0, 0, 0} > 25: BGNSUB :0 > 26: UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz > 27: UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx > 28: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy > 29: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy > 30: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy > 31: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy > 32: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx > 33: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx > 34: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx > 35: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx > 36: UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx > 37: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1] > 38: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx > 39: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx > 40: UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx > 41: STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy > 42: UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx > 43: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz > 44: RET > 45: ENDSUB > > With the SUB beginning at 25: corresponding to > the troublesome fill3d function. > > With my fix the generated code for this is: > > 0: rdsv u32 $r0 sv[CTAID:2] (8) > 1: rdsv u32 $r1 sv[TID:2] (8) > 2: mad u32 $r2 $r0 c7[0xe8] $r1 (8) > 3: ld u64 $r0d c7[0xf0] (8) > 4: rdsv u32 $r3 sv[CTAID:1] (8) > 5: mad u32 $r2 $r2 c7[0xf0] $r3 (8) > 6: rdsv u32 $r3 sv[TID:1] (8) > 7: mad u32 $r2 $r2 c7[0xe4] $r3 (8) > 8: ld u32 $r3 c7[0xec] (8) > 9: rdsv u32 $r4 sv[CTAID:0] (8) > 10: mad u32 $r2 $r2 c7[0xec] $r4 (8) > 11: rdsv u32 $r4 sv[TID:0] (8) > 12: mad u32 $r2 $r2 c7[0xe0] $r4 (8) > 13: mov u32 $r4 0x0000000c (8) > 14: mad u32 $r2 $r2 $r4 c0[0x0] (8) > 15: st u32 # g[$r2+0x0] $r3 (8) > 16: add u32 $r3 $r2 0x00000004 (8) > 17: st u32 # g[$r2+0x4] $r0 (8) > 18: add u32 $r0 $r2 0x00000008 (8) > 19: st u32 # g[$r2+0x8] $r1 (8) > 20: ret (8) > > Notice that this code also seems to hit > another bug, instructions 16 and 18 > got folded into the "st" instructions as offset, > but they did not get deleted.
Something must still be using them and they don't get DCE'd. Probably a screwup somewhere? There's a special folding pass which merges such small offsets in... > > Any clues for where to start looking at > the root cause of that are welcome. > > ### > > As for Ilia's solution to not disallow > combining of indirect loads for constbufs > given the discussion that seems sensible, > at least for ubo-s, for opencl the > input parameters may end up being indirectly > accessed in an unaligned matter too. Can OpenCL end up using a UBO? If so, we should just do this for compute. i.e. allow merging of indirect constbuf loads on non-compute shaders. That might be the quickest simplest thing to do irregardless. I doubt this is such a frequent case that this merits more worrying about. -ilia _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev