jaykang10 added a comment.


> I believe the assumption is more practical: most part of upstream llvm 
> targets only support vectors with even sized number of lanes. And in those 
> cases you would have to expand to a 4x vector and leave the 4th element as 
> undef anyway, so it was done in the front-end to get rid of it right away. 
> Probably GPU targets do some special tricks here during legalization.

I compiled the code below, which current clang generates for vec3, using llc 
with the amdgcn target.
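
For reference, the kind of OpenCL/clang source that leads to this IR is roughly the 
following (a hypothetical sketch using clang's ext_vector_type extension; the original 
source is not part of this thread):

  // Hypothetical source sketch: a plain vec3 copy. Clang currently widens
  // the vec3 load/store in the body to vec4, producing the IR shown next.
  typedef float float3 __attribute__((ext_vector_type(3)));

  void foo(float3 *a, const float3 *b) {
    *a = *b;
  }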

LLVM IR (vec3 --> vec4)

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
  entry:
    %castToVec4 = bitcast <3 x float>* %b to <4 x float>*
    %loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
    %storetmp = bitcast <3 x float>* %a to <4 x float>*
    store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
    ret void
  }

SelectionDAG after legalization.

  Legalized selection DAG: BB#0 'foo:entry'
  SelectionDAG has 43 nodes:
    t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
      t9: i64 = add t2, Constant:i64<40>
    t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
      t4: i64 = add t2, Constant:i64<36>
    t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
    t12: ch = TokenFactor t6:1, t10:1
        t32: v4i32 = BUILD_VECTOR t22, t25, t27, t29
      t21: v4f32 = bitcast t32
    t18: v4i32 = bitcast t21
    t22: i32,ch = load<LD4[%castToVec4](align=16)> t12, t10, undef:i32
    t24: i32 = add t10, Constant:i32<4>
    t25: i32,ch = load<LD4[%castToVec4+4]> t12, t24, undef:i32
    t26: i32 = add t24, Constant:i32<4>
    t27: i32,ch = load<LD4[%castToVec4+8](align=8)> t12, t26, undef:i32
      t28: i32 = add t26, Constant:i32<4>
    t29: i32,ch = load<LD4[%castToVec4+12]> t12, t28, undef:i32
    t31: ch = TokenFactor t22:1, t25:1, t27:1, t29:1
          t35: i32 = extract_vector_elt t18, Constant:i32<0>
        t36: ch = store<ST4[%storetmp](align=16)> t31, t35, t6, undef:i32
          t38: i32 = extract_vector_elt t18, Constant:i32<1>
          t39: i32 = add t6, Constant:i32<4>
        t40: ch = store<ST4[%storetmp+4]> t31, t38, t39, undef:i32
          t42: i32 = extract_vector_elt t18, Constant:i32<2>
          t44: i32 = add t6, Constant:i32<8>
        t45: ch = store<ST4[%storetmp+8](align=8)> t31, t42, t44, undef:i32
          t47: i32 = extract_vector_elt t18, Constant:i32<3>
          t49: i32 = add t6, Constant:i32<12>
        t50: ch = store<ST4[%storetmp+12]> t31, t47, t49, undef:i32
      t51: ch = TokenFactor t36, t40, t45, t50
    t17: ch = ENDPGM t51

As you can see, the SelectionDAG still has 4 loads and 4 stores.

Assembly output

        .section        .AMDGPU.config
        .long   47176
        .long   11272257
        .long   47180
        .long   132
        .long   47200
        .long   0
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  foo
        .p2align        8
        .type   foo,@function
  foo:                                    ; @foo
  ; BB#0:                                 ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_load_dword s0, s[0:1], 0xa
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s0
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
        buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
        v_mov_b32_e32 v1, s2
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
        buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        s_endpgm
  .Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
  
        .section        .AMDGPU.csdata

As you can see, the assembly also has 4 loads and 4 stores, so I do not think the GPU 
target handles this case specially.

> My guess here is that targets that do care are looking through the vector 
> shuffle and customizing to whatever seems the best to them. If you wanna 
> change the default behavior you need some data to show that your model solves 
> a real issue and actually brings benefits; do you have any real examples on 
> where that happens, or why GPU targets haven't yet tried to change this? 
> Maybe other custom front-ends based on clang do? Finding the historical 
> reason (if any) should be a good start point.

I did "git blame" and I read the commit's message. You can also see it with 
"c58dcdc8facb646d88675bb6fbcb5c787166c4be". It is same with clang code's 
comment. I also wonder how the vec3 --> vec4 load/store has not caused 
problems. As Akira's example, if struct type has float3 type as one of fields 
and it has 'packed' attribute, it overwrites next field. The vec3 load/store 
generates more instructions like stores and extract_vectors like below 
SelectionDAG.
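
To make the packed-struct case concrete, here is a hypothetical sketch (not Akira's 
exact example; it assumes the packed attribute gives the float3 field a 12-byte 
layout):

  // Hypothetical sketch (not Akira's exact example). Assume the packed
  // attribute gives the float3 field a 12-byte layout, so 'w' starts
  // immediately after 'pos'.
  typedef float float3 __attribute__((ext_vector_type(3)));

  struct __attribute__((packed)) S {
    float3 pos;   // 12 bytes of payload under this assumption
    float  w;     // starts right after pos
  };

  void update(struct S *s, float3 p) {
    // If the front-end widens this vec3 store into a <4 x float> (16-byte)
    // store, the 4th lane is written over s->w.
    s->pos = p;
  }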

LLVM IR for vec3

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
  entry:
    %0 = load <3 x float>, <3 x float>* %b, align 16
    store <3 x float> %0, <3 x float>* %a, align 16
    ret void
  }

SelectionDAG after type legalization for amdgcn (other targets have a similar 
SelectionDAG after type legalization, because the type legalizer widens the vector 
store)

  Type-legalized selection DAG: BB#0 'foo:entry'
  SelectionDAG has 24 nodes:
    t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
      t4: i64 = add t2, Constant:i64<36>
    t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
      t9: i64 = add t2, Constant:i64<40>
    t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
      t12: ch = TokenFactor t6:1, t10:1
    t21: v4i32,ch = load<LD16[%b]> t12, t10, undef:i32
            t22: v2i64 = bitcast t21
          t24: i64 = extract_vector_elt t22, Constant:i32<0>
        t25: ch = store<ST8[%a](align=16)> t21:1, t24, t6, undef:i32
          t29: i32 = extract_vector_elt t21, Constant:i32<2>
          t27: i32 = add t6, Constant:i32<8>
        t30: ch = store<ST4[%a+8](align=8)> t21:1, t29, t27, undef:i32
      t33: ch = TokenFactor t25, t30
    t17: ch = ENDPGM t33

As you can see, the type legalizer handles the vec3 load/store properly: it does not 
write the 4th element. The vec3 load/store generates more instructions, but its 
behavior is correct. I am not 100% sure whether the vec3 --> vec4 load/store is 
correct, because no one has complained about it. But if the vec3 --> vec4 load/store 
is correct, LLVM's type legalizer (or somewhere else in LLVM's codegen) could follow 
the same approach to generate optimal code. As a result, I think it would be good for 
clang to support both behaviors, and I would like to stick with the option 
"-fpreserve-vec3" to switch between them easily.


https://reviews.llvm.org/D30810


