jaykang10 added a comment.
> I believe the assumption is more practical: most part of upstream llvm
> targets only support vectors with even sized number of lanes. And in those
> cases you would have to expand to a 4x vector and leave the 4th element as
> undef anyway, so it was done in the front-end to get rid of it right away.
> Probably GPU targets do some special tricks here during legalization.

I compiled the code below, which is what current clang generates for vec3, using llc with the amdgcn target.

LLVM IR (vec3 --> vec4):

define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %castToVec4 = bitcast <3 x float>* %b to <4 x float>*
  %loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
  %storetmp = bitcast <3 x float>* %a to <4 x float>*
  store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
  ret void
}

SelectionDAG after legalization:

Legalized selection DAG: BB#0 'foo:entry'
SelectionDAG has 43 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
  t9: i64 = add t2, Constant:i64<40>
  t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
  t4: i64 = add t2, Constant:i64<36>
  t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
  t12: ch = TokenFactor t6:1, t10:1
  t32: v4i32 = BUILD_VECTOR t22, t25, t27, t29
  t21: v4f32 = bitcast t32
  t18: v4i32 = bitcast t21
  t22: i32,ch = load<LD4[%castToVec4](align=16)> t12, t10, undef:i32
  t24: i32 = add t10, Constant:i32<4>
  t25: i32,ch = load<LD4[%castToVec4+4]> t12, t24, undef:i32
  t26: i32 = add t24, Constant:i32<4>
  t27: i32,ch = load<LD4[%castToVec4+8](align=8)> t12, t26, undef:i32
  t28: i32 = add t26, Constant:i32<4>
  t29: i32,ch = load<LD4[%castToVec4+12]> t12, t28, undef:i32
  t31: ch = TokenFactor t22:1, t25:1, t27:1, t29:1
  t35: i32 = extract_vector_elt t18, Constant:i32<0>
  t36: ch = store<ST4[%storetmp](align=16)> t31, t35, t6, undef:i32
  t38: i32 = extract_vector_elt t18, Constant:i32<1>
  t39: i32 = add t6, Constant:i32<4>
  t40: ch = store<ST4[%storetmp+4]> t31, t38, t39, undef:i32
  t42: i32 = extract_vector_elt t18, Constant:i32<2>
  t44: i32 = add t6, Constant:i32<8>
  t45: ch = store<ST4[%storetmp+8](align=8)> t31, t42, t44, undef:i32
  t47: i32 = extract_vector_elt t18, Constant:i32<3>
  t49: i32 = add t6, Constant:i32<12>
  t50: ch = store<ST4[%storetmp+12]> t31, t47, t49, undef:i32
  t51: ch = TokenFactor t36, t40, t45, t50
  t17: ch = ENDPGM t51

As you can see, the SelectionDAG still has four loads and four stores.

Assembly output:

    .section  .AMDGPU.config
    .long  47176
    .long  11272257
    .long  47180
    .long  132
    .long  47200
    .long  0
    .long  4
    .long  0
    .long  8
    .long  0
    .text
    .globl  foo
    .p2align  8
    .type  foo,@function
foo:                                    ; @foo
; BB#0:                                 ; %entry
    s_load_dword s2, s[0:1], 0x9
    s_load_dword s0, s[0:1], 0xa
    s_mov_b32 s4, SCRATCH_RSRC_DWORD0
    s_mov_b32 s5, SCRATCH_RSRC_DWORD1
    s_mov_b32 s6, -1
    s_mov_b32 s8, s3
    s_mov_b32 s7, 0xe8f000
    s_waitcnt lgkmcnt(0)
    v_mov_b32_e32 v0, s0
    buffer_load_dword v2, v0, s[4:7], s8 offen
    buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
    buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
    buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
    v_mov_b32_e32 v1, s2
    s_waitcnt vmcnt(0)
    buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
    buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
    buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
    buffer_store_dword v2, v1, s[4:7], s8 offen
    s_endpgm
.Lfunc_end0:
    .size  foo, .Lfunc_end0-foo

    .section  .AMDGPU.csdata

As you can see, the assembly also has four loads and four stores.
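For reference, here is a minimal sketch of the kind of C source that, after optimization, yields the vec3 --> vec4 IR above. It assumes clang's ext_vector_type extension; the typedef and the parameter names are illustrative, not taken from the review.

typedef float float3 __attribute__((ext_vector_type(3)));

/* A straight vec3 copy: clang bitcasts both pointers to <4 x float>*
   and emits one 16-byte load plus one 16-byte store, matching the
   LLVM IR shown above. */
void foo(float3 *a, const float3 *b) {
  *a = *b;
}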
I think the GPU target does not handle it specially.

> My guess here is that targets that do care are looking through the vector
> shuffle and customizing to whatever seems the best to them. If you wanna
> change the default behavior you need some data to show that your model
> solves a real issue and actually brings benefits; do you have any real
> examples on where that happens, or why GPU targets haven't yet tried to
> change this? Maybe other custom front-ends based on clang do? Finding the
> historical reason (if any) should be a good start point.

I ran "git blame" and read the commit message; you can also see it at commit c58dcdc8facb646d88675bb6fbcb5c787166c4be. It says the same thing as the comment in the clang code.

I also wonder how the vec3 --> vec4 load/store has not caused problems. As Akira's example shows, if a struct has a float3 field and the struct carries the 'packed' attribute, the widened store overwrites the next field (a byte-level sketch of this case appears at the end of this message). On the other hand, a plain vec3 load/store generates more instructions, such as extra stores and extract_vector_elts, as in the SelectionDAG below.

LLVM IR for vec3:

define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %0 = load <3 x float>, <3 x float>* %b, align 16
  store <3 x float> %0, <3 x float>* %a, align 16
  ret void
}

SelectionDAG after type legalization for amdgcn (other targets produce a similar SelectionDAG after type legalization because of the type legalizer's widen-vector handling):

Type-legalized selection DAG: BB#0 'foo:entry'
SelectionDAG has 24 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
  t4: i64 = add t2, Constant:i64<36>
  t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
  t9: i64 = add t2, Constant:i64<40>
  t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
  t12: ch = TokenFactor t6:1, t10:1
  t21: v4i32,ch = load<LD16[%b]> t12, t10, undef:i32
  t22: v2i64 = bitcast t21
  t24: i64 = extract_vector_elt t22, Constant:i32<0>
  t25: ch = store<ST8[%a](align=16)> t21:1, t24, t6, undef:i32
  t29: i32 = extract_vector_elt t21, Constant:i32<2>
  t27: i32 = add t6, Constant:i32<8>
  t30: ch = store<ST4[%a+8](align=8)> t21:1, t29, t27, undef:i32
  t33: ch = TokenFactor t25, t30
  t17: ch = ENDPGM t33

As you can see, the type legalizer handles the vec3 load/store properly: it does not write the 4th element. The vec3 load/store generates more instructions, but its behavior is correct.

I am not 100% sure whether the vec3 --> vec4 load/store is correct; the only evidence is that no one has complained about it. But if it is correct, LLVM's type legalizer (or somewhere else in LLVM's codegen) could follow the same approach to generate optimal code. As a result, I think it would be good for clang to have both features, and I would like to stick with the option '-fpreserve-vec3' so that the behavior can be changed easily.

https://reviews.llvm.org/D30810
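P.S. Below is a minimal sketch of the overwrite hazard mentioned above. It is my own illustration using clang's ext_vector_type extension, not Akira's actual example; the struct and names are hypothetical. The point is that once a float3 occupies only 12 bytes of storage, the widened 16-byte store also writes bytes 12..15.

typedef float float3 __attribute__((ext_vector_type(3)));

/* 12 bytes of vec3 data immediately followed by another field, as in a
   packed layout. */
struct buf {
  float xyz[3]; /* bytes 0..11  */
  float tail;   /* bytes 12..15 */
};

void copy_in(struct buf *b, float3 v) {
  /* Under the vec3 --> vec4 conversion this becomes a single 16-byte
     <4 x float> store, so bytes 12..15 (b->tail) are clobbered. */
  *(float3 *)b->xyz = v;
}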