I found that there is something in this PR (the last change to llvm_generator.cc) that broke the auto vectorization. I'll debug some more tomorrow.
https://github.com/apache/arrow/commit/165b02d2358e5c8c2039cf626ac7326d82e3ca90 If I undo this one patch, I can see the vectorization happen with Yibo Cai's test. *vector.ph <http://vector.ph>: ; preds = %vector.memcheck* * %n.vec = and i64 %smax, 9223372036854775800* * br label %vector.body* *vector.body: ; preds = %vector.body, %vector.ph <http://vector.ph>* * %index = phi i64 [ 0, %vector.ph <http://vector.ph> ], [ %index.next, %vector.body ]* * %3 = getelementptr i32, i32* %f0_mem6, i64 %index* * %4 = bitcast i32* %3 to <4 x i32>** * %wide.load = load <4 x i32>, <4 x i32>* %4, align 4, !alias.scope !3* * %5 = getelementptr i32, i32* %3, i64 4* * %6 = bitcast i32* %5 to <4 x i32>** * %wide.load14 = load <4 x i32>, <4 x i32>* %6, align 4, !alias.scope !3* * %7 = add nsw <4 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1>* * %8 = add nsw <4 x i32> %wide.load14, <i32 1, i32 1, i32 1, i32 1>* * %9 = getelementptr i32, i32* %add_mem5, i64 %index* * %10 = bitcast i32* %9 to <4 x i32>** * store <4 x i32> %7, <4 x i32>* %10, align 4, !alias.scope !6, !noalias !3* * %11 = getelementptr i32, i32* %9, i64 4* * %12 = bitcast i32* %11 to <4 x i32>** * store <4 x i32> %8, <4 x i32>* %12, align 4, !alias.scope !6, !noalias !3* * %index.next = add i64 %index, 8* * %13 = icmp eq i64 %index.next, %n.vec* * br i1 %13, label %middle.block, label %vector.body, !llvm.loop !8* On Wed, Dec 11, 2019 at 11:25 PM Ravindra Pindikura <ravin...@dremio.com> wrote: > I'll debug this and get back to you - I suspect some recent change broke > this functionality. > > On Wed, Dec 11, 2019 at 10:06 PM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > >> It seems that LLVM can't auto vectorize. I don't have a debug build, >> so I can't get the `-debug-only` information from llvm-opt/opt about >> why it can't vectorize. The buffer address mangling should be hoisted >> out of the loop (still doesn't enable auto vectorization) [1]. The >> buffer juggling should be moved into a separate function, and the >> useless parameters elided. >> >> [1] https://godbolt.org/z/KHdrfD >> >> François >> >> >> On Wed, Dec 11, 2019 at 5:39 AM Yibo Cai <yibo....@arm.com> wrote: >> > >> > Hi, >> > >> > I'm trying to figure out how Gandiva works by tracing unit test >> TestSimpleArichmetic[1]. >> > I met with a problem about Gandiva IR generator and optimizer, would >> like to seek for >> > help from community. >> > >> > I'm focusing on case "b+1", which adds 1 to each element of an int32 >> vector. I see >> > there's a precompiled IR for simple int32+int32 scalar sum [2]. The >> vector sum IR >> > code is generated at runtime in LLVMGenerator::CodeGenExprValue [3], >> which injects IR >> > of loops and other controls. Then the IR goes optimization pass to >> generate optimized >> > IR [4], and finally machine code and relocation. >> > >> > For the above vector sum case, I think compiler should do loop >> vectorize. According >> > optimization options aer enabled in Gandiva. But I didn't see >> vectorized code in both >> > Gandiva IR and machine code. I'm wondering what's the problem. >> > >> > Below is my debug logs. Pseudo C code [5] is the according C code >> Gandiva generates IR. >> > I pasted links to clang-7 generated assembly and IR code for >> comparison. clang generated >> > code has loop vectorization. I also dumped IR genereted by Gandiva, >> before and after >> > optimization, and Gandiva generated assembly code. They don't have loop >> vectorization. >> > >> > >> > ====Pseudo C code >> [5]========================================================== >> > int expr_0_0(int64_t *addrs, int64_t *local_bitmaps, >> > int64_t execution_context_ptr, int64_t nrecords) { >> > int *outVec = (int *) addrs[5]; >> > int *c0Vec = (int *) addrs[1]; >> > int *c1Vec = (int *) addrs[3]; >> > for (int loop_var = 0; loop_var < nrecords; ++loop_var) { >> > int c0 = c0Vec[loop_var]; >> > int c1 = c1Vec[loop_var]; >> > int out = c0 + c1; >> > outVec[loop_var] = out; >> > } >> > } >> > >> > >> > ====clang-7.0.0 generated code(has loop >> vectorize)============================= >> > assembly code: https://godbolt.org/z/_bAzSs >> > IR code: https://pastebin.com/NGc8eCMq >> > >> > >> > ====Gandiva generated IR(before >> optimize)====================================== >> > Full IR: https://pastebin.com/YEKDCNVR >> > >> > define internal i32 @add_int32_int32(i32, i32) local_unnamed_addr #0 { >> > %3 = add nsw i32 %1, %0 >> > ret i32 %3 >> > } >> > >> > define i32 @expr_0_0(i64* %args, i64* %arg_addr_offsets, i64* >> %local_bitmaps, i16* %selection_vector, i64 %context_ptr, i64 %nrecords) { >> > entry: >> > %"b+1_mem_addr" = getelementptr i64, i64* %args, i32 0 >> > %"b+1_mem" = load i64, i64* %"b+1_mem_addr" >> > %"b+1_darray" = inttoptr i64 %"b+1_mem" to i32* >> > %"b+1_mem_addr1" = getelementptr i64, i64* %args, i32 2 >> > %"b+1_mem2" = load i64, i64* %"b+1_mem_addr1" >> > %"b+1_buf_ptr" = inttoptr i64 %"b+1_mem2" to i8* >> > %"b+1_mem_addr3" = getelementptr i64, i64* %args, i32 -1 >> > %"b+1_mem4" = load i64, i64* %"b+1_mem_addr3" >> > %"b+1_oarray" = inttoptr i64 %"b+1_mem4" to i32* >> > %b_mem_addr = getelementptr i64, i64* %args, i32 3 >> > %b_mem = load i64, i64* %b_mem_addr >> > %b_darray = inttoptr i64 %b_mem to i32* >> > br label %loop >> > >> > loop: ; preds = %loop, >> %entry >> > %loop_var = phi i64 [ 0, %entry ], [ %"loop_var+1", %loop ] >> > %b_offset_addr = getelementptr i64, i64* %arg_addr_offsets, i32 3 >> > %b_addr = load i64, i64* %b_offset_addr >> > %0 = add i64 %loop_var, %b_addr >> > %1 = getelementptr i32, i32* %b_darray, i64 %0 >> > %b = load i32, i32* %1 >> > %add_int32_int32 = call i32 @add_int32_int32(i32 %b, i32 1) >> > %2 = getelementptr i32, i32* %"b+1_darray", i64 %loop_var >> > store i32 %add_int32_int32, i32* %2 >> > %"loop_var+1" = add i64 %loop_var, 1 >> > %"loop_var < nrec" = icmp slt i64 %"loop_var+1", %nrecords >> > br i1 %"loop_var < nrec", label %loop, label %exit >> > >> > exit: ; preds = %loop >> > ret i32 0 >> > } >> > >> > >> > ====Gandiva generated IR(after >> optimize)======================================= >> > Full IR: https://pastebin.com/a0CRS0Ud >> > >> > define i32 @expr_0_0(i64* nocapture readonly %args, i64* nocapture >> readonly %arg_addr_offsets, i64* nocapture readnone %local_bitmaps, i16* >> nocapture readnone %selection_vector, i64 %context_ptr, i64 %nrecords) >> local_unnamed_addr #0 { >> > entry: >> > %0 = bitcast i64* %args to i32** >> > %"b+1_mem5" = load i32*, i32** %0, align 8 >> > %b_mem_addr = getelementptr i64, i64* %args, i64 3 >> > %1 = bitcast i64* %b_mem_addr to i32** >> > %b_mem6 = load i32*, i32** %1, align 8 >> > %b_offset_addr = getelementptr i64, i64* %arg_addr_offsets, i64 3 >> > br label %loop >> > >> > loop: ; preds = %loop, >> %entry >> > %loop_var = phi i64 [ 0, %entry ], [ %"loop_var+1", %loop ] >> > %b_addr = load i64, i64* %b_offset_addr, align 8 >> > %2 = add i64 %b_addr, %loop_var >> > %3 = getelementptr i32, i32* %b_mem6, i64 %2 >> > %b = load i32, i32* %3, align 4 >> > %4 = add nsw i32 %b, 1 >> > %5 = getelementptr i32, i32* %"b+1_mem5", i64 %loop_var >> > store i32 %4, i32* %5, align 4 >> > %"loop_var+1" = add nuw nsw i64 %loop_var, 1 >> > %"loop_var < nrec" = icmp slt i64 %"loop_var+1", %nrecords >> > br i1 %"loop_var < nrec", label %loop, label %exit >> > >> > exit: ; preds = %loop >> > ret i32 0 >> > } >> > >> > >> > ====Gandiva generated assembly code (no loop >> vectorize)======================== >> > Dump of assembler code for function expr_0_0: >> > 0x00007ffff7ff4000 <+0>: mov (%rdi),%rax >> > 0x00007ffff7ff4003 <+3>: mov 0x18(%rdi),%rcx >> > 0x00007ffff7ff4007 <+7>: xor %edx,%edx >> > 0x00007ffff7ff4009 <+9>: nop >> > 0x00007ffff7ff400a <+10>: nop >> > 0x00007ffff7ff400b <+11>: nop >> > 0x00007ffff7ff400c <+12>: nop >> > 0x00007ffff7ff400d <+13>: nop >> > 0x00007ffff7ff400e <+14>: nop >> > 0x00007ffff7ff400f <+15>: nop >> > 0x00007ffff7ff4010 <+16>: mov 0x18(%rsi),%rdi >> > 0x00007ffff7ff4014 <+20>: add %rdx,%rdi >> > 0x00007ffff7ff4017 <+23>: mov (%rcx,%rdi,4),%edi >> > 0x00007ffff7ff401a <+26>: inc %edi >> > 0x00007ffff7ff401c <+28>: mov %edi,(%rax,%rdx,4) >> > 0x00007ffff7ff401f <+31>: inc %rdx >> > 0x00007ffff7ff4022 <+34>: cmp %r9,%rdx >> > 0x00007ffff7ff4025 <+37>: jl 0x7ffff7ff4010 <expr_0_0+16> >> > 0x00007ffff7ff4027 <+39>: xor %eax,%eax >> > 0x00007ffff7ff4029 <+41>: retq >> > >> > >> > [1] >> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/tests/literal_test.cc#L42 >> > [2] >> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/arithmetic_ops.cc#L65 >> > [3] >> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/llvm_generator.cc#L247 >> > [4] >> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/engine.cc#L207-L216 >> > [5] >> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/llvm_generator.cc#L204-L215 >> > > > -- > Thanks and regards, > Ravindra. > -- Thanks and regards, Ravindra.