acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2784190060
> I think the likely part of it getting slower is short-circuiting whenever a `true` is observed (having a branch for each item to check).
>
> For this case it might be interesting to compare it with `array.values().bit_chunks().iter_padded().fold(true, |acc, i: u64| i != 0 && acc)` (e.g. fully evaluate the entire array) and see if it is any faster than the `true_count` implementation.
>
> For `max_boolean` it might be interesting to see if we can make it faster by doing both, early exiting and generating more efficient code. Maybe it could evaluate it per `n` values instead of per-value to benefit from faster code / branch prediction while still exiting early enough.

I'm certain that the early-exit semantics of the `if` statement (returning as soon as the condition holds) are what cause the compiler to abandon SIMD vectorization. To test this, I wrote two simple functions that model the two scenarios and compared the optimized assembly the compiler generates for each. The full assembly output of the experiment is here: [https://godbolt.org/z/63bEvYP4d](https://godbolt.org/z/63bEvYP4d).

## Records

Rust code as follows:

```rust
/// Scenario 1: count all set bits (the number of 1s).
#[no_mangle]
#[inline(never)]
pub fn count_ones(ba: &[u8]) -> usize {
    ba.iter().map(|x| x.count_ones() as usize).sum()
}

/// Scenario 2: check whether any bit is set (early exit on the first non-zero byte).
#[no_mangle]
#[inline(never)]
pub fn find_any(ba: &[u8]) -> bool {
    ba.iter().any(|&x| x != 0)
}
```

Assembly code as follows:

```asm
count_ones:
        test    rsi, rsi
        je      .LBB0_1
        cmp     rsi, 4
        jae     .LBB0_5
        xor     ecx, ecx
        xor     eax, eax
        jmp     .LBB0_8
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_5:
        mov     rcx, rsi
        and     rcx, -4
        pxor    xmm0, xmm0
        xor     eax, eax
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm3, xmmword ptr [rip + .LCPI0_1]
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_2]
        pxor    xmm4, xmm4
        pxor    xmm1, xmm1
.LBB0_6:
        movzx   edx, word ptr [rdi + rax]
        movd    xmm7, edx
        movzx   edx, word ptr [rdi + rax + 2]
        movd    xmm6, edx
        punpcklbw xmm7, xmm0
        punpcklwd xmm7, xmm0
        punpckldq xmm7, xmm0
        movdqa  xmm8, xmm7
        psrlw   xmm8, 1
        pand    xmm8, xmm2
        psubb   xmm7, xmm8
        movdqa  xmm8, xmm7
        pand    xmm8, xmm3
        psrlw   xmm7, 2
        pand    xmm7, xmm3
        paddb   xmm7, xmm8
        movdqa  xmm8, xmm7
        psrlw   xmm8, 4
        paddb   xmm8, xmm7
        pand    xmm8, xmm5
        psadbw  xmm8, xmm0
        paddq   xmm4, xmm8
        punpcklbw xmm6, xmm0
        punpcklwd xmm6, xmm0
        punpckldq xmm6, xmm0
        movdqa  xmm7, xmm6
        psrlw   xmm7, 1
        pand    xmm7, xmm2
        psubb   xmm6, xmm7
        movdqa  xmm7, xmm6
        pand    xmm7, xmm3
        psrlw   xmm6, 2
        pand    xmm6, xmm3
        paddb   xmm6, xmm7
        movdqa  xmm7, xmm6
        psrlw   xmm7, 4
        paddb   xmm7, xmm6
        pand    xmm7, xmm5
        psadbw  xmm7, xmm0
        paddq   xmm1, xmm7
        add     rax, 4
        cmp     rcx, rax
        jne     .LBB0_6
        paddq   xmm1, xmm4
        pshufd  xmm0, xmm1, 238
        paddq   xmm0, xmm1
        movq    rax, xmm0
        cmp     rcx, rsi
        je      .LBB0_2
.LBB0_8:
        movzx   edx, byte ptr [rdi + rcx]
        imul    edx, edx, 134480385
        shr     edx, 3
        and     edx, 286331153
        imul    edx, edx, 286331153
        shr     edx, 28
        add     rax, rdx
        inc     rcx
        cmp     rsi, rcx
        jne     .LBB0_8
.LBB0_2:
        ret

find_any:
        xor     ecx, ecx
.LBB1_1:
        mov     rax, rcx
        cmp     rsi, rcx
        je      .LBB1_3
        lea     rcx, [rax + 1]
        cmp     byte ptr [rdi + rax], 0
        je      .LBB1_1
.LBB1_3:
        cmp     rsi, rax
        setne   al
        ret
```
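The assembly alone already shows the difference in code generation. To check whether it also shows up at runtime, a rough harness along these lines could be used (this is my own sketch, not part of the Godbolt experiment; it assumes `count_ones` and `find_any` from the snippet above are in scope, and an all-zero buffer is the worst case for `find_any` because it can never exit early):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // 1 MiB of zeros: `find_any` can never exit early, so both functions
    // have to scan the whole buffer on every call.
    let data = vec![0u8; 1 << 20];

    let start = Instant::now();
    let mut ones = 0usize;
    for _ in 0..1_000 {
        ones += count_ones(black_box(data.as_slice()));
    }
    println!("count_ones: {:?} (total ones = {})", start.elapsed(), ones);

    let start = Instant::now();
    let mut found = false;
    for _ in 0..1_000 {
        found |= find_any(black_box(data.as_slice()));
    }
    println!("find_any:   {:?} (found = {})", start.elapsed(), found);
}
```

For anything conclusive a proper benchmark harness would be better, but this is enough to see whether the direction of the assembly comparison holds up.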
## Conclusion

1. **Vectorization Optimization Differences**:
   - The assembly for `count_ones` uses SIMD instructions (such as `movdqa`, `psrlw`, and `paddb`), i.e. the loop is auto-vectorized. The compiler processes four bytes (32 bits) of input per iteration in XMM registers, computing the number of set bits through bit-masking and accumulation.
   - This cuts the number of loop iterations roughly by a factor of four (asymptotically it is of course still O(n)).
2. **Early Exit Hinders Optimization**:
   - `find_any` must return as soon as it encounters the first non-zero byte, which prevents the loop from being unrolled into fixed-stride SIMD operations. The assembly shows it checking one byte at a time (`cmp byte ptr [rdi + rax], 0`), so it cannot take advantage of parallel comparisons.
3. **Branch Prediction Penalty**:
   - The loop in `find_any` contains conditional jumps (e.g., `je .LBB1_1`), so every iteration carries a risk of branch misprediction. In contrast, the vectorized `count_ones` loop is branch-free straight-line work, which suits modern CPU pipelines much better.
4. **Compiler Optimization Limitations**:
   - The early-exit semantics of `any()` make the compiler conservatively fall back to the safest byte-by-byte check, whereas the semantics of `sum()` let it optimize aggressively. A block-wise middle ground is sketched below.
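That middle ground is essentially the "evaluate it per `n` values" idea from the quoted comment. A minimal sketch of it (my own illustration, not code from this PR; the name `find_any_chunked` and the block size of 64 bytes are arbitrary choices that would need benchmarking):

```rust
/// Block-wise early exit: the inner fold is branch-free, so the compiler is
/// free to vectorize it, while the outer loop still stops within one block
/// of the first non-zero byte.
#[no_mangle]
#[inline(never)]
pub fn find_any_chunked(ba: &[u8]) -> bool {
    const BLOCK: usize = 64; // arbitrary block size, worth tuning

    let mut chunks = ba.chunks_exact(BLOCK);
    for chunk in &mut chunks {
        // OR all bytes of the block together without branching.
        if chunk.iter().fold(0u8, |acc, &b| acc | b) != 0 {
            return true; // early exit, but only at block granularity
        }
    }
    // Scan whatever is left after the last full block.
    chunks.remainder().iter().any(|&b| b != 0)
}
```

Whether this actually beats plain `true_count` / full evaluation would still need to be measured on realistic data, since the win depends on how early the first set bit appears.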