acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2784190060
> I think the likely part of it getting slower is short-circuiting whenever a `true` is observed (having a branch for each item to check).
>
> For this case it might be interesting to compare it with `array.values().bit_chunks().iter_padded().fold(true, |acc, i: u64| i != 0 && acc)` (e.g. fully evaluate the entire array) and see if it is any faster than the `true_count` implementation.
>
> For `max_boolean` it might be interesting to see if we can make it faster by doing both, early exiting and generating more efficient code. Maybe it could evaluate it per `n` values instead of per-value to benefit from faster code / branch prediction while still exiting early enough.

I'm certain that the early-exit semantics of the `if` statement (returning as soon as the condition holds) are what cause the compiler to abandon SIMD vectorization. To test this, I wrote two simple functions that model the two scenarios and compared the optimized assembly the compiler generates for each. The full assembly output of the experiment is here: [https://godbolt.org/z/63bEvYP4d](https://godbolt.org/z/63bEvYP4d).

## Records

Rust code as follows:

```rust
/// Scenario 1: count all set bits (the number of 1s).
#[no_mangle]
#[inline(never)]
pub fn count_ones(ba: &[u8]) -> usize {
    ba.iter().map(|x| x.count_ones() as usize).sum()
}

/// Scenario 2: check whether any bit is set (early exit on the first non-zero byte).
#[no_mangle]
#[inline(never)]
pub fn find_any(ba: &[u8]) -> bool {
    ba.iter().any(|&x| x != 0)
}
```

Assembly code as follows:

```asm
count_ones:
        test    rsi, rsi
        je      .LBB0_1
        cmp     rsi, 4
        jae     .LBB0_5
        xor     ecx, ecx
        xor     eax, eax
        jmp     .LBB0_8
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_5:
        mov     rcx, rsi
        and     rcx, -4
        pxor    xmm0, xmm0
        xor     eax, eax
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm3, xmmword ptr [rip + .LCPI0_1]
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_2]
        pxor    xmm4, xmm4
        pxor    xmm1, xmm1
.LBB0_6:
        movzx   edx, word ptr [rdi + rax]
        movd    xmm7, edx
        movzx   edx, word ptr [rdi + rax + 2]
        movd    xmm6, edx
        punpcklbw xmm7, xmm0
        punpcklwd xmm7, xmm0
        punpckldq xmm7, xmm0
        movdqa  xmm8, xmm7
        psrlw   xmm8, 1
        pand    xmm8, xmm2
        psubb   xmm7, xmm8
        movdqa  xmm8, xmm7
        pand    xmm8, xmm3
        psrlw   xmm7, 2
        pand    xmm7, xmm3
        paddb   xmm7, xmm8
        movdqa  xmm8, xmm7
        psrlw   xmm8, 4
        paddb   xmm8, xmm7
        pand    xmm8, xmm5
        psadbw  xmm8, xmm0
        paddq   xmm4, xmm8
        punpcklbw xmm6, xmm0
        punpcklwd xmm6, xmm0
        punpckldq xmm6, xmm0
        movdqa  xmm7, xmm6
        psrlw   xmm7, 1
        pand    xmm7, xmm2
        psubb   xmm6, xmm7
        movdqa  xmm7, xmm6
        pand    xmm7, xmm3
        psrlw   xmm6, 2
        pand    xmm6, xmm3
        paddb   xmm6, xmm7
        movdqa  xmm7, xmm6
        psrlw   xmm7, 4
        paddb   xmm7, xmm6
        pand    xmm7, xmm5
        psadbw  xmm7, xmm0
        paddq   xmm1, xmm7
        add     rax, 4
        cmp     rcx, rax
        jne     .LBB0_6
        paddq   xmm1, xmm4
        pshufd  xmm0, xmm1, 238
        paddq   xmm0, xmm1
        movq    rax, xmm0
        cmp     rcx, rsi
        je      .LBB0_2
.LBB0_8:
        movzx   edx, byte ptr [rdi + rcx]
        imul    edx, edx, 134480385
        shr     edx, 3
        and     edx, 286331153
        imul    edx, edx, 286331153
        shr     edx, 28
        add     rax, rdx
        inc     rcx
        cmp     rsi, rcx
        jne     .LBB0_8
.LBB0_2:
        ret

find_any:
        xor     ecx, ecx
.LBB1_1:
        mov     rax, rcx
        cmp     rsi, rcx
        je      .LBB1_3
        lea     rcx, [rax + 1]
        cmp     byte ptr [rdi + rax], 0
        je      .LBB1_1
.LBB1_3:
        cmp     rsi, rax
        setne   al
        ret
```
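The assembly alone already shows the difference in code generation. To check whether it also shows up at runtime, a rough harness along these lines could be used (this is my own sketch, not part of the Godbolt experiment; it assumes `count_ones` and `find_any` from the snippet above are in scope, and an all-zero buffer is the worst case for `find_any` because it can never exit early):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // 1 MiB of zeros: `find_any` can never exit early, so both functions
    // have to scan the whole buffer on every call.
    let data = vec![0u8; 1 << 20];

    let start = Instant::now();
    let mut ones = 0usize;
    for _ in 0..1_000 {
        ones += count_ones(black_box(data.as_slice()));
    }
    println!("count_ones: {:?} (total ones = {})", start.elapsed(), ones);

    let start = Instant::now();
    let mut found = false;
    for _ in 0..1_000 {
        found |= find_any(black_box(data.as_slice()));
    }
    println!("find_any:   {:?} (found = {})", start.elapsed(), found);
}
```

For anything conclusive a proper benchmark harness would be better, but this is enough to see whether the direction of the assembly comparison holds up.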
## Conclusion

1. **Vectorization Optimization Differences**:
   - The assembly for `count_ones` uses SIMD instructions (such as `movdqa`, `psrlw`, and `paddb`), i.e. the loop is auto-vectorized. The compiler processes four bytes (32 bits) of input per iteration in XMM registers, computing the number of set bits through bit-masking and accumulation.
   - This cuts the number of loop iterations roughly by a factor of four (asymptotically it is of course still O(n)).
2. **Early Exit Hinders Optimization**:
   - `find_any` must return as soon as it encounters the first non-zero byte, which prevents the loop from being unrolled into fixed-stride SIMD operations. The assembly shows it checking one byte at a time (`cmp byte ptr [rdi + rax], 0`), so it cannot take advantage of parallel comparisons.
3. **Branch Prediction Penalty**:
   - The loop in `find_any` contains conditional jumps (e.g., `je .LBB1_1`), so every iteration carries a risk of branch misprediction. In contrast, the vectorized `count_ones` loop is branch-free straight-line work, which suits modern CPU pipelines much better.
4. **Compiler Optimization Limitations**:
   - The early-exit semantics of `any()` make the compiler conservatively fall back to the safest byte-by-byte check, whereas the semantics of `sum()` let it optimize aggressively. A block-wise middle ground is sketched below.
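That middle ground is essentially the "evaluate it per `n` values" idea from the quoted comment. A minimal sketch of it (my own illustration, not code from this PR; the name `find_any_chunked` and the block size of 64 bytes are arbitrary choices that would need benchmarking):

```rust
/// Block-wise early exit: the inner fold is branch-free, so the compiler is
/// free to vectorize it, while the outer loop still stops within one block
/// of the first non-zero byte.
#[no_mangle]
#[inline(never)]
pub fn find_any_chunked(ba: &[u8]) -> bool {
    const BLOCK: usize = 64; // arbitrary block size, worth tuning

    let mut chunks = ba.chunks_exact(BLOCK);
    for chunk in &mut chunks {
        // OR all bytes of the block together without branching.
        if chunk.iter().fold(0u8, |acc, &b| acc | b) != 0 {
            return true; // early exit, but only at block granularity
        }
    }
    // Scan whatever is left after the last full block.
    chunks.remainder().iter().any(|&b| b != 0)
}
```

Whether this actually beats plain `true_count` / full evaluation would still need to be measured on realistic data, since the win depends on how early the first set bit appears.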