2010YOUY01 commented on code in PR #11943:
URL: https://github.com/apache/datafusion/pull/11943#discussion_r1726804168
##########
datafusion/physical-plan/src/aggregates/group_values/row.rs:
##########
@@ -121,16 +135,31 @@ impl GroupValues for GroupValuesRows {
         create_hashes(cols, &self.random_state, batch_hashes)?;
 
         for (row, &target_hash) in batch_hashes.iter().enumerate() {
+            let mode = self.mode;
             let entry = self.map.get_mut(target_hash, |(exist_hash, group_idx)| {
                 // Somewhat surprisingly, this closure can be called even if the
                 // hash doesn't match, so check the hash first with an integer
                 // comparison to avoid the more expensive comparison with the
                 // group value. https://github.com/apache/datafusion/pull/11718
-                target_hash == *exist_hash
-                    // verify that the group that we are inserting with hash is
-                    // actually the same key value as the group in
-                    // existing_idx (aka group_values @ row)
-                    && group_rows.row(row) == group_values.row(*group_idx)
+                if target_hash != *exist_hash {
+                    return false;
+                }
+
+                // verify that the group that we are inserting with hash is
+                // actually the same key value as the group in
+                // existing_idx (aka group_values @ row)
+                match mode {
+                    GroupStatesMode::Flat => {
+                        group_rows.row(row)
+                            == group_values.current().unwrap().row(*group_idx)
+                    }
+                    GroupStatesMode::Blocked(_) => {
+                        let blocked_index = BlockedGroupIndex::new(*group_idx);
+                        group_rows.row(row)
+                            == group_values[blocked_index.block_id]
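
For context on the Blocked arm above (the diff is truncated here): a plausible shape for BlockedGroupIndex is a packed index whose high bits select a block and whose low bits select a row within that block. A minimal sketch, assuming a 32/32-bit split of the index; the PR's actual layout may differ:

// Hypothetical sketch of a packed blocked group index. The field names
// mirror the diff above, but the 32/32-bit split is an assumption.
struct BlockedGroupIndex {
    block_id: usize,     // which block the group's row lives in
    block_offset: usize, // row offset within that block
}

impl BlockedGroupIndex {
    fn new(group_idx: usize) -> Self {
        let raw = group_idx as u64;
        Self {
            // High 32 bits: block id.
            block_id: (raw >> 32) as usize,
            // Low 32 bits: offset inside the block.
            block_offset: (raw & 0xFFFF_FFFF) as usize,
        }
    }
}

In Flat mode the whole index is a row number into a single buffer; splitting it this way would let Blocked mode spread group values across multiple fixed-size blocks instead of one ever-growing buffer.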
Review Comment:
My basic understanding of a vectorized HT is that we build a hash table that
can operate on a batch of values, with each implementation step also processing
a batch at a time. This is going to be efficient because the CPU can run very
fast in simple loops (thanks to deep instruction pipelining and caching).
A regular vs. vectorized hash table lookup looks like this:
Regular:    hash 1 key -> find initial slot for 1 key -> linear probing for 1 key -> ...
Vectorized: hash 8192 keys -> find initial slots for 8192 keys -> linear probing for 8192 keys -> ...
The regular hash table approach inevitably causes branch mispredictions and
cache misses within each single-key operation, which slows things down. The
vectorized version's execution consists of several tight loops (each iterating
thousands of times), so the CPU can run at full speed inside each loop; see the
sketch below.
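
A minimal sketch of the two loop structures over a toy open-addressed table.
All names here (Table, hash_one, initial_slot, linear_probe) are hypothetical,
not DataFusion APIs:

// Toy open-addressed table: None = empty slot, Some(key) = occupied.
struct Table {
    slots: Vec<Option<u64>>,
}

fn hash_one(key: u64) -> u64 {
    // Toy multiplicative hash; a real table would use a proper hasher.
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn initial_slot(table: &Table, hash: u64) -> usize {
    (hash as usize) % table.slots.len()
}

// Walk forward from the initial slot until the key or an empty slot is
// found. Assumes the table is never completely full.
fn linear_probe(table: &Table, mut slot: usize, key: u64) -> Option<usize> {
    loop {
        match table.slots[slot] {
            Some(k) if k == key => return Some(slot),
            None => return None,
            _ => slot = (slot + 1) % table.slots.len(),
        }
    }
}

// Regular: each key runs the whole pipeline before the next key starts,
// so branches and cache misses from different stages interleave.
fn probe_regular(table: &Table, keys: &[u64]) -> Vec<Option<usize>> {
    keys.iter()
        .map(|&key| {
            let hash = hash_one(key);             // hash 1 key
            let slot = initial_slot(table, hash); // initial slot for 1 key
            linear_probe(table, slot, key)        // linear probing for 1 key
        })
        .collect()
}

// Vectorized: each stage is its own tight loop over the whole batch
// (e.g. 8192 keys), keeping the instruction pipeline and caches hot.
fn probe_vectorized(table: &Table, keys: &[u64]) -> Vec<Option<usize>> {
    let hashes: Vec<u64> = keys.iter().map(|&k| hash_one(k)).collect();
    let slots: Vec<usize> =
        hashes.iter().map(|&h| initial_slot(table, h)).collect();
    keys.iter()
        .zip(&slots)
        .map(|(&k, &s)| linear_probe(table, s, k))
        .collect()
}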
The aggregation already vectorizes the first step -- hashing (create_hashes in
the diff above); the vectorized HT approach intends to continue vectorizing the
following steps. It would be a very complex algorithm, though, basically
writing a batched hash table from scratch, and I haven't thought through the
specific algorithm either.
#7095 quotes a paper section describing the high-level idea of Photon's
algorithm.