crepererum commented on issue #13433:
URL: https://github.com/apache/datafusion/issues/13433#issuecomment-2527472455

   TBH the last two batches are rather hard:
   
   ## `hashbrown` 0.14 & allocation size
   `hashbrown` 0.14 doesn't expose the allocation size for `HashTable`, which we would need here:
   
   
https://github.com/apache/datafusion/blob/47569b21c50eab42771ca62fd237f429362e8a62/datafusion/physical-plan/src/joins/stream_join_utils.rs#L152-L158
   
   `hashbrown` 0.15 [offers that](https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.allocation_size), but then we have to roll the `RawTable`->`HashTable` change together with the upgrade, not as separate steps.
   
   ## `ArrowHashTable`
   This is a bit of a wild interface and I dunno if it's worth the 15% uplift TBH:
   
   
https://github.com/apache/datafusion/blob/47569b21c50eab42771ca62fd237f429362e8a62/datafusion/physical-plan/src/aggregates/topk/hash_table.rs#L63-L86
   
   The issue here is that `HashTable` doesn't offer an unsafe "directly address the bucket via index" kind of interface because it is fundamentally rather risky. I wonder if we should somewhat redesign this part. As far as I understand, this uses the following concepts:
   
   1. **data heap:** a "heap" (which I guess is just a vector) to store the actual payload data.
   2. **mutable slot:** a slot that stores a mutable index into the _data heap_. I guess this is used to update the data pointer whenever a new/better value is found. The slots are referenced by a simple `usize` index.
   3. **key lookup:** a way to look up key->_mutable slot_.
   
   The current solution fuses 2 & 3 into a single `RawTable` w/ a LOT of unsafe. I think we could deconstruct that into a `HashTable` (for 3) + `Vec` (for 2).
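
   A rough sketch of that split, to make the three pieces concrete. This is not the DataFusion code: `SlotMap`, `upsert`, and `get_by_slot` are made-up names, "better" is assumed to mean "smaller", and `std::collections::HashMap` stands in for `hashbrown::HashTable` so the snippet compiles without extra crates.

```rust
use std::collections::HashMap;

/// Hypothetical illustration of key -> mutable slot -> data heap.
struct SlotMap {
    /// (1) data heap: the payload values; here just a plain Vec.
    heap: Vec<i64>,
    /// (2) mutable slots: each entry is an index into `heap`,
    /// and each slot is itself addressed by a plain `usize`.
    slots: Vec<usize>,
    /// (3) key lookup: key -> slot index.
    /// Assumption: std HashMap is used in place of hashbrown's HashTable.
    lookup: HashMap<String, usize>,
}

impl SlotMap {
    fn new() -> Self {
        SlotMap {
            heap: Vec::new(),
            slots: Vec::new(),
            lookup: HashMap::new(),
        }
    }

    /// Insert `key`, or repoint its slot if `value` is better (smaller).
    /// Returns the slot index, which stays stable for a given key.
    fn upsert(&mut self, key: &str, value: i64) -> usize {
        if let Some(&slot) = self.lookup.get(key) {
            if value < self.heap[self.slots[slot]] {
                // Toy behaviour: append the new payload and repoint the slot.
                // The old payload is simply left behind in `heap`.
                self.heap.push(value);
                self.slots[slot] = self.heap.len() - 1;
            }
            slot
        } else {
            self.heap.push(value);
            self.slots.push(self.heap.len() - 1);
            let slot = self.slots.len() - 1;
            self.lookup.insert(key.to_string(), slot);
            slot
        }
    }

    /// Safe "address by index" access: slot -> heap index -> payload.
    fn get_by_slot(&self, slot: usize) -> Option<i64> {
        self.slots.get(slot).map(|&heap_idx| self.heap[heap_idx])
    }
}
```

   The point is only that the index-based access goes through a bounds-checked `Vec`, so the hash table never has to hand out raw bucket addresses.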


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

