crepererum commented on issue #13433: URL: https://github.com/apache/datafusion/issues/13433#issuecomment-2527472455
TBH the last two batches are rather hard:

## `hashbrown` 0.14 & allocation size

`hashbrown` 0.14 doesn't expose the allocation size for `HashTable`, which we would need here:

https://github.com/apache/datafusion/blob/47569b21c50eab42771ca62fd237f429362e8a62/datafusion/physical-plan/src/joins/stream_join_utils.rs#L152-L158

`hashbrown` 0.15 [offers that](https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.allocation_size), but then we have to roll the `RawTable`->`HashTable` change together with the upgrade, not as separate steps.

## `ArrowHashTable`

This is a bit of a wild interface and I dunno if it's worth the 15% uplift TBH:

https://github.com/apache/datafusion/blob/47569b21c50eab42771ca62fd237f429362e8a62/datafusion/physical-plan/src/aggregates/topk/hash_table.rs#L63-L86

The issue here is that `HashTable` doesn't offer an unsafe "directly address the bucket via index" kind of interface, because that is fundamentally rather risky. I wonder if we should somewhat redesign this part. As far as I understand, this uses the following concepts:

1. **data heap:** a "heap" (which I guess is just a vector) to store the actual payload data
2. **mutable slot:** a slot that stores a mutable index into the _data heap_. I guess this is used to update the data pointer whenever a new/better value is found. The slots are referenced by a simple `usize` index.
3. **key lookup:** a way to look up key->_mutable slot_.

The current solution fuses 2 & 3 into a single `RawTable` w/ a LOT of unsafe. I think we could deconstruct that into a `HashTable` (for 3) + `Vec` (for 2).
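To make the proposed split a bit more concrete, here is a minimal sketch of the three concepts as separate pieces. It is not the DataFusion implementation: all type and method names (`SlotTable`, `HeapEntry`, `upsert`, `swap_heap`) are hypothetical, the payload is a plain `i64`, and `std::collections::HashMap` stands in for `hashbrown::HashTable` so the snippet compiles without extra dependencies.

```rust
use std::collections::HashMap;

/// One entry in the data heap: the payload plus a back-reference to its slot.
struct HeapEntry {
    value: i64,
    slot: usize,
}

/// Hypothetical container; every name here is illustrative only.
struct SlotTable {
    /// (3) key lookup: key -> slot index (stand-in for `hashbrown::HashTable`).
    lookup: HashMap<String, usize>,
    /// (2) mutable slots: slot index -> current position in `heap`.
    slots: Vec<usize>,
    /// (1) data heap: a plain vector of payload entries.
    heap: Vec<HeapEntry>,
}

impl SlotTable {
    fn new() -> Self {
        Self { lookup: HashMap::new(), slots: Vec::new(), heap: Vec::new() }
    }

    /// Returns the slot for `key`, creating a heap entry and a slot if the key is new.
    /// For an existing key, the smaller ("better") value is kept.
    fn upsert(&mut self, key: &str, value: i64) -> usize {
        if let Some(&slot) = self.lookup.get(key) {
            let entry = &mut self.heap[self.slots[slot]];
            entry.value = entry.value.min(value);
            return slot;
        }
        let slot = self.slots.len();
        self.slots.push(self.heap.len());
        self.heap.push(HeapEntry { value, slot });
        self.lookup.insert(key.to_owned(), slot);
        slot
    }

    /// Swaps two heap entries and repoints their slots.
    /// Only safe `Vec` indexing is involved; the key lookup is not touched.
    fn swap_heap(&mut self, a: usize, b: usize) {
        self.heap.swap(a, b);
        let (slot_a, slot_b) = (self.heap[a].slot, self.heap[b].slot);
        self.slots[slot_a] = a;
        self.slots[slot_b] = b;
    }

    /// Resolves key -> slot -> heap position -> payload.
    fn get(&self, key: &str) -> Option<i64> {
        let slot = *self.lookup.get(key)?;
        Some(self.heap[self.slots[slot]].value)
    }
}
```

The point of the extra indirection is that the heap can reorder its entries by only rewriting `slots`, so nothing needs to address a hash-table bucket by raw index. Whether the additional `Vec` hop costs measurable performance compared to the fused `RawTable` would have to be benchmarked.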
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org