Dandandan commented on PR #19464: URL: https://github.com/apache/datafusion/pull/19464#issuecomment-3691207724
> Thanks @Dandandan for addressing this, I'm planning to run TPCDS query set later this week. After first glance the PR makes sense , so you keep TopK values in map[string, usize] where the value points to correspondent TopK value in `store`. > > some thoughts aloud: > > * Do you see any reason if `map` and `store` would be out of sync? > * `free_index` is to track removed items? so this points to available spot for new entry in the `store`, do you think it can be overwritten in concurrent environment and effectively we losing the spot that points to overwritten id forever? 1. I don't see why they would be out of sync, the store stays just in place (it can only grow) and the indices remain valid. In case of an implementation bug it would panic (out of bounds) or give wrong results instead of UB we had before. 2. It is to track the latest removed item before it is added again (removing the worst from the group before adding new) . The code isn't used in a concurrent way I believe. Perhaps it can be refactored a bit to return the removed id to caller, or perhaps even store the items directly in the heap instead of separate `Vec` or use the hashtable API in smarter ways. The goal in this PR is however to remove `RawTable` without doing much changes / refactoring, so I prefer to keep changes minimal as possible to achieve that goal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
