[PR] fix: `collect_columns` quadratic complexity [datafusion]

via GitHub Tue, 06 Aug 2024 01:52:52 -0700


crepererum opened a new pull request, #11843:
URL: https://github.com/apache/datafusion/pull/11843


   ## Which issue does this PR close?
   \-
   
   ## Rationale for this change
   Fix accidental $O(n^2)$ in `collect_columns`.
   
   ## What changes are included in this PR?
   There are the following ways to insert a clone into a hash set:
   
   - **clone before check:** Basically `set.insert(x.clone())`. That's rather 
expensive if you have a lot of duplicates.
   - **iter & clone:** That's what we do right now, but that's in $O(n^2)$.
   - **check & insert:** Basically `if !set.contains(x) 
{set.insert(x.clone())}` That requires two hash probes though.
   - **entry-based API:** Sadly the stdlib doesn't really offer any 
handle/entry-based APIs yet (see https://github.com/rust-lang/rust/issues/60896 
), but `hashbrown` does, so we can use the nice `set.get_or_insert_owned(x)` 
which will only clone the reference if it doesn't exists yet and only hashes 
once.
   
   We now use the last approach.
   
   ## Are these changes tested?
   Functional tests still pass, performance change is only visible for large 
inputs (like 100+ columns).
   
   ## Are there any user-facing changes?
   **:warning: Breaking change:** `collect_columns` now returns 
`hashbrown::HashSet` instead of `std::collections::HashSet`!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] fix: `collect_columns` quadratic complexity [datafusion]

Reply via email to