waynexia commented on PR #15324: URL: https://github.com/apache/datafusion/pull/15324#issuecomment-2750114917
I (am very excited!) just realized we may have overcomplicated things: we specialize on array types to compute hashes and store the values, but we need neither a dedicated hash function (wrapped as an xxx set in the previous implementation) nor the original values. The `count(distinct)` accumulator only needs to do two things: compute hashes and maintain a hash set.

So I tried rewriting this aggregator another way, using one uniform accumulator for all types. It does one dispatch per update to select the actual hash implementation (this dispatch can be eliminated by extracting a type parameter for the accumulator). It discards the original values and stores only the hashes in its state. This not only saves memory but also improves performance:

| Query | Before | After |
|-------|--------|-------|
| Q0 | 1046.3 ms | 430.4 ms |
| Q1 | 243.7 ms | 200.5 ms |
| Q2 | 441.8 ms | 327.9 ms |
| Count + Count Distinct | 11.828 seconds | 0.515 seconds |

p.s. I switched to a different machine to run these.

p.p.s. I didn't use `bench.sh compare` because, judging from its help text, it doesn't seem to support selecting a test case.

Some follow-up things:

- Make a type parameter for the new general accumulator's implementation, if needed (considering our compile time is already quite slow, one dispatch per array seems acceptable)
- Use `RawTable` to further optimize the state and avoid hashing the `u64` hash values a second time
- Maybe remove `PrimitiveDistinctCountAccumulator` and similar implementations? They are not used by us after this patch, but they are part of our public API

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
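The hash-only approach described in the comment can be sketched roughly as follows. This is a minimal illustration written against the Rust standard library only, not the code in the PR: `Column`, `HashOnlyDistinctCount`, and `hash_one` are hypothetical names, `Column` is a stand-in for a typed Arrow array, and `DefaultHasher` is assumed in place of whatever hash function the patch actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a typed Arrow array; `None` models a null slot.
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// Hypothetical accumulator: the state holds only 64-bit hashes,
/// never the original values.
#[derive(Default)]
struct HashOnlyDistinctCount {
    hashes: HashSet<u64>,
}

/// Hash a single value. `DefaultHasher::new()` uses fixed keys, so equal
/// values hash identically across accumulators, which `merge` relies on.
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl HashOnlyDistinctCount {
    /// One type dispatch per batch, then a monomorphized loop per element.
    fn update_batch(&mut self, column: &Column) {
        match column {
            Column::Int64(values) => self.insert_all(values),
            Column::Utf8(values) => self.insert_all(values),
        }
    }

    /// Insert the hash of every non-null value; nulls are skipped.
    fn insert_all<T: Hash>(&mut self, values: &[Option<T>]) {
        for value in values.iter().flatten() {
            // Note: `HashSet<u64>` hashes the `u64` again internally. That is
            // the extra hash the `RawTable` follow-up item aims to avoid.
            self.hashes.insert(hash_one(value));
        }
    }

    /// Merge the partial state of another accumulator.
    fn merge(&mut self, other: &Self) {
        self.hashes.extend(other.hashes.iter().copied());
    }

    /// Number of distinct hashes seen so far.
    fn evaluate(&self) -> usize {
        self.hashes.len()
    }
}
```

One consequence of this design is that two distinct values whose 64-bit hashes collide would be counted once, so the sketch trades exactness in that rare case for a smaller state.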
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org