waynexia commented on PR #15324:
URL: https://github.com/apache/datafusion/pull/15324#issuecomment-2750114917

   I just realized (and am very excited about it!) that we may have 
overcomplicated things: we specialize on array types to compute hashes and store 
the values, but we need neither a dedicated hash function (wrapped as the xxx set 
types in the previous implementation) nor the original values. The 
`count(distinct)` accumulator only needs to do two things: compute hashes and 
maintain a hash set of them.
   
   So I tried another way to rewrite this aggregator: use one uniform 
accumulator for all types. Each update does a single dispatch to select the 
actual hash implementation (this could be eliminated by making the accumulator 
generic over a type parameter). The original values are discarded and only the 
hashes are stored in the state. This not only saves memory but also performs 
considerably better:
   
   | Query                  | Before (ms) | After (ms) |
   |------------------------|-------------|------------|
   | Q0                     | 1046.3      | 430.4      |
   | Q1                     | 243.7       | 200.5      |
   | Q2                     | 441.8       | 327.9      |
   | Count + Count Distinct | 11828       | 515        |
   
   p.s. I switched to a different machine to run these benchmarks.
   p.p.s. I didn't use `bench.sh compare` because, judging from its help text, 
it doesn't seem to support selecting individual test cases.
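
The idea described above (one accumulator for all types, one dispatch per batch, only hashes kept in the state) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: `Column`, `HashOnlyDistinctCount`, and `hash_one` are hypothetical names, and `Column` is a stand-in for an Arrow array.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for an Arrow array; not a DataFusion or Arrow type.
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

// Hash a single value to a u64. The value itself is not retained.
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Hypothetical accumulator: the state is only a set of hashes.
#[derive(Default)]
struct HashOnlyDistinctCount {
    hashes: HashSet<u64>,
}

impl HashOnlyDistinctCount {
    // One dispatch per batch selects the concrete element type.
    fn update_batch(&mut self, column: &Column) {
        match column {
            Column::Int64(values) => self.insert_all(values),
            Column::Utf8(values) => self.insert_all(values),
        }
    }

    // Nulls are skipped; each non-null value contributes only its hash.
    fn insert_all<T: Hash>(&mut self, values: &[Option<T>]) {
        for value in values.iter().flatten() {
            self.hashes.insert(hash_one(value));
        }
    }

    // Merging partial states is a union of the hash sets.
    fn merge(&mut self, other: &Self) {
        self.hashes.extend(other.hashes.iter().copied());
    }

    // The distinct count is the number of distinct hashes seen.
    fn evaluate(&self) -> usize {
        self.hashes.len()
    }
}
```

Because only 64-bit hashes are stored, two different values that happen to collide would be counted once in this sketch.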
   
   Some follow-up items:
   - Add a type parameter to the new general accumulator's implementation, if 
needed (considering our compile time is already quite slow, one dispatch per 
array seems acceptable)
   - Use `RawTable` to further optimize the state and avoid hashing the `u64` 
hash values a second time
   - Maybe remove `PrimitiveDistinctCountAccumulator` and similar 
implementations? We no longer use them after this patch, but they are part of 
our public API
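
To illustrate the second item (not hashing the `u64` hash values again), here is a small sketch. Note that it does not use `RawTable`; it swaps in a different, standard-library-only technique with the same goal: a pass-through `Hasher` that returns the stored `u64` unchanged. `PassThroughHasher`, `PrehashedSet`, and `distinct_prehashed` are hypothetical names, and the approach assumes the incoming hashes are already well distributed.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical hasher that uses an already-computed u64 hash as-is.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    // Fallback for keys that are not u64; not expected to be hit here.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }

    // u64 keys are hashed through write_u64, so the value passes through.
    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

// A set of pre-computed hashes that does not re-hash its keys.
type PrehashedSet = HashSet<u64, BuildHasherDefault<PassThroughHasher>>;

// Count distinct pre-computed hashes.
fn distinct_prehashed(hashes: &[u64]) -> usize {
    let mut set = PrehashedSet::default();
    set.extend(hashes.iter().copied());
    set.len()
}
```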


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


