UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2637851683
##########
datafusion/common/src/config.rs:
##########
@@ -468,6 +468,25 @@ config_namespace! {
/// metadata memory consumption
pub batch_size: usize, default = 8192
+ /// A perfect hash join will be considered if the number of rows on the build
+ /// side is below this threshold. This provides a fast path for joins with
+ /// very small build sides, bypassing the density check.
+ ///
+ /// TODO: Currently only supports cases where left_side.num_rows() < u32::MAX.
+ /// Support for left_side.num_rows() >= u32::MAX will be added in the future.
+ pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+ /// The minimum required density of join keys on the build side to consider a
+ /// perfect hash join. Density is calculated as:
+ /// `(number of rows) / (max_key - min_key + 1)`.
+ /// A perfect hash join may be used if the actual key density exceeds this
+ /// value. For example, a value of 0.99 means the keys must fill at least
+ /// 99% of their value range.
+ ///
+ /// TODO: Currently only supports cases where left_side.num_rows() < u32::MAX.
+ /// Support for left_side.num_rows() >= u32::MAX will be added in the future.
+ pub perfect_hash_join_min_key_density: f64, default = 0.99
Review Comment:
I ran a quick test on performance (time taken) at different densities, and
even at a density of 0.1, ArrayMap was still faster than HashMap. (I haven't
measured memory usage yet, though.)
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query                                        ┃ density=6 ┃ density=0.1 ┃        Change ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1_density=1_prob_hit=1_25*1.5M        │   5.83 ms │     4.96 ms │ +1.18x faster │
│ QQuery 2_density=0.026_prob_hit=1_25*1.5M    │   6.56 ms │     4.58 ms │ +1.43x faster │
│ QQuery 3_density=1_prob_hit=1_100K*60M       │ 134.90 ms │    94.36 ms │ +1.43x faster │
│ QQuery 4_density=1_prob_hit=0.1_100K*60M     │ 167.25 ms │   106.93 ms │ +1.56x faster │
│ QQuery 5_density=0.75_prob_hit=1_100K*60M    │ 141.52 ms │    93.04 ms │ +1.52x faster │
│ QQuery 6_density=0.75_prob_hit=0.1_100K*60M  │ 248.02 ms │   193.13 ms │ +1.28x faster │
│ QQuery 7_density=0.5_prob_hit=1_100K*60M     │ 132.51 ms │    82.83 ms │ +1.60x faster │
│ QQuery 8_density=0.5_prob_hit=0.1_100K*60M   │ 226.32 ms │   171.06 ms │ +1.32x faster │
│ QQuery 9_density=0.2_prob_hit=1_100K*60M     │ 130.31 ms │    96.29 ms │ +1.35x faster │
│ QQuery 10_density=0.2_prob_hit=0.1_100K*60M  │ 231.85 ms │   184.09 ms │ +1.26x faster │
│ QQuery 11_density=0.1_prob_hit=1_100K*60M    │ 129.65 ms │   105.23 ms │ +1.23x faster │
│ QQuery 12_density=0.1_prob_hit=0.1_100K*60M  │ 235.93 ms │   191.80 ms │ +1.23x faster │
│ QQuery 13_density=0.01_prob_hit=1_100K*60M   │ 128.74 ms │   129.35 ms │     no change │
│ QQuery 14_density=0.01_prob_hit=0.1_100K*60M │ 230.82 ms │   233.62 ms │     no change │
└──────────────────────────────────────────────┴───────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (density=6)     │ 2150.21ms │
│ Total Time (density=0.1)   │ 1691.27ms │
│ Average Time (density=6)   │  153.59ms │
│ Average Time (density=0.1) │  120.80ms │
│ Queries Faster             │        12 │
│ Queries Slower             │         0 │
│ Queries with No Change     │         2 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘
```
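
For reference, here is a minimal sketch of how I read the two settings from the doc comments above. It is not the code in this PR: `key_density` and `consider_perfect_hash_join` are hypothetical helpers, the keys are assumed to be `i64`, and I am assuming the density comparison is inclusive (`>=`), since the doc comment says both "exceeds" and "at least".

```rust
/// Hypothetical helper (not DataFusion's implementation).
/// Density as described in the doc comment:
/// (number of rows) / (max_key - min_key + 1).
/// Returns None for an empty build side or an inverted key range.
fn key_density(num_rows: usize, min_key: i64, max_key: i64) -> Option<f64> {
    if num_rows == 0 || max_key < min_key {
        return None;
    }
    // Widen to i128 so the range cannot overflow for extreme i64 keys.
    let range = (max_key as i128) - (min_key as i128) + 1;
    Some(num_rows as f64 / range as f64)
}

/// Hypothetical decision function mirroring the two config options:
/// a small build side bypasses the density check, otherwise the key
/// density must reach `min_key_density` (assumed inclusive).
fn consider_perfect_hash_join(
    num_rows: usize,
    min_key: i64,
    max_key: i64,
    small_build_threshold: usize,
    min_key_density: f64,
) -> bool {
    // Fast path: very small build side, no density check.
    if num_rows < small_build_threshold {
        return true;
    }
    match key_density(num_rows, min_key, max_key) {
        Some(density) => density >= min_key_density,
        None => false,
    }
}
```

Under these assumptions, a 100K-row build side with keys spanning 0..=999_999 has a density of 0.1, so it would be rejected with the default of 0.99 and accepted with a threshold of 0.1. If the array holds one slot per key value in the range (also an assumption), a density of 0.1 means roughly ten slots per build row, which is why the memory measurement matters before lowering the default.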