zhuqi-lucas commented on PR #14902: URL: https://github.com/apache/datafusion/pull/14902#issuecomment-2685544998
I tried to reproduce https://github.com/apache/datafusion/issues/13765, but on the current main branch our join passes! The large join (Q5) takes about 50s per iteration, which is a good result! cc @alamb @2010YOUY01

```shell
./bench.sh data h2o_big_join
```

```shell
cargo run --release --bin dfbench -- h2o --mem-pool-type fair --memory-limit 16G --iterations 3 --join-paths /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_14867/h2o_join.json
Finished `release` profile [optimized] target(s) in 0.19s
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench h2o --mem-pool-type fair --memory-limit 16G --iterations 3 --join-paths /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_14867/h2o_join.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, mem_pool_type: "fair", memory_limit: Some(17179869184), sort_spill_reservation_bytes: None, debug: false }, queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql", path: "benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: "/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_14867/h2o_join.json") }
Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
Query 1 iteration 1 took 38.1 ms and returned 900 rows
Query 1 iteration 2 took 3.3 ms and returned 900 rows
Query 1 iteration 3 took 2.1 ms and returned 900 rows
Q2: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2;
Query 2 iteration 1 took 46.1 ms and returned 912 rows
Query 2 iteration 2 took 18.4 ms and returned 912 rows
Query 2 iteration 3 took 18.6 ms and returned 912 rows
Q3: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2;
Query 3 iteration 1 took 18.2 ms and returned 1000 rows
Query 3 iteration 2 took 18.5 ms and returned 1000 rows
Query 3 iteration 3 took 17.7 ms and returned 1000 rows
Q4: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5;
Query 4 iteration 1 took 17.8 ms and returned 912 rows
Query 4 iteration 2 took 18.1 ms and returned 912 rows
Query 4 iteration 3 took 17.6 ms and returned 912 rows
Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
Query 5 iteration 1 took 49496.6 ms and returned 906 rows
Query 5 iteration 2 took 49838.1 ms and returned 906 rows
Query 5 iteration 3 took 49552.0 ms and returned 906 rows
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org