2010YOUY01 commented on PR #14902: URL: https://github.com/apache/datafusion/pull/14902#issuecomment-2687165121
Thank you for the benchmark. I've tested it locally and it works well. I have a few small suggestions:

1. Add documentation for this new join benchmark to https://github.com/apache/datafusion/tree/main/benchmarks
2. I remember that other benchmarks such as `TPCH` display the average time; it would be great to include it here as well.

> Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
> Query 1 iteration 1 took 38.1 ms and returned 900 rows
> Query 1 iteration 2 took 3.3 ms and returned 900 rows
> Query 1 iteration 3 took 2.1 ms and returned 900 rows

https://github.com/apache/datafusion/blob/a28f2834c6969a0c0eb26165031f8baa1e1156a5/benchmarks/src/tpch/run.rs#L166-L167

Regarding the previous Q5 OOM issue, I tried it and it seems to consume very little memory now (the command below is the one generated by `./bench.sh run h2o_big_join`, with `--query 5` appended):

```sh
/usr/bin/time -l cargo run --release --bin dfbench -- h2o --iterations 1 --join-paths /Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql -o /Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json --query 5
    Finished `release` profile [optimized] target(s) in 0.10s
     Running `/Users/yongting/Code/datafusion/target/release/dfbench h2o --iterations 1 --join-paths /Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql -o /Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json --query 5`
Running benchmarks with the following options: RunOpt { query: Some(5), common: CommonOpt { iterations: 1, partitions: None, batch_size: 8192, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, queries_path: "/Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql", path: "benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: "/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv", output_path: Some("/Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json") }
Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
Query 5 iteration 1 took 47010.3 ms and returned 906 rows
       47.21 real       152.57 user        46.83 sys
       153337856  maximum resident set size
               0  average shared memory size
               0  average unshared data size
               0  average unshared stack size
           16753  page reclaims
             917  page faults
               0  swaps
               0  block input operations
               0  block output operations
               0  messages sent
               0  messages received
               0  signals received
         3279460  voluntary context switches
         1401644  involuntary context switches
   3271486751853  instructions retired
    770247407285  cycles elapsed
       140297632  peak memory footprint
```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
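For suggestion 2 in the comment above (reporting an average time per query), a rough sketch of the arithmetic might look like the following. This is only an illustration: `average_ms` and the printed message format are hypothetical names chosen here, not code taken from `dfbench` or from the linked `tpch/run.rs`, and the input values are the three Query 1 timings quoted in the comment.

```rust
/// Hypothetical helper (not part of dfbench): arithmetic mean of the
/// per-iteration wall-clock times, in milliseconds.
/// Returns `None` for an empty slice so the caller never divides by zero.
fn average_ms(times_ms: &[f64]) -> Option<f64> {
    if times_ms.is_empty() {
        return None;
    }
    let total: f64 = times_ms.iter().sum();
    Some(total / times_ms.len() as f64)
}

fn main() {
    // Query 1 iteration times as reported in the quoted benchmark output.
    let q1_ms = [38.1, 3.3, 2.1];

    // The message format below is an assumption made for this sketch.
    match average_ms(&q1_ms) {
        Some(avg) => println!("Query 1 avg time: {avg:.2} ms"),
        None => println!("Query 1 produced no timings"),
    }
}
```

With the quoted timings the mean is 14.5 ms, which is dominated by the 38.1 ms first iteration, so an average is most informative when read next to the per-iteration lines rather than in place of them.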