Re: [PR] Add H2O.ai Database-like Ops benchmark to dfbench (join support) [datafusion]

via GitHub Wed, 26 Feb 2025 23:55:35 -0800


zhuqi-lucas commented on PR #14902:
URL: https://github.com/apache/datafusion/pull/14902#issuecomment-2687169996


   Thanks @2010YOUY01 @SemyonSinchenko for the review. I tried again and it is 
no longer a problem for me. The earlier failure was probably because my disk was 
full; I have since freed up some space.
   
   
   
   
   > Thank you for the benchmark, I've tested it locally and it's working well. I have several small suggestions:
   > 
   > 1. Add document for this new join benchmark https://github.com/apache/datafusion/tree/main/benchmarks
   > 2. I remember other benchmarks like `TPCH` will display average time, it would be great to include it.
   > 
   > > Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
   > > Query 1 iteration 1 took 38.1 ms and returned 900 rows
   > > Query 1 iteration 2 took 3.3 ms and returned 900 rows
   > > Query 1 iteration 3 took 2.1 ms and returned 900 rows
   > 
   > https://github.com/apache/datafusion/blob/a28f2834c6969a0c0eb26165031f8baa1e1156a5/benchmarks/src/tpch/run.rs#L166-L167
   > 
   > Regarding the previous Q5 OOM issue, I've tried and it seems consume very small memory now (the following command is using the command generated by `./bench.sh run h2o_big_join` and append `--query 5`)
   > 
   > ```shell
   > /usr/bin/time -l cargo run --release --bin dfbench -- h2o --iterations 1 
--join-paths 
/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv
 --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql 
-o /Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json 
--query 5
   >     Finished `release` profile [optimized] target(s) in 0.10s
   >      Running `/Users/yongting/Code/datafusion/target/release/dfbench h2o 
--iterations 1 --join-paths 
/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv
 --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql 
-o /Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json 
--query 5`
   > Running benchmarks with the following options: RunOpt { query: Some(5), 
common: CommonOpt { iterations: 1, partitions: None, batch_size: 8192, 
mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, 
debug: false }, queries_path: 
"/Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql", path: 
"benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: 
"/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv",
 output_path: 
Some("/Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json")
 }
   > Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 
as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, 
large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 
FROM x JOIN large ON x.id3 = large.id3;
   > Query 5 iteration 1 took 47010.3 ms and returned 906 rows
   >        47.21 real       152.57 user        46.83 sys
   >            153337856  maximum resident set size
   >                    0  average shared memory size
   >                    0  average unshared data size
   >                    0  average unshared stack size
   >                16753  page reclaims
   >                  917  page faults
   >                    0  swaps
   >                    0  block input operations
   >                    0  block output operations
   >                    0  messages sent
   >                    0  messages received
   >                    0  signals received
   >              3279460  voluntary context switches
   >              1401644  involuntary context switches
   >        3271486751853  instructions retired
   >         770247407285  cycles elapsed
   >            140297632  peak memory footprint
   > ```
   
   
   This is a good suggestion.
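
A minimal sketch of what reporting the average could look like, assuming the per-iteration timings for a query are collected as milliseconds in a slice. The helper names `average_ms` and `format_average` and the exact wording of the summary line are hypothetical illustrations, not the existing dfbench or TPCH runner code:

```rust
/// Hypothetical helper: mean of the per-iteration timings, in milliseconds.
/// Returns `None` when no iterations were recorded, to avoid dividing by zero.
fn average_ms(iteration_ms: &[f64]) -> Option<f64> {
    if iteration_ms.is_empty() {
        return None;
    }
    Some(iteration_ms.iter().sum::<f64>() / iteration_ms.len() as f64)
}

/// Hypothetical helper: builds a one-line summary for a query.
/// The output format is an assumption chosen for illustration.
fn format_average(query_id: usize, iteration_ms: &[f64]) -> Option<String> {
    average_ms(iteration_ms).map(|avg| format!("Query {query_id} avg time: {avg:.2} ms"))
}
```

With the three Q1 timings quoted above (38.1 ms, 3.3 ms, 2.1 ms), `format_average(1, &[38.1, 3.3, 2.1])` would produce a line reporting an average of 14.50 ms. Note that the first iteration dominates that average, so it may be worth running enough iterations for the number to be meaningful.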


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


