Re: [PR] Add H2O.ai Database-like Ops benchmark to dfbench (join support) [datafusion]

via GitHub Wed, 26 Feb 2025 23:52:26 -0800


2010YOUY01 commented on PR #14902:
URL: https://github.com/apache/datafusion/pull/14902#issuecomment-2687165121


   Thank you for the benchmark. I've tested it locally and it works well. I 
have several small suggestions:
   1. Add documentation for this new join benchmark to 
https://github.com/apache/datafusion/tree/main/benchmarks
   2. As I recall, other benchmarks such as `TPCH` display the average time 
per query; it would be great to include that here too.
   > Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
   > Query 1 iteration 1 took 38.1 ms and returned 900 rows
   > Query 1 iteration 2 took 3.3 ms and returned 900 rows
   > Query 1 iteration 3 took 2.1 ms and returned 900 rows
   
   
https://github.com/apache/datafusion/blob/a28f2834c6969a0c0eb26165031f8baa1e1156a5/benchmarks/src/tpch/run.rs#L166-L167
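
A minimal sketch of suggestion 2, using the three Q1 timings quoted above. The helper name `average_ms` and the wording of the summary line are assumptions made for illustration; they are not taken from the PR or from the existing `dfbench` code.

```rust
/// Mean of per-iteration timings in milliseconds.
/// Returns `None` when there are no samples, to avoid dividing by zero.
/// (Hypothetical helper, not part of dfbench.)
fn average_ms(samples_ms: &[f64]) -> Option<f64> {
    if samples_ms.is_empty() {
        return None;
    }
    Some(samples_ms.iter().sum::<f64>() / samples_ms.len() as f64)
}

fn main() {
    // Timings copied from the Q1 output quoted in this comment.
    let q1 = [38.1, 3.3, 2.1];
    for (i, ms) in q1.iter().enumerate() {
        println!("Query 1 iteration {} took {:.1} ms", i + 1, ms);
    }
    // Summary line; the exact wording is an assumption.
    if let Some(avg) = average_ms(&q1) {
        println!("Query 1 avg time: {:.2} ms", avg);
    }
}
```

For these three runs the average is about 14.5 ms, well above iterations 2 and 3, because the first iteration is much slower than the others.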
   
   
   Regarding the previous Q5 OOM issue, I've retried it and it now seems to 
consume very little memory
   (the command below is the one generated by `./bench.sh run 
h2o_big_join`, with `--query 5` appended)
   ```sh
   /usr/bin/time -l cargo run --release --bin dfbench -- h2o --iterations 1 \
     --join-paths /Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv \
     --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql \
     -o /Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json \
     --query 5
       Finished `release` profile [optimized] target(s) in 0.10s
        Running `/Users/yongting/Code/datafusion/target/release/dfbench h2o 
--iterations 1 --join-paths 
/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv
 --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql 
-o /Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json 
--query 5`
   Running benchmarks with the following options: RunOpt { query: Some(5), 
common: CommonOpt { iterations: 1, partitions: None, batch_size: 8192, 
mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, 
debug: false }, queries_path: 
"/Users/yongting/Code/datafusion/benchmarks/queries/h2o/join.sql", path: 
"benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: 
"/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_NA_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e3_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e6_0.csv,/Users/yongting/Code/datafusion/benchmarks/data/h2o/J1_1e9_1e9_NA.csv",
 output_path: 
Some("/Users/yongting/Code/datafusion/benchmarks/results/pr-14902/h2o_join.json")
 }
   Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as 
largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 
as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN 
large ON x.id3 = large.id3;
   Query 5 iteration 1 took 47010.3 ms and returned 906 rows
          47.21 real       152.57 user        46.83 sys
              153337856  maximum resident set size
                      0  average shared memory size
                      0  average unshared data size
                      0  average unshared stack size
                  16753  page reclaims
                    917  page faults
                      0  swaps
                      0  block input operations
                      0  block output operations
                      0  messages sent
                      0  messages received
                      0  signals received
                3279460  voluntary context switches
                1401644  involuntary context switches
          3271486751853  instructions retired
           770247407285  cycles elapsed
              140297632  peak memory footprint
   ```
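
To put the memory figures above in more readable units, here is a small sketch that converts them to MiB. It assumes the values are in bytes, which is how `/usr/bin/time -l` reports them on macOS (the `/Users/...` paths suggest that platform); the helper name `bytes_to_mib` is hypothetical.

```rust
/// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
/// (Hypothetical helper for illustration.)
fn bytes_to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    // Values copied from the `/usr/bin/time -l` output above,
    // assumed to be byte counts.
    let max_rss: u64 = 153_337_856;
    let peak_footprint: u64 = 140_297_632;

    println!("maximum resident set size: {:.1} MiB", bytes_to_mib(max_rss));
    println!("peak memory footprint:     {:.1} MiB", bytes_to_mib(peak_footprint));
}
```

Under that assumption, the run peaked at roughly 146 MiB resident (about 134 MiB footprint).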


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

