Re: [PR] Add h2o window benchmark [datafusion]

via GitHub Mon, 12 May 2025 11:05:08 -0700


alamb commented on code in PR #16003:
URL: https://github.com/apache/datafusion/pull/16003#discussion_r2085183657



##########
benchmarks/README.md:
##########
@@ -591,49 +599,46 @@ For example, to run query 1 with the small data generated above:
 cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv  --query 1
 ```
 
-## h2o benchmarks for join
+### h2o benchmarks for join
 
-### Generate data for h2o benchmarks
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.
 
-1. Generate small data (4 table files, the largest is 1e7 rows)
+Here is a example to generate `small` dataset and run the benchmark. To run other 
+dataset size configuration, change the command similar to the previous example.
+
 ```bash
+# Generate small data (4 table files, the largest is 1e7 rows)
 ./bench.sh data h2o_small_join
+
+# Run the benchmark
+./bench.sh run h2o_small_join
 ```
 
+To run a specific query with a specific join data paths, the data paths are including 4 table files.
 
-2. Generate medium data (4 table files, the largest is 1e8 rows)
+For example, to run query 1 with the small data generated above:
 ```bash
-./bench.sh data h2o_medium_join
+cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
 ```
 
-3. Generate large data (4 table files, the largest is 1e9 rows)
-```bash
-./bench.sh data h2o_big_join
-```
+### h2o benchmarks for window
 
-### Run h2o benchmarks
-There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
-1. Run small data benchmark
-```bash
-./bench.sh run h2o_small_join
-```
+H2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`.

Review Comment:
   Can we please make it clear this is an "extended" benchmark (and not part of 
the "standard" h2o benchmark)?



##########
benchmarks/queries/h2o/groupby.sql:
##########
@@ -1,10 +1,19 @@
 SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1;
+

Review Comment:
   I personally would prefer we keep a single line per query, even at the 
expense of readability, so that the benchmark runners remain consistent. 
However, I don't feel strongly enough to recommend changes to this PR.
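
   To illustrate the consistency concern, here is a minimal sketch, not taken 
from the PR or from `dfbench`. The helper names `queries_by_line` and 
`queries_by_semicolon` are hypothetical, and the second query in the sample is 
made up for illustration. It assumes one runner treats each non-empty line as a 
query while another splits on `;`; how the real runners load the file has not 
been checked here.

```rust
/// Hypothetical loader: every non-empty, non-comment line is one query.
fn queries_by_line(sql: &str) -> Vec<String> {
    sql.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with("--"))
        .map(|line| line.trim_end_matches(';').to_string())
        .collect()
}

/// Hypothetical loader: split on `;`, so a query may span several lines.
/// Naive: does not handle semicolons inside string literals or comments.
fn queries_by_semicolon(sql: &str) -> Vec<String> {
    sql.split(';')
        // Collapse newlines and runs of spaces into single spaces.
        .map(|q| q.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|q| !q.is_empty())
        .collect()
}

fn main() {
    // First query is from groupby.sql; the multi-line one is illustrative only.
    let sql = "SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1;\n\n\
               SELECT id1, id2,\n  SUM(v1) AS v1\nFROM x GROUP BY id1, id2;\n";

    println!("by line:      {} entries", queries_by_line(sql).len());
    println!("by semicolon: {} entries", queries_by_semicolon(sql).len());
}
```

   Under these assumptions the two loaders disagree on how many queries the 
file contains once a query spans several lines, so `--query N` could refer to 
different statements depending on the runner.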



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
