2010YOUY01 opened a new pull request, #16819:
URL: https://github.com/apache/datafusion/pull/16819

   ## Which issue does this PR close?
   
   
   - NA
   
   ## Rationale for this change
   
   
   The NLJ (nested loop join) operator still has room to improve in both performance and memory consumption, and it has recently attracted interest from the community (cc @jonathanc-n).
   
   Inspired by the benchmarks used by @UBarney in https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295, this PR adds a similar micro-benchmark for NLJ to the DataFusion benchmark suite.
   
   ## What changes are included in this PR?
   
   A new micro-benchmark for NLJ in the benchmark suite (`./bench.sh ...`).
   
   The queries, and the query characteristics they vary, can be found in the source.
   
   The special join types (semi/anti/mark) are not included, because I'm not sure what a typical workload for those joins looks like.
   
   The benchmark runner has a validation step to ensure that each query uses NLJ in its physical plan.
   Also, the optimizer currently does not reorder joins, so the execution order 
follows the join order in the SQL string. (I wish there were an option to 
explicitly enforce this behavior.)
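   
   A minimal sketch of what such a validation step could look like, assuming the check is done on the textual form of the physical plan and looks for the `NestedLoopJoinExec` operator name. The helper `plan_uses_nlj` and the sample plan strings are hypothetical and only illustrate the idea; the actual runner may do this differently.
   
   ```rust
   /// Hypothetical helper (not taken from the PR): returns true if the
   /// textual physical plan mentions a nested loop join operator.
   fn plan_uses_nlj(plan_text: &str) -> bool {
       plan_text
           .lines()
           .any(|line| line.contains("NestedLoopJoinExec"))
   }
   
   fn main() {
       // Illustrative plan fragments, written by hand for this example.
       let nlj_plan = "NestedLoopJoinExec: join_type=Inner\n  DataSourceExec\n  DataSourceExec";
       let hash_plan = "HashJoinExec: mode=Partitioned\n  DataSourceExec\n  DataSourceExec";
   
       for (name, plan) in [("nlj_plan", nlj_plan), ("hash_plan", hash_plan)] {
           if plan_uses_nlj(plan) {
               println!("{name}: uses NLJ");
           } else {
               println!("{name}: does not use NLJ, benchmark query would be rejected");
           }
       }
   }
   ```
   
   Failing fast when the operator is missing keeps the benchmark from silently measuring a different join implementation if planner defaults change.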
   
   ## Are these changes tested?
   
   
   I tested it locally:
   
   <details>
   
   <summary> Bench Run </summary>
   
   ```sh
   yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh data nlj
   ***************************
   DataFusion Benchmark Runner and Data Generator
   COMMAND: data
   BENCHMARK: nlj
   DATA_DIR: /Users/yongting/Code/datafusion/benchmarks/data
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***************************
   NLJ benchmark does not require data generation
   
   yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh run nlj
   ***************************
   DataFusion Benchmark Script
   COMMAND: run
   BENCHMARK: nlj
   QUERY: All
   DATAFUSION_DIR: /Users/yongting/Code/datafusion/benchmarks/..
   BRANCH_NAME: nlj-bench
   DATA_DIR: /Users/yongting/Code/datafusion/benchmarks/data
   RESULTS_DIR: /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***************************
   RESULTS_FILE: /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json
   Running nlj benchmark...
   + cargo run --release --bin dfbench -- nlj --iterations 5 -o /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json
   
   Compiling ...
   
   Running NLJ benchmarks with the following options: RunOpt {
       query_name: None,
       common: CommonOpt {
           iterations: 5,
           partitions: None,
           batch_size: None,
           mem_pool_type: "fair",
           memory_limit: None,
           sort_spill_reservation_bytes: None,
           debug: false,
       },
       output_path: Some(
           "/Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json",
       ),
   }
   
   Query q1 iteration 0 returned 100000 rows in 287.247375ms
   Query q1 iteration 1 returned 100000 rows in 285.833ms
   Query q1 iteration 2 returned 100000 rows in 245.063084ms
   Query q1 iteration 3 returned 100000 rows in 206.90325ms
   Query q1 iteration 4 returned 100000 rows in 207.072917ms
   Query q2 iteration 0 returned 20000000 rows in 254.630083ms
   Query q2 iteration 1 returned 20000000 rows in 246.942708ms
   Query q2 iteration 2 returned 20000000 rows in 239.448709ms
   Query q2 iteration 3 returned 20000000 rows in 240.270583ms
   Query q2 iteration 4 returned 20000000 rows in 251.336291ms
   Query q3 iteration 0 returned 90000000 rows in 446.120291ms
   Query q3 iteration 1 returned 90000000 rows in 453.314375ms
   Query q3 iteration 2 returned 90000000 rows in 358.530208ms
   Query q3 iteration 3 returned 90000000 rows in 394.261916ms
   Query q3 iteration 4 returned 90000000 rows in 453.936083ms
   Query q4 iteration 0 returned 180000000 rows in 1.118616083s
   Query q4 iteration 1 returned 180000000 rows in 1.037793375s
   Query q4 iteration 2 returned 180000000 rows in 952.131541ms
   Query q4 iteration 3 returned 180000000 rows in 962.842834ms
   Query q4 iteration 4 returned 180000000 rows in 1.056383333s
   Query q5 iteration 0 returned 2000000 rows in 572.229083ms
   Query q5 iteration 1 returned 2000000 rows in 611.111917ms
   Query q5 iteration 2 returned 2000000 rows in 836.5735ms
   Query q5 iteration 3 returned 2000000 rows in 622.4575ms
   Query q5 iteration 4 returned 2000000 rows in 579.447708ms
   Query q6 iteration 0 returned 2000000 rows in 9.371356959s
   Query q6 iteration 1 returned 2000000 rows in 6.032997291s
   Query q6 iteration 2 returned 2000000 rows in 5.728677125s
   Query q6 iteration 3 returned 2000000 rows in 6.046709958s
   Query q6 iteration 4 returned 2000000 rows in 5.766419917s
   Query q7 iteration 0 returned 2000000 rows in 790.340125ms
   Query q7 iteration 1 returned 2000000 rows in 654.001709ms
   Query q7 iteration 2 returned 2000000 rows in 860.251ms
   Query q7 iteration 3 returned 2000000 rows in 531.644959ms
   Query q7 iteration 4 returned 2000000 rows in 525.802541ms
   Query q8 iteration 0 returned 2000000 rows in 9.162710916s
   Query q8 iteration 1 returned 2000000 rows in 5.64653225s
   Query q8 iteration 2 returned 2000000 rows in 5.505889417s
   Query q8 iteration 3 returned 2000000 rows in 5.58156175s
   Query q8 iteration 4 returned 2000000 rows in 5.635720625s
   Query q9 iteration 0 returned 900000 rows in 875.642083ms
   Query q9 iteration 1 returned 900000 rows in 655.309166ms
   Query q9 iteration 2 returned 900000 rows in 653.490167ms
   Query q9 iteration 3 returned 900000 rows in 655.535958ms
   Query q9 iteration 4 returned 900000 rows in 655.982292ms
   Query q10 iteration 0 returned 810000000 rows in 2.26567725s
   Query q10 iteration 1 returned 810000000 rows in 2.690937042s
   Query q10 iteration 2 returned 810000000 rows in 3.48998175s
   Query q10 iteration 3 returned 810000000 rows in 3.145351041s
   Query q10 iteration 4 returned 810000000 rows in 5.294884292s
   + set +x
   Done
   
   yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh compare nlj-bench nlj-bench
   Comparing nlj-bench and nlj-bench
   --------------------
   --------------------
   Benchmark nlj.json
   --------------------
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
   ┃ Query        ┃  nlj-bench ┃  nlj-bench ┃    Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
   │ QQuery q1    │  206.90 ms │  206.90 ms │ no change │
   │ QQuery q2    │  239.45 ms │  239.45 ms │ no change │
   │ QQuery q3    │  358.53 ms │  358.53 ms │ no change │
   │ QQuery q4    │  952.13 ms │  952.13 ms │ no change │
   │ QQuery q5    │  572.23 ms │  572.23 ms │ no change │
   │ QQuery q6    │ 5728.68 ms │ 5728.68 ms │ no change │
   │ QQuery q7    │  525.80 ms │  525.80 ms │ no change │
   │ QQuery q8    │ 5505.89 ms │ 5505.89 ms │ no change │
   │ QQuery q9    │  653.49 ms │  653.49 ms │ no change │
   │ QQuery q10   │ 2265.68 ms │ 2265.68 ms │ no change │
   └──────────────┴────────────┴────────────┴───────────┘
   ```
   
   </details>
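   
   In this run, each value in the comparison table matches the fastest of the five iterations for that query (for example, q1's 206.90 ms corresponds to iteration 3 at 206.90325 ms). A minimal sketch of that reduction, assuming the comparison reports the minimum; this is inferred from the output above, not confirmed from the script's source, and `fastest_ms` is a hypothetical helper.
   
   ```rust
   /// Hypothetical helper: fastest iteration time in milliseconds,
   /// or None if no iterations were recorded.
   fn fastest_ms(iterations_ms: &[f64]) -> Option<f64> {
       iterations_ms.iter().copied().reduce(f64::min)
   }
   
   fn main() {
       // q1 iteration times (ms), copied from the benchmark run above.
       let q1 = [287.247375, 285.833, 245.063084, 206.90325, 207.072917];
   
       match fastest_ms(&q1) {
           Some(best) => println!("q1 fastest: {best:.2} ms"),
           None => println!("q1: no iterations recorded"),
       }
   }
   ```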
   
   ## Are there any user-facing changes?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

