2010YOUY01 opened a new pull request, #16819: URL: https://github.com/apache/datafusion/pull/16819
## Which issue does this PR close?

- NA

## Rationale for this change

The nested loop join (NLJ) operator still has room to improve in performance and memory efficiency, and it has recently attracted interest from the community (cc @jonathanc-n). Inspired by the benchmarks used by @UBarney in https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295, this PR adds a similar micro-benchmark for NLJ to the DataFusion benchmark suite.

## What changes are included in this PR?

A new micro-benchmark for NLJ in the benchmark suite (`./bench.sh ...`).

The queries and the query characteristics they vary can be found in the source. The special joins (semi/anti/mark) are not included, because I'm not sure what a typical workload for those joins looks like.

The bench runner has a validation step to ensure that each query uses NLJ in its physical plan. Also, the optimizer currently does not reorder joins, so the execution order follows the join order in the SQL string. (I wish there were an option to explicitly enforce this behavior.)

## Are these changes tested?

I tested it locally:

<details>
<summary> Bench Run </summary>

```sh
yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh data nlj
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: nlj
DATA_DIR: /Users/yongting/Code/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
NLJ benchmark does not require data generation
yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh run nlj
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: nlj
QUERY: All
DATAFUSION_DIR: /Users/yongting/Code/datafusion/benchmarks/..
BRANCH_NAME: nlj-bench
DATA_DIR: /Users/yongting/Code/datafusion/benchmarks/data
RESULTS_DIR: /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json
Running nlj benchmark...
+ cargo run --release --bin dfbench -- nlj --iterations 5 -o /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json
Compiling ...
Running NLJ benchmarks with the following options: RunOpt {
    query_name: None,
    common: CommonOpt {
        iterations: 5,
        partitions: None,
        batch_size: None,
        mem_pool_type: "fair",
        memory_limit: None,
        sort_spill_reservation_bytes: None,
        debug: false,
    },
    output_path: Some(
        "/Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json",
    ),
}
Query q1 iteration 0 returned 100000 rows in 287.247375ms
Query q1 iteration 1 returned 100000 rows in 285.833ms
Query q1 iteration 2 returned 100000 rows in 245.063084ms
Query q1 iteration 3 returned 100000 rows in 206.90325ms
Query q1 iteration 4 returned 100000 rows in 207.072917ms
Query q2 iteration 0 returned 20000000 rows in 254.630083ms
Query q2 iteration 1 returned 20000000 rows in 246.942708ms
Query q2 iteration 2 returned 20000000 rows in 239.448709ms
Query q2 iteration 3 returned 20000000 rows in 240.270583ms
Query q2 iteration 4 returned 20000000 rows in 251.336291ms
Query q3 iteration 0 returned 90000000 rows in 446.120291ms
Query q3 iteration 1 returned 90000000 rows in 453.314375ms
Query q3 iteration 2 returned 90000000 rows in 358.530208ms
Query q3 iteration 3 returned 90000000 rows in 394.261916ms
Query q3 iteration 4 returned 90000000 rows in 453.936083ms
Query q4 iteration 0 returned 180000000 rows in 1.118616083s
Query q4 iteration 1 returned 180000000 rows in 1.037793375s
Query q4 iteration 2 returned 180000000 rows in 952.131541ms
Query q4 iteration 3 returned 180000000 rows in 962.842834ms
Query q4 iteration 4 returned 180000000 rows in 1.056383333s
Query q5 iteration 0 returned 2000000 rows in 572.229083ms
Query q5 iteration 1 returned 2000000 rows in 611.111917ms
Query q5 iteration 2 returned 2000000 rows in 836.5735ms
Query q5 iteration 3 returned 2000000 rows in 622.4575ms
Query q5 iteration 4 returned 2000000 rows in 579.447708ms
Query q6 iteration 0 returned 2000000 rows in 9.371356959s
Query q6 iteration 1 returned 2000000 rows in 6.032997291s
Query q6 iteration 2 returned 2000000 rows in 5.728677125s
Query q6 iteration 3 returned 2000000 rows in 6.046709958s
Query q6 iteration 4 returned 2000000 rows in 5.766419917s
Query q7 iteration 0 returned 2000000 rows in 790.340125ms
Query q7 iteration 1 returned 2000000 rows in 654.001709ms
Query q7 iteration 2 returned 2000000 rows in 860.251ms
Query q7 iteration 3 returned 2000000 rows in 531.644959ms
Query q7 iteration 4 returned 2000000 rows in 525.802541ms
Query q8 iteration 0 returned 2000000 rows in 9.162710916s
Query q8 iteration 1 returned 2000000 rows in 5.64653225s
Query q8 iteration 2 returned 2000000 rows in 5.505889417s
Query q8 iteration 3 returned 2000000 rows in 5.58156175s
Query q8 iteration 4 returned 2000000 rows in 5.635720625s
Query q9 iteration 0 returned 900000 rows in 875.642083ms
Query q9 iteration 1 returned 900000 rows in 655.309166ms
Query q9 iteration 2 returned 900000 rows in 653.490167ms
Query q9 iteration 3 returned 900000 rows in 655.535958ms
Query q9 iteration 4 returned 900000 rows in 655.982292ms
Query q10 iteration 0 returned 810000000 rows in 2.26567725s
Query q10 iteration 1 returned 810000000 rows in 2.690937042s
Query q10 iteration 2 returned 810000000 rows in 3.48998175s
Query q10 iteration 3 returned 810000000 rows in 3.145351041s
Query q10 iteration 4 returned 810000000 rows in 5.294884292s
+ set +x
Done
yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh compare nlj-bench nlj-bench
Comparing nlj-bench and nlj-bench
--------------------
--------------------
Benchmark nlj.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃  nlj-bench ┃  nlj-bench ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery q1    │  206.90 ms │  206.90 ms │ no change │
│ QQuery q2    │  239.45 ms │  239.45 ms │ no change │
│ QQuery q3    │  358.53 ms │  358.53 ms │ no change │
│ QQuery q4    │  952.13 ms │  952.13 ms │ no change │
│ QQuery q5    │  572.23 ms │  572.23 ms │ no change │
│ QQuery q6    │ 5728.68 ms │ 5728.68 ms │ no change │
│ QQuery q7    │  525.80 ms │  525.80 ms │ no change │
│ QQuery q8    │ 5505.89 ms │ 5505.89 ms │ no change │
│ QQuery q9    │  653.49 ms │  653.49 ms │ no change │
│ QQuery q10   │ 2265.68 ms │ 2265.68 ms │ no change │
└──────────────┴────────────┴────────────┴───────────┘
```

</details>

## Are there any user-facing changes?