UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295
# benchmark

I used this [script](https://gist.github.com/UBarney/9dcbf304e65f061d3352b34abd0f0e05#file-sql_bench-py) to run the benchmark.

| ID | SQL | join_base Time(s) | join_limit_join_batch_size Time(s) | Performance Change |
|----|-----|-------------|------------|-------------------|
| 1 | select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value < t1.value * t2.value; | 0.852 | 0.548 | +35.70% |
| 2 | select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 0.692 | 0.387 | +44.03% |
| 3 | select t1.value from range(8192) t1 right join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 0.707 | 0.386 | +45.35% |
| 4 | select t1.value from range(8192) t1 join range(81920) t2 on t1.value + t2.value < t1.value * t2.value; | Failed | 1.680 | N/A |
| 5 | select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value > t1.value * t2.value; | 0.321 | 0.078 | +75.75% |

I'll find out why there is a performance improvement.

# memory usage

This version now supports joining large left and right tables, which avoids the OOM errors previously seen on the main branch.

<details>

```
/usr/bin/time -v ./target/release/join_limit_join_batch_size --maxrows 1
DataFusion CLI v48.0.0
> select t1.value from range(81920) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value;
+-------+
| value |
+-------+
| 1     |
| .     |
| .     |
| .     |
+-------+
180219 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 4.653 seconds.

	Command being timed: "./target/release/join_limit_join_batch_size --maxrows 1"
	User time (seconds): 2.98
	System time (seconds): 1.67
	Percent of CPU this job got: 24%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.00
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 8100012
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1967478
	Voluntary context switches: 177
	Involuntary context switches: 10
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

/usr/bin/time -v ./target/release/join_base --maxrows 1
DataFusion CLI v48.0.0
> select t1.value from range(81920) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value;
Command terminated by signal 9
	Command being timed: "./target/release/join_base --maxrows 1"
	User time (seconds): 4.04
	System time (seconds): 5.99
	Percent of CPU this job got: 78%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.80
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 28959396
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 954
	Minor (reclaiming a frame) page faults: 7216264
	Voluntary context switches: 1181
	Involuntary context switches: 29
	Swaps: 0
	File system inputs: 238720
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```

</details>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
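For readers who want to sanity-check the figures above, here is a minimal sketch of how the "Performance Change" column and the peak memory numbers can be read. It assumes the change is computed as `(base - new) / base`; the linked script may compute it differently, and the helper names `percent_improvement` and `kib_to_gib` are hypothetical. Because the table shows times rounded to three decimals, the recomputed percentages can differ from the table by a few hundredths of a percent.

```python
# Timings in seconds, copied from the benchmark table as
# (join_base, join_limit_join_batch_size). Row 4 is omitted because
# join_base failed on that query.
timings = {
    1: (0.852, 0.548),
    2: (0.692, 0.387),
    3: (0.707, 0.386),
    5: (0.321, 0.078),
}


def percent_improvement(base: float, new: float) -> float:
    """Reduction in run time as a percentage of the base time.

    Assumed formula; not confirmed against the benchmark script.
    """
    return (base - new) / base * 100.0


def kib_to_gib(kib: int) -> float:
    """Convert the kbytes value reported by `/usr/bin/time -v` to GiB."""
    return kib / (1024 * 1024)


for query_id, (base, new) in timings.items():
    print(f"query {query_id}: {percent_improvement(base, new):+.2f}%")

# "Maximum resident set size (kbytes)" values from the two runs above.
print(f"join_limit_join_batch_size peak RSS: {kib_to_gib(8100012):.2f} GiB")
print(f"join_base peak RSS before SIGKILL: {kib_to_gib(28959396):.2f} GiB")
```

Read this way, the new branch peaks at under 8 GiB on the 81920 x 8192 join, while `join_base` had grown past 27 GiB when it was killed.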
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org