UBarney commented on PR #16443:
URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295

   # benchmark
   I used this [script](https://gist.github.com/UBarney/9dcbf304e65f061d3352b34abd0f0e05#file-sql_bench-py) to run the benchmark.
   
   | ID | SQL | join_base Time (s) | join_limit_join_batch_size Time (s) | Performance Change |
   |----|-----|--------------------|-------------------------------------|--------------------|
   | 1 | select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value < t1.value * t2.value; | 0.852 | 0.548 | +35.70% |
   | 2 | select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 0.692 | 0.387 | +44.03% |
   | 3 | select t1.value from range(8192) t1 right join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 0.707 | 0.386 | +45.35% |
   | 4 | select t1.value from range(8192) t1 join range(81920) t2 on t1.value + t2.value < t1.value * t2.value; | Failed | 1.680 | N/A |
   | 5 | select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value > t1.value * t2.value; | 0.321 | 0.078 | +75.75% |
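For reference, here is a minimal sketch of how the "Performance Change" column can be reproduced. It assumes the column is the relative reduction in elapsed time, `(base - new) / base`. The function name `percent_improvement` is hypothetical and not taken from the linked script. Using the rounded times from the table, the results land within about 0.1 percentage points of the reported values, presumably because the table was computed from unrounded timings.

```python
def percent_improvement(base_seconds: float, new_seconds: float) -> float:
    """Relative reduction in elapsed time, in percent (positive means faster)."""
    if base_seconds <= 0:
        raise ValueError("base time must be positive")
    return (base_seconds - new_seconds) / base_seconds * 100.0


# Rounded timings copied from the table: (query ID, join_base, join_limit_join_batch_size).
timings = [
    (1, 0.852, 0.548),
    (2, 0.692, 0.387),
    (3, 0.707, 0.386),
    (5, 0.321, 0.078),
]

for query_id, base, new in timings:
    print(f"query {query_id}: {percent_improvement(base, new):+.2f}%")
```

Query 4 is left out because `join_base` failed on it, so there is no baseline time to compare against.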
   
   I'll investigate where the performance improvement comes from.
   
   # memory usage
   
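   This version can join large left and right tables without the OOM errors seen on the main branch. In the run below, this branch finishes with a maximum resident set size of 8100012 KB, while `join_base` is killed by signal 9 after reaching 28959396 KB.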
   
   <details>
   ```
   /usr/bin/time -v ./target/release/join_limit_join_batch_size --maxrows 1 
   DataFusion CLI v48.0.0
   > select t1.value from range(81920) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value;
   +-------+
   | value |
   +-------+
   | 1     |
   | .     |
   | .     |
   | .     |
   +-------+
   180219 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
   Elapsed 4.653 seconds.
   
           Command being timed: "./target/release/join_limit_join_batch_size --maxrows 1"
           User time (seconds): 2.98
           System time (seconds): 1.67
           Percent of CPU this job got: 24%
           Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.00
           Average shared text size (kbytes): 0
           Average unshared data size (kbytes): 0
           Average stack size (kbytes): 0
           Average total size (kbytes): 0
           Maximum resident set size (kbytes): 8100012
           Average resident set size (kbytes): 0
           Major (requiring I/O) page faults: 0
           Minor (reclaiming a frame) page faults: 1967478
           Voluntary context switches: 177
           Involuntary context switches: 10
           Swaps: 0
           File system inputs: 0
           File system outputs: 0
           Socket messages sent: 0
           Socket messages received: 0
           Signals delivered: 0
           Page size (bytes): 4096
           Exit status: 0
   
   /usr/bin/time -v ./target/release/join_base --maxrows 1 
   DataFusion CLI v48.0.0
   > select t1.value from range(81920) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value;
   Command terminated by signal 9
           Command being timed: "./target/release/join_base --maxrows 1"
           User time (seconds): 4.04
           System time (seconds): 5.99
           Percent of CPU this job got: 78%
           Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.80
           Average shared text size (kbytes): 0
           Average unshared data size (kbytes): 0
           Average stack size (kbytes): 0
           Average total size (kbytes): 0
           Maximum resident set size (kbytes): 28959396
           Average resident set size (kbytes): 0
           Major (requiring I/O) page faults: 954
           Minor (reclaiming a frame) page faults: 7216264
           Voluntary context switches: 1181
           Involuntary context switches: 29
           Swaps: 0
           File system inputs: 238720
           File system outputs: 0
           Socket messages sent: 0
           Socket messages received: 0
           Signals delivered: 0
           Page size (bytes): 4096
           Exit status: 0
   ```
   </details>
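To compare the two runs at a glance, the peak memory can be pulled out of the `/usr/bin/time -v` output and converted to GiB. This is only a sketch: the helper name `max_rss_gib` is hypothetical, and it assumes GNU time reports "kbytes" in units of 1024 bytes. Because `join_base` was killed by signal 9, its figure is a lower bound on what the query would have needed.

```python
import re


def max_rss_gib(time_output: str) -> float:
    """Extract 'Maximum resident set size' from `/usr/bin/time -v` output, in GiB."""
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    kib = int(match.group(1))  # assumed to be KiB (1024 bytes)
    return kib * 1024 / 2**30


# Lines copied from the two runs above.
new_run = "Maximum resident set size (kbytes): 8100012"
base_run = "Maximum resident set size (kbytes): 28959396"

new_gib = max_rss_gib(new_run)
base_gib = max_rss_gib(base_run)
print(f"join_limit_join_batch_size: {new_gib:.2f} GiB")
print(f"join_base (killed):         {base_gib:.2f} GiB")
print(f"ratio: {base_gib / new_gib:.2f}x")
```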


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

