timsaucer commented on PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#issuecomment-2721032760
I am testing my updated, consolidated code now. Running TPC-H queries at the 10 GB scale factor, I get comparable times for both q1 and q2, which take about 10 s and 53 s respectively on my M4 Pro at that scale. Next I tested at the 1 GB scale factor and then on the tiny batches that were discussed in #1015.

One metric is comparing `df.show()` with `df.__repr__()`. The former essentially exercises the previous code path, while the latter goes through the updated one. I also tested against main to get comparable values.

For q2, for example (times in seconds):

- `df.show()`: 53.881720781326294
- `df.__repr__()`: 52.33351922035217

When dropping down to the 1 GB data set:

- `df.show()` took: 0.8244500160217285
- `df.__repr__()` took: 0.8161180019378662

The same 1 GB data set against main:

- `df.show()` took: 0.8473942279815674
- `df.__repr__()` took: 0.8100850582122803

Finally, for the tiny dataset (increased to 3 record batches so that we do get multiple processing steps):

- Average runtime over 100 runs: 0.001016 seconds (this branch)
- Average runtime over 100 runs: 0.001011 seconds (main)

And lastly, to verify that it also resolves #1014:

<img width="984" alt="image" src="https://github.com/user-attachments/assets/1a0ccd6a-ad0c-475f-a6ca-da1758930423" />
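
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how `df.show()` and `df.__repr__()` could be timed against each other. It is not the script used for the numbers above. The helper names `average_runtime` and `compare_show_and_repr` are hypothetical, `time.perf_counter` is an assumed choice of clock, and `FakeFrame` is a stand-in so the snippet runs without datafusion installed. In practice you would pass a real DataFrame instead.

```python
import time


def average_runtime(fn, runs=100):
    """Return the mean wall-clock time in seconds of calling fn() `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def compare_show_and_repr(df, runs=100):
    """Time df.show() and repr(df) for any object that exposes both."""
    return {
        "show": average_runtime(df.show, runs),
        "repr": average_runtime(lambda: repr(df), runs),
    }


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; replace with a real one."""

    def show(self):
        # A real DataFrame would execute the plan and print a table here.
        pass

    def __repr__(self):
        return "FakeFrame()"


timings = compare_show_and_repr(FakeFrame(), runs=100)
print(f"df.show():     {timings['show']:.6f} seconds (average over 100 runs)")
print(f"df.__repr__(): {timings['repr']:.6f} seconds (average over 100 runs)")
```

For the larger scale factors a single run per call (`runs=1`) is probably enough, since each call already takes seconds. Averaging over many runs mostly matters for the tiny dataset, where a single call takes around a millisecond.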