SauronShepherd commented on PR #49724: URL: https://github.com/apache/spark/pull/49724#issuecomment-2623472398
I was expecting your first point since it's a behavioral change. However, given that keeping the default value as-is has a major impact on memory I thought it was worth discussing. I still believe that converting plans to strings shouldn't be Spark's default behavior unless strictly necessary. Besides, I'm not sure many Spark users would even notice it -well, actually maybe they would do, because of the performance improvement in their applications-. I've been working with Spark for over eight years, and whenever I needed to inspect a plan, I explicitly ran the explain method rather than relying on the Spark UI. However, I’ve often seen teams resort to checkpointing their DataFrames simply because Spark took what felt like an eternity to begin execution— and this month I've observed the same looking at GraphFrames code. I still don’t see the need for Spark to be so verbose internally; in my view, performance should take precedence over verbosity in logging. Regarding the new "off" value being placed at the end, the changes are minimal. However, to me, it makes more sense for "off" to come before the value that produces less content (i.e., "simple"), following a logical progression from lower to higher verbosity. As for the change in cachedName, it doesn’t yield the exact same result, but since it’s merely a literal identifier -a description- and not a key, this difference shouldn't matter. Like the previous changes, this modification is crucial to avoid traversing the entire tree of the plan each time a CachedRDDBuilder is created. Because even with the new explain mode off, despite avoiding the OOM, generating the string still has a significant impact in performance, even for not to large plans. That said, I have no problem in making these changes in my PR—I was just hoping to address an internal behavior that I’m not convinced Spark should have by default. Thanks for your feedback, @dongjoon-hyun Ángel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org