holdenk commented on PR #54370: URL: https://github.com/apache/spark/pull/54370#issuecomment-3946934856
> 2. If I understand correctly, we stack data right? What does it mean for memory usage when the df is large?

Eventually we have to collect the summaries back to the driver, and they will all have to fit there anyway, even when we're executing in a loop, since we store them in Python and display them. If someone had a silly number of columns this could maybe become an issue, but the old approach wouldn't work well either. There might be a bit of extra data during the final merge steps when we're merging the aggregate objects, but if that ever became an issue we could look at `treeReduce` (though that would likely only happen in a degenerate case where the current implementation would also misbehave).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
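The `treeReduce` idea mentioned above can be sketched without Spark: instead of folding all partition summaries into one accumulator in a single pass, partials are merged pairwise in log-depth rounds, so no single merge step holds every partial at once. This is a minimal illustrative sketch, not the PR's actual code; the `count`/`min`/`max` summary schema and the helper names are hypothetical.

```python
from functools import reduce

def merge_summaries(a, b):
    """Merge two per-partition column-summary dicts (count/min/max).
    Hypothetical schema, for illustration only."""
    return {
        col: {
            "count": a[col]["count"] + b[col]["count"],
            "min": min(a[col]["min"], b[col]["min"]),
            "max": max(a[col]["max"], b[col]["max"]),
        }
        for col in a
    }

def tree_reduce(parts, merge):
    """Merge partials in log-depth rounds, pairing neighbors each round,
    mimicking what RDD.treeReduce does across executors."""
    while len(parts) > 1:
        parts = [
            merge(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
            for i in range(0, len(parts), 2)
        ]
    return parts[0]

# Partial summaries as they might come back from four partitions.
partials = [
    {"x": {"count": 10, "min": 0, "max": 5}},
    {"x": {"count": 7, "min": -3, "max": 2}},
    {"x": {"count": 4, "min": 1, "max": 9}},
    {"x": {"count": 2, "min": 0, "max": 1}},
]

flat = reduce(merge_summaries, partials)       # plain left-fold reduce
tree = tree_reduce(partials, merge_summaries)  # same result, log-depth merges
assert flat == tree == {"x": {"count": 23, "min": -3, "max": 9}}
```

In real PySpark the equivalent would be `rdd.treeReduce(merge_summaries, depth=2)`, which performs the intermediate merge rounds on the executors so the driver only ever receives the final merged summary.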
