Ngone51 commented on PR #52606: URL: https://github.com/apache/spark/pull/52606#issuecomment-3516567220
I just realized that we cannot clean only the shuffle files while leaving the shuffle statuses in place. For cases like DataFrame queries, it is very common to reuse a DataFrame across queries. After the DataFrame is executed for the first time, its related shuffle files are all cleaned, but the shuffle statuses still exist. So when the DataFrame is reused to run queries, Spark would mistakenly assume the shuffle files are still there given the existing shuffle statuses, but then fail at runtime due to the missing shuffle files. I have pushed a new proposal, which tries to fail the query (usually the subquery) when the shuffle is no longer registered.
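To make the failure mode concrete, here is a minimal sketch of the stale-status trap. All names (`ShuffleRegistry`, `clean_files_only`, etc.) are hypothetical stand-ins, not Spark's actual classes; the point is only to show why cleaning files without also dropping the status leads to a runtime failure on reuse, and why unregistering the shuffle lets the (sub)query fail fast instead.

```python
# Hypothetical sketch, not Spark code: shuffle *statuses* outliving
# shuffle *files* breaks DataFrame reuse.

class ShuffleRegistry:
    def __init__(self):
        self.statuses = {}   # shuffle_id -> status (stand-in for tracker state)
        self.files = set()   # shuffle_ids whose map output files still exist

    def register(self, shuffle_id):
        self.statuses[shuffle_id] = {"id": shuffle_id}
        self.files.add(shuffle_id)

    def clean_files_only(self, shuffle_id):
        # The problematic cleanup: files go away, status stays behind.
        self.files.discard(shuffle_id)

    def unregister(self, shuffle_id):
        # Direction of the new proposal: drop the status together with
        # the files, so a reuse attempt can fail fast and cleanly.
        self.statuses.pop(shuffle_id, None)
        self.files.discard(shuffle_id)

    def read(self, shuffle_id):
        if shuffle_id not in self.statuses:
            # Shuffle no longer registered: fail the (sub)query up front.
            raise RuntimeError(f"shuffle {shuffle_id} is no longer registered")
        if shuffle_id not in self.files:
            # Stale-status trap: planning believes the data exists,
            # but the fetch fails mid-query at runtime.
            raise IOError(f"shuffle {shuffle_id}: status exists, files missing")
        return self.statuses[shuffle_id]
```

A reused DataFrame hits the `IOError` branch under files-only cleanup, whereas full unregistration surfaces the clean `RuntimeError` before any fetch is attempted.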
