mridulm commented on PR #50630: URL: https://github.com/apache/spark/pull/50630#issuecomment-2814730525
> But that's also bad for an indeterminate stage as the data is inconsistent. I mean the committed partitions are coming from a previous old computation and not from the latest one, but the resubmitted ones are coming from the new one.

If the parent map stage was indeterminate, the existing Spark code would already have aborted the stage.

As you have pointed out in the test in this PR, there is a gap in the existing implementation: when there is a shuffle loss due to executor/host failure (and not due to fetch failure), the check for determinism was not being performed; so if shuffle files are lost for an indeterminate stage, the check to abort its child stages was not being done. This is indeed a bug which needs to be addressed, and I have proposed two options for it. But that does not require failing the result stage, even if it is indeterminate.

> So if we write the df to a table and some but not all tasks were successful and a resubmit happened, we might have an inconsistent result where sum(cnt) won't be 1000 when we load back the data, as the resubmit might rerun the shuffle map stage, which regenerated the random values but with a different distribution of the values from 0 to 10. The complete shuffle map stage re-executed, but the result stage did not.

This will not happen; please see above. "Some but not all tasks were successful and a resubmit happened": if it results in re-execution of the parent stage, the job will be aborted. If it does not result in re-execution of the parent stage, the computation is deterministic. It is essentially: "WITH foo AS (SELECT key, count(key) AS cnt FROM <constant table> GROUP BY rand) SELECT SUM(cnt) FROM foo", which will always give the same result.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
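To see why the aggregate is invariant, here is a toy illustration (plain Python, not Spark code; the function name and seeds are hypothetical): however the random key distributes the 1000 rows across groups, the sum of the per-group counts is always the total row count.

```python
import random

def sum_of_group_counts(seed: int, n_rows: int = 1000, n_keys: int = 10) -> int:
    """Group n_rows rows by a random key and return SUM(cnt) over the groups.

    The per-group counts differ from run to run (the grouping is
    nondeterministic), but their sum is always n_rows, which is why the
    result stage's output stays the same even if the map stage is re-run
    with a different random distribution.
    """
    rng = random.Random(seed)
    counts: dict[int, int] = {}
    for _ in range(n_rows):
        key = rng.randrange(n_keys)  # stands in for GROUP BY rand
        counts[key] = counts.get(key, 0) + 1
    return sum(counts.values())

# Two different "re-executions" group the rows differently...
print(sum_of_group_counts(seed=1))  # 1000
print(sum_of_group_counts(seed=2))  # 1000 -- same aggregate either way
```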