mridulm commented on PR #50630: URL: https://github.com/apache/spark/pull/50630#issuecomment-2814730525
> But that's also bad for an indeterminate stage as the data is inconsistent. I mean the committed partitions are coming from a previous old computation and not from the latest one, but the resubmitted ones are coming from the new one.

If the parent map stage was indeterminate, the existing Spark code would already have aborted the stage.

As you have pointed out in the test in this PR, there is a gap in the existing implementation: when there is a shuffle loss due to executor/host failure (and not due to fetch failure), the check for determinism was not being performed; so if shuffle files are lost for an indeterminate stage, the check to abort its child stages was not being done. This is indeed a bug which needs to be addressed, and I have proposed two options for it. But that does not require failing the result stage, even if it is indeterminate.

> So if we write the df to a table and some but not all tasks were successful and a resubmit happened, we might have an inconsistent result where sum(cnt) won't be 1000 when we load back the data, as the resubmit might rerun the shuffle map stage, which regenerated the random values but with a different distribution of the values from 0 to 10. The complete shuffle map stage re-executed, but the result stage did not.

This will not happen; please see above. "Some but not all tasks were successful and a resubmit happened": if it results in re-execution of the parent stage, the job will be aborted. If it does not result in re-execution of the parent stage, the computation is deterministic. It is essentially: "WITH foo AS (SELECT key, count(key) AS cnt FROM <constant table> GROUP BY rand) SELECT SUM(cnt) FROM foo", which will always give the same result.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
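To see why the aggregate is invariant, here is a toy illustration (plain Python, not Spark code; the function name and seeds are hypothetical): however the random key distributes the 1000 rows across groups, the sum of the per-group counts is always the total row count.

```python
import random

def sum_of_group_counts(seed: int, n_rows: int = 1000, n_keys: int = 10) -> int:
    """Group n_rows rows by a random key and return SUM(cnt) over the groups.

    The per-group counts differ from run to run (the grouping is
    nondeterministic), but their sum is always n_rows, which is why the
    result stage's output stays the same even if the map stage is re-run
    with a different random distribution.
    """
    rng = random.Random(seed)
    counts: dict[int, int] = {}
    for _ in range(n_rows):
        key = rng.randrange(n_keys)  # stands in for GROUP BY rand
        counts[key] = counts.get(key, 0) + 1
    return sum(counts.values())

# Two different "re-executions" group the rows differently...
print(sum_of_group_counts(seed=1))  # 1000
print(sum_of_group_counts(seed=2))  # 1000 -- same aggregate either way
```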