[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-51272:
-------------------------
    Labels: spark-core  (was: )

> Race condition in DAGScheduler can result in failure to retry all 
> partitions for a non-deterministic partitioning key
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-51272
>                 URL: https://issues.apache.org/jira/browse/SPARK-51272
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.4
>            Reporter: Asif
>            Priority: Major
>              Labels: spark-core
>             Fix For: 4.0.0
>
>         Attachments: BugTest.txt, bugrepro.patch
>
>
> In the DAGScheduler, when a successful task completion occurs concurrently with 
> a task failure for an indeterminate stage, only some partitions end up being 
> retried instead of all of them. This results in data loss.
> The race condition identified is as follows (see the sketch after this list):
> a) A successful result-stage task has completed, but its completion has not yet 
> been recorded: its entry in the boolean array tracking partition 
> success/failure is still false.
> b) A concurrent failed result task, belonging to an indeterminate stage, 
> identifies all the stages which need to (and can) be rolled back. For the 
> result stage, it looks into the array of successful partitions. As none is 
> marked true, the ResultStage and its dependent stages are delegated to the 
> thread pool for retry.
> c) Between the time the stages to roll back are collected and the time the 
> retry is submitted, the successful task marks its boolean entry as true.
> d) As a result, the stage retry misses the partition that was just marked 
> successful.
>  
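> The sketch below models this check-then-act sequence in isolation. It is not 
> the actual DAGScheduler code; the names (RaceSketch, partitionFinished, the 
> latches) are illustrative only, and the latches merely force the interleaving 
> described in (a)-(d):
> {code:scala}
> import java.util.concurrent.CountDownLatch
> 
> // Illustrative model of the suspected race; names are hypothetical,
> // not Spark internals.
> object RaceSketch extends App {
>   val numPartitions = 2
>   // Analogue of the result stage's per-partition success flags.
>   val partitionFinished = Array.fill(numPartitions)(false)
> 
>   val failureTookSnapshot = new CountDownLatch(1)
>   val successRecorded     = new CountDownLatch(1)
> 
>   // Thread 1: a successful task whose completion is not yet recorded (step a).
>   val successHandler = new Thread(() => {
>     failureTookSnapshot.await()          // completion is processed only after
>     partitionFinished.synchronized {     // the failure handler took its snapshot
>       partitionFinished(0) = true        // step (c)
>     }
>     successRecorded.countDown()
>   })
> 
>   // Thread 2: a failed task of an indeterminate stage deciding what to retry.
>   val failureHandler = new Thread(() => {
>     // Step (b): snapshot of finished partitions -- none are marked yet.
>     val snapshot = partitionFinished.synchronized(partitionFinished.clone())
>     failureTookSnapshot.countDown()
>     successRecorded.await()              // the success lands in between
>     // Step (d): only partitions that are *currently* unfinished get
>     // resubmitted, silently skipping partition 0.
>     val toRetry = partitionFinished.indices.filterNot(partitionFinished(_))
>     println(s"finished at failure time : ${snapshot.mkString(",")}")  // false,false
>     println(s"partitions resubmitted   : ${toRetry.mkString(",")}")   // only 1
>   })
> 
>   successHandler.start(); failureHandler.start()
>   successHandler.join();  failureHandler.join()
> }
> {code}
>  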
> Attaching two files for reproducing the functional bug and demonstrating the 
> race condition causing data corruption:
>  # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It has a 
> lot of interception and tweaks to ensure that the system is able to hit the 
> data-loss situation (for example, each partition writes only a shuffle file 
> containing keys that evaluate to the same hashCode, and the shuffle file is 
> deleted at the right time, etc.).
>  # The BugTest itself.
> a) If the bugrepro.patch is applied to the current master and the BugTest is 
> run, it will fail immediately with an assertion failure: instead of 12 rows, 
> only 6 rows show up in the result.
> b) If the bugrepro.patch is applied on top of PR 
> https://github.com/apache/spark/pull/50029, then the BugTest will fail after 
> one or more iterations, indicating the race condition in the DAGScheduler/Stage 
> interaction.
> c) But if the same BugTest is run on a branch containing the fix for this bug 
> as well as PR https://github.com/apache/spark/pull/50029, it will pass in all 
> 100 iterations.
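>  
> For additional context (this is not the attached BugTest, which relies on the 
> bugrepro.patch interceptions): the kind of job exposed to this partial retry is 
> one whose shuffle uses a non-deterministic partitioning key, e.g. partitioning 
> on rand(), which can make the shuffle map stage indeterminate. A hypothetical, 
> minimal illustration of that query shape:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.rand
> 
> // Hypothetical example of a query whose shuffle key is non-deterministic.
> // It only illustrates the query shape; it does not by itself trigger the race.
> object IndeterminateShuffleExample {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("indeterminate-shuffle-example")
>       .master("local[4]")
>       .getOrCreate()
>     import spark.implicits._
> 
>     // Partitioning on rand() means a re-executed map task can route rows to
>     // different reducers than the original attempt did. If, after a failure,
>     // only some downstream partitions are retried, rows can be lost or
>     // duplicated (as in the 6-vs-12-row symptom described above).
>     val result = spark.range(0, 12).toDF("id")
>       .repartition(6, rand())
>       .groupBy(($"id" % 3).as("bucket"))
>       .count()
> 
>     result.show()
>     spark.stop()
>   }
> }
> {code}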



