[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Asif updated SPARK-51272:
-------------------------
    Attachment: BugTest.txt

> Race condition in DagScheduler can result in failure of retrying all
> partitions for non deterministic partitioning key
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-51272
>                 URL: https://issues.apache.org/jira/browse/SPARK-51272
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.4
>            Reporter: Asif
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: BugTest.txt
>
>
> In the DAGScheduler, a successful task completion that occurs concurrently
> with a task failure for an indeterminate stage can result in a situation
> where, instead of re-executing all partitions, only some are retried. This
> results in data loss.
> The race condition identified is as follows:
> a) A successful result-stage task has not yet been recorded as true in the
> boolean array tracking partition success/failure.
> b) A concurrent failed result task, belonging to an indeterminate stage,
> identifies all the stages which need to / can be rolled back. For the
> ResultStage, it looks into the array of successful partitions. As none is
> marked as true, the ResultStage and its dependent stages are delegated to
> the thread pool for retry.
> c) Between the time the stages to roll back are collected and the retry of
> those stages, the successful task marks its boolean as true.
> d) The retry of the stage, as a result, misses the partition that was marked
> as successful.
>
> Attaching two files for reproducing the functional bug and showing the race
> condition causing data corruption:
> # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It has
> lots of interception and tweaks to ensure that the system is able to hit the
> data-loss situation (like each partition writing only a shuffle file
> containing keys that evaluate to the same hashCode, and deleting the shuffle
> file at the right time, etc.).
> # The BugTest itself.
> a) If the bugrepro.patch is applied to current master and the BugTest is run,
> it will fail immediately with an assertion failure where, instead of 12 rows,
> 6 rows show up in the result.
> b) If the bugrepro.patch is applied on top of PR
> https://github.com/apache/spark/pull/50029, then the BugTest will fail after
> one, two or more iterations, indicating the race condition in the
> DAGScheduler/Stage interaction.
> c) But if the same BugTest is run on a branch containing the fix for this bug
> as well as PR https://github.com/apache/spark/pull/50029, it will pass in all
> 100 iterations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
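For illustration, a minimal, self-contained Scala sketch of the interleaving described in steps a) through d) of the issue above. All names here (partitionFinished, failureHandler, successHandler, the latches) are hypothetical and do not correspond to the actual DAGScheduler code: two threads share a boolean array of partition outcomes, the failure handler collects the retry set while no partition is marked successful, the concurrent success then flips the flag, and a retry that re-consults the array skips that partition.

{code:scala}
// Minimal sketch of the race in steps a)-d): hypothetical names, not Spark code.
// Two threads share a boolean array tracking which result partitions finished.
import java.util.concurrent.CountDownLatch

object IndeterminateRetryRaceSketch {

  def main(args: Array[String]): Unit = {
    val numPartitions = 2
    // partitionFinished(i) == true means partition i's result has been accepted.
    val partitionFinished = Array.fill(numPartitions)(false)

    val rollbackCollected = new CountDownLatch(1)
    val successRecorded   = new CountDownLatch(1)

    // Handler for the FAILED task of an indeterminate stage: it collects the
    // partitions to re-run by reading the shared array (step b).
    val failureHandler = new Thread(() => {
      val collectedForRetry =
        (0 until numPartitions).filterNot(i => partitionFinished(i)) // sees {0, 1}
      rollbackCollected.countDown()
      successRecorded.await()       // step c happens in this window
      // Step d: by the time the retry is actually submitted, partition 0 is
      // marked finished, so a retry that re-checks the array skips it even
      // though its output came from the old, indeterminate shuffle data.
      val actuallyRetried = collectedForRetry.filterNot(i => partitionFinished(i))
      println(s"collected for retry: $collectedForRetry, actually retried: $actuallyRetried")
    })

    // Handler for the SUCCESSFUL task: records its partition as finished only
    // after the rollback set has been collected (steps a and c).
    val successHandler = new Thread(() => {
      rollbackCollected.await()
      partitionFinished(0) = true
      successRecorded.countDown()
    })

    failureHandler.start(); successHandler.start()
    failureHandler.join();  successHandler.join()
    // Prints: collected for retry: Vector(0, 1), actually retried: Vector(1)
  }
}
{code}

The sketch only illustrates why re-reading the success array after the rollback set has been collected can drop a partition from the retry; it does not reflect how the actual fix or PR 50029 is implemented.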