ahshahid commented on PR #50033: URL: https://github.com/apache/spark/pull/50033#issuecomment-2695824630
> The provided unit test is invalid because it uses multiple threads to process events, whereas the production code has a single dedicated thread, so a race condition is not possible.
>
> If you still think it is a race condition and you can reproduce the problem with a standalone test, I suggest adding log lines to those places where you think the two threads are competing, and using the "%t" formatter in log4j2 to include the thread names in the log. In this case please attach the reproduction code without any production code change (only the new logging lines should be added; it is fine if the reproduction has to be retried 1000 times, as race conditions are flaky in nature, but I prefer the original production code) and attach the section of the logs where you think the race occurs.

An end-to-end reproduction of the bug without production code changes is not feasible for me (at least at this point) in a single-VM unit test.

The race condition is possible (and does happen) because the `DAGScheduler::handleTaskCompletion` method introduces asynchronicity through the following snippet of code:

```
messageScheduler.schedule(
  new Runnable {
    override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
  },
  DAGScheduler.RESUBMIT_TIMEOUT,
  TimeUnit.MILLISECONDS
)
```
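To make the threading in that snippet concrete, here is a minimal standalone sketch. It is not Spark code and it does not reproduce the reported bug. The names `DelayedPostSketch`, `Posted`, `TaskCompleted`, and `demo` are hypothetical stand-ins, and the queue plus the `event-loop` thread only approximate a single-threaded event loop. It shows one thing: events are always processed on the one loop thread, but the delayed `ResubmitFailedStages` is posted from the scheduler's thread, so its position in the queue relative to other posts depends on timing.

```scala
import java.util.concurrent.{CountDownLatch, Executors, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object DelayedPostSketch {
  // Hypothetical stand-ins for scheduler events; these are not Spark's classes.
  sealed trait Event
  case object ResubmitFailedStages extends Event
  final case class TaskCompleted(id: Int) extends Event

  // A queue entry that remembers which thread posted it.
  final case class Posted(event: Event, postingThread: String)

  // Returns (event, posting thread, processing thread) in processing order.
  def demo(delayMs: Long): Vector[(Event, String, String)] = {
    val queue = new LinkedBlockingQueue[Posted]()
    val done = new CountDownLatch(3)
    // Written only by the loop thread; read only after the latch opens.
    val processed = ArrayBuffer.empty[(Event, String, String)]

    def post(e: Event): Unit =
      queue.put(Posted(e, Thread.currentThread().getName))

    // The single dedicated thread that drains the queue.
    val loop = new Thread(new Runnable {
      override def run(): Unit =
        try {
          while (true) {
            val p = queue.take()
            processed += ((p.event, p.postingThread, Thread.currentThread().getName))
            done.countDown()
          }
        } catch { case _: InterruptedException => () }
    }, "event-loop")
    loop.setDaemon(true)
    loop.start()

    // Same shape as the snippet above: the post happens later, on another thread.
    val messageScheduler = Executors.newSingleThreadScheduledExecutor()
    messageScheduler.schedule(
      new Runnable { override def run(): Unit = post(ResubmitFailedStages) },
      delayMs,
      TimeUnit.MILLISECONDS
    )

    // Events posted directly by the calling thread.
    post(TaskCompleted(1))
    post(TaskCompleted(2))

    done.await(5, TimeUnit.SECONDS)
    messageScheduler.shutdown()
    loop.interrupt()
    processed.toVector
  }
}
```

Calling `DelayedPostSketch.demo(50L)` returns three entries whose processing thread is always `event-loop`, while the posting thread of `ResubmitFailedStages` differs from the one that posted the `TaskCompleted` events. The relative order of the entries is deliberately not asserted, since it depends on the delay and on thread scheduling.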