ahshahid commented on code in PR #50757: URL: https://github.com/apache/spark/pull/50757#discussion_r2070564298
##########
core/src/main/scala/org/apache/spark/rdd/RDD.scala:
##########

Review Comment:
   > @ahshahid Regarding point 3 I was open for your change and asked you to extend your integration test with some extra logs to prove it is testing a specific case where I have seen problems. But that got stuck there: [attilapiros#8 (comment)](https://github.com/attilapiros/spark/pull/8#issuecomment-2797246777)

   Let me re-read your comments to understand what you are hinting at, but I don't see how that PR, which specifically exposes the race condition, is relevant to this PR. The race condition that the other PR mentions will not even arise if the stage always reports determinacy as true.

   > Regarding point 5: I see a lot of value in the `inDeterministic` flag, as currently we cannot distinguish whether a shuffle map stage is indeterministic because of its parent or on its own. Let's say `ShuffleMapStageX` is indeterministic because of its operation, and along the way to a result stage there is another `ShuffleMapStageY` which is only indeterministic because it is a descendant of `ShuffleMapStageX`. If the result stage is fetching from `ShuffleMapStageY` when the fetch failure happens, we still have a deterministic output, so even if the result stage is half ready we can continue our work without reverting its output. (In addition, later it would make sense to extend the RDD API to let a user set it when they are using an indeterminate operation in the map/flatMap body.)

   In my PR, a shuffle map stage is considered indeterministic if and only if it uses an indeterminate value as its partitioner. In the above example, stage Y is considered indeterministic only if it uses the indeterminate output originally created in `ShuffleMapStageX` as its partitioner, and that is the right behaviour. If stage Y does not use an indeterminate expression as its partitioner, then no outputs of the result stage need to be discarded. So the problem you mention does not exist.
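
   To make the rule above concrete, here is a minimal toy sketch. `ToyStage`, `partitionsOnIndeterminateValue`, `ToyDeterminism` and `mustDiscardResultOutput` are hypothetical names used for illustration only; they are not Spark classes and not the code in this PR. The sketch assumes that indeterminacy is decided purely by the stage's own partitioner, as described in this comment.

```scala
// Toy model only: these types are hypothetical and do not exist in Spark.
final case class ToyStage(
    name: String,
    // True when the stage's shuffle partitioner is computed from an indeterminate value.
    partitionsOnIndeterminateValue: Boolean,
    parents: Seq[ToyStage] = Nil)

object ToyDeterminism {
  // Rule as described in this comment: only the stage's own partitioner matters,
  // not whether some ancestor stage is indeterministic.
  def isIndeterministic(stage: ToyStage): Boolean =
    stage.partitionsOnIndeterminateValue

  // On a fetch failure, partial result-stage output is discarded only when
  // the stage being fetched from is indeterministic under the rule above.
  def mustDiscardResultOutput(fetchedFrom: ToyStage): Boolean =
    isIndeterministic(fetchedFrom)
}

object ToyExample {
  // Stage X partitions on an indeterminate value.
  val stageX: ToyStage = ToyStage("X", partitionsOnIndeterminateValue = true)

  // Stage Y descends from X but partitions on a deterministic expression.
  val stageYStable: ToyStage =
    ToyStage("Y", partitionsOnIndeterminateValue = false, parents = Seq(stageX))

  // Variant of Y that partitions on the indeterminate value produced in X.
  val stageYUnstable: ToyStage =
    ToyStage("Y", partitionsOnIndeterminateValue = true, parents = Seq(stageX))
}
```

   In this toy model, a fetch failure against `stageYStable` does not force the result stage output to be discarded even though its parent X is indeterministic, while a fetch failure against `stageYUnstable` does.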