ahshahid commented on PR #50029: URL: https://github.com/apache/spark/pull/50029#issuecomment-2691810968
@mridulm @squito, I am unsure what you mean by marking the RDD as indeterministic without modifying the RDD code.

1) There is no concrete field in the RDD that marks it as indeterministic (the root RDD is always considered deterministic).
2) In the existing code, a function in RDD, overridden in MapPartitionsRDD, determines whether the RDD is deterministic. That code relies only on the RDD's dependencies and the RDDs those dependencies contain.
3) Nowhere does it take into account the indeterministic nature of the RDD's PartitionEvaluator.

Consider the test "SPARK-51016: ShuffleMapStage using indeterministic join keys should be INDETERMINATE" in the newly added file ShuffleMapStageTest. In this case, the ShuffleMapStage contains a ShuffleDependency and the corresponding RDD (a MapPartitionsRDD), and the MapPartitionsRDD's dependency is a ParallelCollectionRDD. So in this chain, the knowledge that the partition evaluator is indeterministic is either embedded in the lambda passed to MapPartitionsRDD or present in the ShuffleDependency code (which is modified to store that information).

For the RDD to be marked as indeterministic, what else do you have in mind? Augmenting the Partitioner interface?
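
To make points 2) and 3) concrete, here is a minimal, self-contained sketch of lineage-based determinism propagation. It is a toy model and not Spark's code: `Level`, `Node`, `Root`, `MapPartitionsNode`, and `Demo` are hypothetical names, and the rule "worst parent wins, a root is assumed determinate" is a simplification of the behaviour described above.

```scala
// Simplified stand-in for Spark's DeterministicLevel (hypothetical, not the real class).
object Level extends Enumeration {
  val Determinate, Unordered, Indeterminate = Value
}

// Toy RDD node: only the lineage is modelled, no data is computed.
sealed trait Node {
  def parents: Seq[Node]

  // The level is derived purely from the parents' levels.
  def outputLevel: Level.Value =
    if (parents.isEmpty) Level.Determinate        // a root is assumed determinate
    else parents.map(_.outputLevel).maxBy(_.id)   // the worst parent wins
}

final case class Root(name: String) extends Node {
  val parents: Seq[Node] = Nil
}

// `f` is opaque: nothing in the lineage records whether it is deterministic.
final case class MapPartitionsNode(parent: Node, f: Iterator[Int] => Iterator[Int])
    extends Node {
  val parents: Seq[Node] = Seq(parent)
}

object Demo {
  val source: Node = Root("parallelCollection")

  // A function that yields different keys on every evaluation.
  private val rng = new scala.util.Random()
  val randomKeys: Node = MapPartitionsNode(source, it => it.map(_ => rng.nextInt()))

  def main(args: Array[String]): Unit = {
    // Evaluates to Level.Determinate even though `f` is random,
    // because only the parents are consulted.
    println(randomKeys.outputLevel)
  }
}
```

In this model the only ways to surface the indeterminism are to store extra information on the node itself or on whatever consumes it, which is the choice the questions above are asking about.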