ahshahid commented on PR #50029:
URL: https://github.com/apache/spark/pull/50029#issuecomment-2691810968

   @mridulm  @squito ,
   I am unsure what you mean by marking the RDD as indeterministic without 
modifying the RDD code.
   1) There is no concrete field in the RDD that marks it as indeterministic 
(a root RDD is always assumed to be deterministic).
   2) In the existing code, there is a function in RDD, overridden in 
MapPartitionsRDD, that determines whether the RDD is deterministic. That 
code relies on the Dependency objects and the RDDs they contain.
   3) Nowhere does it take into account the indeterministic nature of the 
RDD's PartitionEvaluator.
   
   Consider the test "SPARK-51016: ShuffleMapStage using indeterministic join 
keys should be INDETERMINATE" in the newly added file ShuffleMapStageTest.
   In this case, the ShuffleMapStage contains a ShuffleDependency and the 
corresponding RDD (a MapPartitionsRDD), and the MapPartitionsRDD's dependency 
is a ParallelCollectionRDD.
   So in this setup, the knowledge that the partition evaluator is 
indeterministic is either embedded in the lambda passed to MapPartitionsRDD or 
present in the ShuffleDependency code (which is modified to store that data).
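
   To make the gap concrete, here is a rough sketch of a lineage-only 
determinism check. It is a toy model, not Spark code: `DeterminismSketch`, 
`Level`, `Node`, `Root`, `Shuffled`, `MapPartitions` and `randomKeys` are all 
hypothetical names, and the shuffle rule is deliberately simplified. It only 
illustrates that a level derived purely from dependencies never looks at the 
function an RDD applies.

```scala
// Toy model only: these types are hypothetical and merely mimic the shape of
// a lineage-based determinism check; they are not Spark classes.
object DeterminismSketch {
  // Declared in increasing order of "badness": a larger id is less deterministic.
  object Level extends Enumeration {
    val Determinate, Unordered, Indeterminate = Value
  }

  trait Node {
    def parents: Seq[Node]
    // Default rule: a node without parents is assumed determinate;
    // otherwise it inherits the worst level found among its parents.
    def level: Level.Value =
      if (parents.isEmpty) Level.Determinate
      else parents.map(_.level).maxBy(_.id)
  }

  // Stands in for a root such as ParallelCollectionRDD.
  final case class Root(data: Seq[Int]) extends Node {
    val parents: Seq[Node] = Nil
  }

  // Simplified stand-in for shuffle output with no fixed row order.
  final case class Shuffled(parent: Node) extends Node {
    val parents: Seq[Node] = Seq(parent)
    override def level: Level.Value =
      if (parent.level == Level.Indeterminate) Level.Indeterminate
      else Level.Unordered
  }

  // Stands in for MapPartitionsRDD. The function `f` is opaque to the
  // level computation: nothing in `level` inspects it.
  final case class MapPartitions(
      parent: Node,
      f: Iterator[Int] => Iterator[Int],
      isOrderSensitive: Boolean = false) extends Node {
    val parents: Seq[Node] = Seq(parent)
    override def level: Level.Value =
      if (isOrderSensitive && parent.level == Level.Unordered) Level.Indeterminate
      else super.level
  }

  // A function whose output differs from run to run.
  val randomKeys: Iterator[Int] => Iterator[Int] =
    it => it.map(_ => scala.util.Random.nextInt())

  // Root -> MapPartitions(randomKeys), mirroring the lineage in the test.
  val mapped: MapPartitions = MapPartitions(Root(1 to 10), randomKeys)
}
```

   In this model `DeterminismSketch.mapped.level` evaluates to `Determinate` 
even though `randomKeys` is not deterministic, because the only inputs to the 
check are the parents and the `isOrderSensitive` flag.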
   
   For the RDD to be marked as indeterministic, what else do you have in mind? 
Augmenting the Partitioner interface?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

