ahshahid commented on PR #50757: URL: https://github.com/apache/spark/pull/50757#issuecomment-2847859035
@attilapiros : I get what you are pointing out in the RDD code... it's the committers' call. My view is:

1) An indeterministic expression is something that is not predictable, and no order of any form should be expected from it, whether it is read once or multiple times.
2) The only requirement from the Spark side should be that if a Partitioner uses that indeterministic component, then no rows should be lost or added during a retry.

If we stick to the above two requirements, then:

1) The shuffle stage (or any stage) should just consult the RDDs it has, to see whether each one is deterministic or not.
2) An RDD's determinism should be based purely on the nature of its Partitioner. If the Partitioner used by the RDD relies on an indeterministic expression, then the RDD should be marked as INDETERMINATE (and as of now, AFAIK, that is possible only in the SQL layer's RDDs, where the Partitioner has that information...).
3) In the case of core, from what I have understood, the problem is related to the round-robin partitioning logic. That is ideally a separate issue and should not have been tied to indeterminism. AFAIK, if the round-robin issue is not mixed with indeterminacy, there is no way an RDD can have indeterminacy set to true (nor is it needed). And I might be off the mark, but if the round-robin partitioning issue needs to be resolved by piggybacking on the indeterminacy logic, then maybe the check could be as small as marking any RDD with a RoundRobin Partitioner as INDETERMINATE, since what is needed in that case is a retry of all partitions.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
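For what it's worth, Spark's `RDD` already exposes `getOutputDeterministicLevel` and a `DeterministicLevel` enumeration (DETERMINATE / UNORDERED / INDETERMINATE); the rule argued for above can be sketched as a small standalone model. The `Partitioner` variants and `levelFor` function below are hypothetical names for illustration, not Spark classes:

```scala
// A minimal, self-contained sketch (NOT Spark code) of the proposed rule:
// an RDD's determinism is derived solely from the nature of its Partitioner.
object DeterminismSketch {
  sealed trait DeterministicLevel
  case object Determinate extends DeterministicLevel
  case object Indeterminate extends DeterministicLevel

  sealed trait Partitioner
  // Hash of deterministic keys: same row always lands in the same partition.
  case object HashPartitioner extends Partitioner
  // Round-robin: placement depends on row position, which can change on retry.
  case object RoundRobinPartitioner extends Partitioner
  // SQL-layer partitioner that may reference an indeterministic expression.
  case class ExprPartitioner(usesIndeterministicExpr: Boolean) extends Partitioner

  // The proposed check: mark the RDD INDETERMINATE iff its partitioner is
  // round-robin or uses an indeterministic expression; otherwise DETERMINATE.
  def levelFor(p: Partitioner): DeterministicLevel = p match {
    case RoundRobinPartitioner  => Indeterminate
    case ExprPartitioner(true)  => Indeterminate
    case _                      => Determinate
  }

  def main(args: Array[String]): Unit = {
    assert(levelFor(HashPartitioner) == Determinate)
    assert(levelFor(RoundRobinPartitioner) == Indeterminate)
    assert(levelFor(ExprPartitioner(usesIndeterministicExpr = true)) == Indeterminate)
    assert(levelFor(ExprPartitioner(usesIndeterministicExpr = false)) == Determinate)
    println("ok")
  }
}
```

An INDETERMINATE level would then cause the scheduler to retry all partitions of the stage on failure, which is exactly the behavior the round-robin case needs.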
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org