[ https://issues.apache.org/jira/browse/SPARK-38137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakub Leś updated SPARK-38137:
------------------------------
    Component/s: Shuffle

> Repartition+Shuffle+ non deterministic function leads to bad results
> --------------------------------------------------------------------
>
>                 Key: SPARK-38137
>                 URL: https://issues.apache.org/jira/browse/SPARK-38137
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.1.1, 3.2.1
>            Reporter: Jakub Leś
>            Priority: Major
>
> Hi,
> Using a non-deterministic function (like rand) in repartition leads to
> incorrect results when tasks are retried.
>
> Reproduce (correct result):
>
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> import org.apache.spark.sql.functions.rand
>
> val res = spark.range(0, 100 * 100, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   // On the first attempt of partitions 0 and 1, kill the Java processes
>   // to force a failure and a retry.
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count() {code}
> The result is correct: 10000
>
> Reproduce (bad result):
>
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> import org.apache.spark.sql.functions.rand
>
> val res = spark.range(0, 100 * 100, 1).repartition(200).map { x =>
>   x
> }.repartition(10, Array(rand):_*).map { x =>
>   // Same forced failure as above; only the repartition key differs.
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count() {code}
> The result is wrong: 9396 (expected 10000)
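
One plausible way to read the two reproductions is that a retried map stage re-evaluates rand and routes rows differently than the first attempt did, so reducers that read the recomputed output see a different split than reducers that kept the original one. The sketch below models that in plain Scala without Spark. It is a simplified illustration under that assumption, not Spark's actual shuffle code; `ShuffleRetrySketch`, `mapOutput`, and `run` are hypothetical names, and a per-attempt seed stands in for rand being re-evaluated.

```scala
import scala.util.Random

object ShuffleRetrySketch {
  // Map side of a shuffle: route every row to one of `numPartitions` reducers.
  // A different seed per attempt stands in for rand() being re-evaluated on retry.
  def mapOutput(rows: Seq[Int], numPartitions: Int, seed: Long): Map[Int, Seq[Int]] = {
    val rng = new Random(seed)
    rows.groupBy(_ => rng.nextInt(numPartitions))
  }

  // Returns the distinct rows seen by all reducers when reducers 0 and 1 fail
  // on their first attempt and read from a recomputed map output instead.
  def run(rows: Seq[Int], numPartitions: Int, randomKey: Boolean): Set[Int] = {
    def route(attempt: Int): Map[Int, Seq[Int]] =
      if (randomKey) mapOutput(rows, numPartitions, attempt.toLong)
      else rows.groupBy(_ % numPartitions) // deterministic: same routing on every attempt

    val firstAttempt = route(0)
    val retry = route(1)

    (0 until numPartitions).flatMap { p =>
      // Reducers 0 and 1 read the recomputed output; the rest keep the original.
      val source = if (p < 2) retry else firstAttempt
      source.getOrElse(p, Seq.empty[Int])
    }.toSet
  }
}
```

With `randomKey = false` every row reaches exactly one reducer regardless of the retry. With `randomKey = true` a row that was first routed to reducer 0 or 1 can be routed elsewhere on the retry and is then read by nobody, so the distinct count drops, which mirrors the symptom reported above. The exact number lost in this model does not correspond to the 9396 in the report.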