Re: Correctness bug on Shuffle+Repartition scenario

2021-01-18 Thread 王长春
Hi Shiao-An Yuan I also found this correctness problem in my production environment. My spark version is 2.3.1。 I thought it was because Spark-23243 before . But you said You also have this problem in your environment , and your version is 2.4.4 which had solved spark-23243. So Maybe this problem

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi, I am using Spark 2.4.4 standalone mode. On Mon, Jan 18, 2021 at 4:26 AM Sean Owen wrote: > Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using? > > On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan > wrote: > >> Hi folks, >> >> I finally found the root cause of this issue

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Mich Talebzadeh
Hi Shiao-An, With regard to your set-up below and I quote: "The input/output files are parquet on GCS. The Spark version is 2.4.4 with standalone deployment. Workers running on GCP preemptible instances and they being preempted very frequently." Am I correct that you have foregone deploying Data

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Gourav Sengupta
Hi, I may be wrong, but this looks like a massively complicated solution for what could have been a simple SQL. It always seems okay to be to first reduce the complexity and then solve it, rather than solve a problem which should not even exist in the first instance. Regards, Gourav On Sun, Jan

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Sean Owen
Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using? On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan wrote: > Hi folks, > > I finally found the root cause of this issue. > It can be easily reproduced by the following code. > We ran it on a standalone mode 4 cores * 4 instanc

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi folks, I finally found the root cause of this issue. It can be easily reproduced by the following code. We ran it on a standalone mode 4 cores * 4 instances (total 16 cores) environment. ``` import org.apache.spark.TaskContext import scala.sys.process._ import org.apache.spark.sql.functions._

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
I don't think this addresses my comment at all. Please try correctly implementing equals and hashCode for your key class first. On Tue, Dec 29, 2020 at 8:31 PM Shiao-An Yuan wrote: > Hi Sean, > > Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary > Key" and I do "reduce by

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
Hi Sean, Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary Key" and I do "reduce by key" on this column, so the "amount of rows" should always equal to the "cardinality of pkey". When I said data get duplicated & lost, I mean duplicated "pkey" exists in the output file (aft

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
Total guess here, but your key is a case class. It does define hashCode and equals for you, but, you have an array as one of the members. Array equality is by reference, so, two arrays of the same elements are not equal. You may have to define hashCode and equals manually to make them correct. On

Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
Hi folks, We recently identified a data correctness issue in our pipeline. The data processing flow is as follows: 1. read the current snapshot (provide empty if it doesn't exist yet) 2. read unprocessed new data 3. union them and do a `reduceByKey` operation 4. output a new version of the snapsh