Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
I don't think this addresses my comment at all. Please try correctly implementing equals and hashCode for your key class first. On Tue, Dec 29, 2020 at 8:31 PM Shiao-An Yuan wrote: > Hi Sean, > > Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary > Key" and I do "reduce by

About

2020-12-29 Thread LInda hackkanan
  check  it out   https://backbutton.co.uk/about.html     Regards not Linda   - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
Hi Sean, Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary Key" and I do "reduce by key" on this column, so the "amount of rows" should always equal to the "cardinality of pkey". When I said data get duplicated & lost, I mean duplicated "pkey" exists in the output file (aft

[Spark Streaming] Why is ZooKeeper LeaderElection Agent not being called by Spark Master?

2020-12-29 Thread Saloni Mehta
Hello, Request you to please help me out on the below queries: I have 2 spark masters and 3 zookeepers deployed on my system on separate virtual machines. The services come up online in the below sequence: 1. zookeeper-1 2. sparkmaster-1 3. sparkmaster-2 4. zookeeper-2 5. zookeepe

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
Total guess here, but your key is a case class. It does define hashCode and equals for you, but, you have an array as one of the members. Array equality is by reference, so, two arrays of the same elements are not equal. You may have to define hashCode and equals manually to make them correct. On

Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
Hi folks, We recently identified a data correctness issue in our pipeline. The data processing flow is as follows: 1. read the current snapshot (provide empty if it doesn't exist yet) 2. read unprocessed new data 3. union them and do a `reduceByKey` operation 4. output a new version of the snapsh