Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi,

I am using Spark 2.4.4 standalone mode.

On Mon, Jan 18, 2021 at 4:26 AM Sean Owen wrote:
> Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using?
>
> On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan wrote:
>
>> Hi folks,

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
... Therefore, the first stage and the retry stage might have different distributions, causing duplication and loss.

Thanks,
Shiao-An Yuan

On Tue, Dec 29, 2020 at 10:00 PM Shiao-An Yuan wrote:
> Hi folks,
>
> We recently identified a data correctness issue in our pipeline.
>
> The
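
The retry scenario described in this message can be illustrated with a small simulation. This is only a sketch in plain Python, not Spark code: it assumes records are assigned to output partitions purely by their position in the input (round-robin style), and that a retried map task may see the same records in a different order. The record values, the two-partition setup, and the helper name `round_robin` are all hypothetical.

```python
def round_robin(records, num_partitions):
    """Assign each record to a partition based only on its input position."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


records = ["a", "b", "c", "d"]

# First attempt of the map task sees the records in one order.
first_attempt = round_robin(records, 2)

# Assumption: the retry sees the same records in a different order,
# for example because upstream shuffle blocks arrived differently.
retry_attempt = round_robin(["b", "a", "c", "d"], 2)

# Reducer 0 fetched its block before the executor was lost;
# reducer 1 fetches its block from the retried task.
output = first_attempt[0] + retry_attempt[1]

duplicated = sorted({r for r in output if output.count(r) > 1})
missing = sorted(set(records) - set(output))

print("output:", output)
print("duplicated:", duplicated)
print("missing:", missing)
```

In this toy run one record ends up in both partitions and another in neither, which matches the symptom reported in the thread: duplicated keys and missing keys after a stage is partially retried.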

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
... & lost, I mean duplicated "pkey" values exist in the output file (after "reduce by key") and some "pkey" values are missing. Since it only happens when executors are being preempted, I believe this is a bug (nondeterministic shuffle) that SPARK-23207 is trying to solve.

Thanks,
Shiao-An Yuan

Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
... Set, I believe it is unrelated to SPARK-24243. Can anyone give me some advice about the following tasks?

Thanks in advance.
Shiao-An Yuan