Hi Sean,

To finish the job, I did need to set spark.stage.maxConsecutiveAttempts to a
large number, e.g., 100, per a suggestion from Jiang Xingbo.
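
For reference, a sketch of how that setting can be applied (assuming the repro
below is run from spark-shell; 100 is just the "large number" mentioned above,
not a tuned value):

  // Pass it at launch: spark-shell --conf spark.stage.maxConsecutiveAttempts=100
  // or set it on the session builder, before any SparkContext exists:
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder()
    .config("spark.stage.maxConsecutiveAttempts", "100")
    .getOrCreate()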

I haven't seen any recent movement/PRs on this issue, but I'll see if we can 
repro with a more recent version of Spark. 

Best regards,
Tyson

-----Original Message-----
From: Sean Owen <sro...@gmail.com> 
Sent: Friday, August 9, 2019 7:49 AM
To: tcon...@gmail.com
Cc: dev <dev@spark.apache.org>
Subject: Re: [SPARK-23207] Repro

Interesting, but I'd put this on the JIRA, and also test vs. master first. It's
entirely possible this is something else that was subsequently fixed, and maybe
even backported for 2.4.4.
(I can't quite reproduce it - it just makes the second job fail, which is also
puzzling.)

On Fri, Aug 9, 2019 at 8:11 AM <tcon...@gmail.com> wrote:
>
> Hi,
>
>
>
> We are able to reproduce this bug in Spark 2.4 using the following program:
>
>
>
> import scala.sys.process._
> import org.apache.spark.TaskContext
>
> val res = spark.range(0, 10000 * 10000, 1).map { x => (x % 1000, x) }.repartition(20)
>
> res.distinct.count
>
>
>
> // kill an executor in the stage that performs repartition(239)
>
> val df = res.repartition(113).cache.repartition(239).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
>
> df.distinct.count()
>
>
>
> The first count (res.distinct.count) correctly produces 100000000.
>
> The second count (df.distinct.count) incorrectly produces 99999769 (231 rows short).
>
>
>
> If the cache step is removed, then the bug does not reproduce.
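>
> For comparison, a minimal sketch of that uncached variant (the same program
> with only the .cache call removed; df2 is just a new name used here):
>
> val df2 = res.repartition(113).repartition(239).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
>
> df2.distinct.count() // without cache, this returns the expected 100000000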
>
>
>
> Best regards,
>
> Tyson
>
>


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
