Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
You raised: "One of the offerings from the service we use is EBS migration which basically means if a host is about to get evicted, a new host is created and the EBS volume is attached to it. When Spark assigns a new executor to the newly created instance, it basically can recover all the shuffle files that are already persisted
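A related built-in mechanism worth noting: since Spark 3.1, graceful decommissioning can migrate shuffle blocks off a node before it is reclaimed. This is a sketch of the relevant settings only, not a description of the EBS re-attach setup discussed above:

```
# spark-defaults.conf — graceful decommissioning (Spark 3.1+): before a
# node goes away, migrate its shuffle and cached RDD blocks to
# surviving executors.
spark.decommission.enabled                          true
spark.storage.decommission.enabled                  true
spark.storage.decommission.shuffleBlocks.enabled    true
spark.storage.decommission.rddBlocks.enabled        true
```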

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
heavily. > > Have you looked at why you are having these shuffles? What is the cause of > > these large transformations ending up in shuffle > > > > Also on your point: > > "..then ideally we should expect that when an executor is killed/OOM'd > >

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
expect that when an executor is killed/OOM'd > and a new executor is spawned on the same host, the new executor registers > the shuffle files to itself. Is that so?" > > What guarantee is that the new executor with inherited shuffle files will > succeed? > > Als

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
"...when an executor is killed/OOM'd and a new executor is spawned on the same host, the new executor registers the shuffle files to itself. Is that so?" What guarantee is there that the new executor with inherited shuffle files will succeed? Also, OOM is often associated with some form of skewed data

[spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Faiz Halde
...attached to it. When Spark assigns a new executor to the newly created instance, it basically can recover all the shuffle files that are already persisted in the migrated EBS volume. Is this how it works? Do executors recover / re-register the shuffle files that they found? So far I have not come across

Re: [EXTERNAL] spark re-use shuffle files not happening

2022-07-16 Thread Koert Kuipers
ok thanks. guess i am simply misremembering that i saw the shuffle files getting re-used across jobs (actions). it was probably across stages for the same job. in structured streaming this is a pretty big deal. if you join a streaming dataframe with a large static dataframe each microbatch

Re: [EXTERNAL] spark re-use shuffle files not happening

2022-07-16 Thread Shay Elbaz
Spark can reuse shuffle stages in the same job (action), not cross jobs. From: Koert Kuipers Sent: Saturday, July 16, 2022 6:43 PM To: user Subject: [EXTERNAL] spark re-use shuffle files not happening

spark re-use shuffle files not happening

2022-07-16 Thread Koert Kuipers
i have seen many jobs where spark re-uses shuffle files (and skips a stage of a job), which is an awesome feature given how expensive shuffles are, and i generally now assume this will happen. however i feel like i am going a little crazy today. i did the simplest test in spark 3.3.0, basically i

Re: Spark with External Shuffle Service - using saved shuffle files in the event of executor failure

2021-05-12 Thread Attila Zsolt Piros
Hello, I have answered it on Stack Overflow. Best Regards, Attila On Wed, May 12, 2021 at 4:57 PM Chris Thomas wrote: > Hi, > > I am pretty confident I have observed Spark configured with the Shuffle > Service continuing to fetch shuffle files on a node in the event of > e

Spark with External Shuffle Service - using saved shuffle files in the event of executor failure

2021-05-12 Thread Chris Thomas
Hi, I am pretty confident I have observed Spark configured with the Shuffle Service continuing to fetch shuffle files on a node in the event of executor failure, rather than recompute the shuffle files as happens without the Shuffle Service. Can anyone confirm this? (I have a SO question
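For reference, the behavior Chris describes depends on the external shuffle service serving files on behalf of dead executors. A minimal configuration sketch (on YARN, the spark_shuffle auxiliary service must also be configured on each NodeManager):

```
# spark-defaults.conf — keep shuffle files readable even after the
# executor that wrote them has exited; a per-node shuffle service
# serves them instead of the executor.
spark.shuffle.service.enabled   true
# Dynamic allocation relies on the shuffle service (optional otherwise).
spark.dynamicAllocation.enabled true
```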

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread Holden Karau
You can also look at the shuffle file cleanup tricks we do inside of the ALS algorithm in Spark. On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote: > have you looked at > http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html > > and the post mentioned

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread vijay.bvp
have you looked at http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html and the post mentioned there https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html also try compressing the output https://spark.apache.o

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread naresh Goud
Got it. I understood the issue in a different way. On Thu, Feb 22, 2018 at 9:19 PM Keith Chapman wrote: > My issue is that there is not enough pressure on GC, hence GC is not > kicking in fast enough to delete the shuffle files of previous iterations. > > Regards, > Keith.

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
My issue is that there is not enough pressure on GC, hence GC is not kicking in fast enough to delete the shuffle files of previous iterations. Regards, Keith. http://keith-chapman.com On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud wrote: > It would be very difficult to tell without know
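Shuffle-file cleanup is driven by weak references on the driver, so a mostly-idle driver heap delays it, exactly as Keith observes. One commonly used knob (available since Spark 2.0) forces a periodic driver GC; the interval below is an arbitrary example:

```
# spark-defaults.conf — trigger a driver-side System.gc() every
# 10 minutes (default is 30min) so the ContextCleaner can detect
# unreferenced shuffles and delete their files sooner.
spark.cleaner.periodicGC.interval   10min
```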

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread naresh Goud
ation that I run out of disk space > on the /tmp directory. On further investigation I was able to figure out > that the reason for this is that the shuffle files are still around, > because I have a very large heap, GC has not happened and hence the shuffle > files are not deleted. I was a

Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
...space on the /tmp directory. On further investigation I was able to figure out that the reason for this is that the shuffle files are still around, because I have a very large heap, GC has not happened and hence the shuffle files are not deleted. I was able to confirm this by lowering the heap size and I s

Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
When the RDD using them goes out of scope. On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar wrote: > Thanks Mark! follow up question, do you know when shuffle files are > usually un-referenced? > > On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra > wrote: > >> Shuffl

Re: Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
Thanks Mark! Follow-up question: do you know when shuffle files are usually un-referenced? On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra wrote: > Shuffle files are cleaned when they are no longer referenced. See > https://github.com/apache/spark/blob/master/core/src/main/scala/o

Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
Shuffle files are cleaned when they are no longer referenced. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar < ashan...@netflix.com.invalid> wrote: > Hi! > > In spark on

Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
Hi! In spark on yarn, when are shuffle files on local disk removed? (Is it when the app completes or once all the shuffle files are fetched or end of the stage?) Thanks, Ashwin

Fwd: Connection failure followed by bad shuffle files during shuffle

2016-03-15 Thread Eric Martin
Hi, I'm running into consistent failures during a shuffle read while trying to do a group-by followed by a count aggregation (using the DataFrame API on Spark 1.5.2). The shuffle read (in stage 1) fails with org.apache.spark.shuffle.FetchFailedException: Failed to send RPC 7719188499899260109 to

Re: FileNotFoundException in appcache shuffle files

2015-12-10 Thread Jiří Syrový
y some google > search, but in vain.. > any idea? > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/FileNotFoundException-in-appcache-shuffle-files-tp17605p25663.html > Sent from the Apache Spark User Lis

RE: FileNotFoundException in appcache shuffle files

2015-12-10 Thread kendal
-shuffle-files-tp17605p25663.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Checkpointing not removing shuffle files from local disk

2015-12-03 Thread Ewan Higgs
Hi all, We are running a class with Pyspark notebooks for data analysis. Some of the notebooks are fairly long and have a lot of operations. Through the course of a notebook, the shuffle storage expands considerably and often exceeds quota (e.g. 1.5GB input expands to 24GB in shuffle files). Closing

Checkpointing not removing shuffle files from local disk

2015-09-29 Thread ramibatal
pus)), the algorithm starts to create shuffle files on my disk to the point that it fills up until there is no space left. I am using spark-submit to run my program as follows: spark-submit --driver-memory 14G --class com.heystaks.spark.ml.topicmodelling.LDAExample ./target/scala-2.10/lda-assembly-1.

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
used by seeing skipped stages in the job UI. They are > periodically cleaned up based on available space of the configured > spark.local.dirs paths. > > From: Thomas Gerber > Date: Monday, June 29, 2015 at 10:12 PM > To: user > Subject: Shuffle files lifecycle > > Hello

Re: Shuffle files lifecycle

2015-06-29 Thread Silvio Fiorito
spark.local.dirs paths. From: Thomas Gerber Date: Monday, June 29, 2015 at 10:12 PM To: user Subject: Shuffle files lifecycle Hello, It is my understanding that shuffle are written on disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
act > as checkpoints. > > I wonder if this is true only within a job, or across jobs. Please note > that I use the words job and stage carefully here. > > 1. can a shuffle created during JobN be used to skip many stages from > JobN+1? Or is the lifecycle of the shuffle files bound

Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
...JobN+1? Or is the lifecycle of the shuffle files bound to the job that created them? 2. when are shuffle files actually deleted? Is it TTL based or is it cleaned when the job is over? 3. we have a very long batch application, and as it goes on, the number of total tasks for each job gets larger and

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-24 Thread N B
Hi TD, That little experiment helped a bit. This time we did not see any exceptions for about 16 hours but eventually it did throw the same exceptions as before. The cleaning of the shuffle files also stopped much before these exceptions happened - about 7-1/2 hours after startup. I am not quite

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-24 Thread N B
...streaming application? Was it falling behind > with a large increasing scheduling delay? > > TD > > On Thu, Apr 23, 2015 at 11:31 AM, N B wrote: > >> Thanks for the response, Conor. I tried with those settings and for a >> while it seemed like it was cleaning up shuffle files after

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread Tathagata Das
What was the state of your streaming application? Was it falling behind with a large increasing scheduling delay? TD On Thu, Apr 23, 2015 at 11:31 AM, N B wrote: > Thanks for the response, Conor. I tried with those settings and for a > while it seemed like it was cleaning up shuffle

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread N B
Thanks for the response, Conor. I tried with those settings and for a while it seemed like it was cleaning up shuffle files after itself. However, exactly 5 hours later it started throwing exceptions and eventually stopped working again. A sample stack trace is below. What is curious about 5

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-21 Thread Conor Fennell
Hi, We set the spark.cleaner.ttl to some reasonable time and also set spark.streaming.unpersist=true. Those together cleaned up the shuffle files for us. -Conor On Tue, Apr 21, 2015 at 8:18 AM, N B wrote: > We already do have a cron job in place to clean just the shuffle files. > H
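As they would appear in spark-defaults.conf, the two settings Conor mentions (the TTL value is an example; note spark.cleaner.ttl is a blunt time-based cleaner and was removed in Spark 2.0, so this applies to 1.x only):

```
# Spark 1.x only: periodically drop metadata and blocks older than the
# TTL (seconds), including old shuffle data.
spark.cleaner.ttl           3600
# Spark Streaming: automatically unpersist RDDs generated per batch.
spark.streaming.unpersist   true
```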

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-21 Thread N B
We already do have a cron job in place to clean just the shuffle files. However, what I would really like to know is whether there is a "proper" way of telling spark to clean up these files once it's done with them? Thanks NB On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele wrote:

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-20 Thread Jeetendra Gangele
find ... -name "spark-*-*-*" -prune -exec rm -rf {} \+ On 20 April 2015 at 23:12, N B wrote: > Hi all, > > I had posed this query as part of a different thread but did not get a > response there. So creating a new thread hoping to catch someone's > attention. > > We are exper
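A self-contained sketch of the cron-style cleanup Jeetendra suggests, run here against a throwaway directory so nothing real is deleted; in a real crontab you would point it at spark.local.dir (often /tmp) and likely add an age filter such as -mtime +1:

```shell
# Create a scratch dir with one fake Spark temp dir and one bystander.
scratch=$(mktemp -d)
mkdir -p "$scratch/spark-local-20150420-0001" "$scratch/unrelated"

# Remove only top-level entries matching the Spark temp-dir pattern.
find "$scratch" -mindepth 1 -maxdepth 1 -name "spark-*" -prune -exec rm -rf {} +

ls "$scratch"   # only "unrelated" remains
```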

Shuffle files not cleaned up (Spark 1.2.1)

2015-04-20 Thread N B
Hi all, I had posed this query as part of a different thread but did not get a response there. So creating a new thread hoping to catch someone's attention. We are experiencing this issue of shuffle files being left behind and not being cleaned up by Spark. Since this is a Spark stre

Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
> lost in the UI). If I don't coalesce, I pretty immediately get Java heap >>> space exceptions that kill the job altogether. >>> >>> Putting in the timeouts didn't seem to help the case where I am >>> coalescing. Also, I don't see any difference

Re: Missing shuffle files

2015-02-24 Thread Anders Arpteg
pretty immediately get Java heap >> space exceptions that kill the job altogether. >> >> Putting in the timeouts didn't seem to help the case where I am >> coalescing. Also, I don't see any differences between 'disk only' and >> 'memory and disk

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
it's >> looking better... >> >> On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet wrote: >> >>> I'm looking @ my yarn container logs for some of the executors which >>> appear to be failing (with the missing shuffle files). I see exceptions >>> that

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
don't see any differences between 'disk only' and 'memory and disk' storage levels- both of them are having the same problems. I notice large shuffle files (30-40gb) that only seem to spill a few hundred mb. On Mon, Feb 23, 2015 at 4:28 PM, Anders Arpteg wrote: > Sounds ver

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
ing better... On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet wrote: > I'm looking @ my yarn container logs for some of the executors which > appear to be failing (with the missing shuffle files). I see exceptions > that say "client.TransportClientFactory: Found inactive connection

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say "client.TransportClientFactory: Found inactive connection to host/ip:port, closing it." Right after that I see "shuffle.

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
over 1.3TB of memory >> allocated for the application. I was thinking perhaps it was possible that >> a single executor was getting a single or a couple large partitions but >> shouldn't the disk persistence kick in at that point? >> >> On Sat, Feb 21, 2015 at 11:20

Re: Missing shuffle files

2015-02-22 Thread Sameer Farooqui
cation. I was thinking perhaps it was possible that > a single executor was getting a single or a couple large partitions but > shouldn't the disk persistence kick in at that point? > > On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg > wrote: > >> For large jobs, the follo

Re: Missing shuffle files

2015-02-21 Thread Petar Zecevic
Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg wrote: For large jobs, the following error message is shown that seems to indicate that shuffle files for some reason are missing. It's a rather large job with many partitions. If the data size

Re: Missing shuffle files

2015-02-21 Thread Corey Nolet
...perhaps it was possible that a single executor was getting a single or a couple large partitions but shouldn't the disk persistence kick in at that point? On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg wrote: > For large jobs, the following error message is shown that seems to > indicate th

Missing shuffle files

2015-02-21 Thread Anders Arpteg
For large jobs, the following error message is shown that seems to indicate that shuffle files for some reason are missing. It's a rather large job with many partitions. If the data size is reduced, the problem disappears. I'm running a build from Spark master post 1.2 (build at 2015-

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread Aaron Davidson
r-Errno-2-No-such-file-or-directory-tmp-spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da-tmp3i3xno-td21076.html > > Thank you, > Lucio > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/FileNotFoundException-in-appcache-shuffle-files

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread lucio raimondo
Thank you, Lucio -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FileNotFoundException-in-appcache-shuffle-files-tp17605p21077.html

RE: FileNotFoundException in appcache shuffle files

2014-10-29 Thread Ganelin, Ilya
ser/201410.mbox/ -Original Message- From: Shao, Saisai [saisai.s...@intel.com] Sent: Wednesday, October 29, 2014 01:46 AM Eastern Standard Time To: Ryan Williams Cc: user Subject: RE: FileNotFoundException in appcache shuffle files Hi Ryan, This is an issue from so

Re: FileNotFoundException in appcache shuffle files

2014-10-28 Thread Shaocun Tian
h7kf8/adam.108?dl=0 > [2] > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCANGvG8qtK57frWS+kaqTiUZ9jSLs5qJKXXjXTTQ9eh2-GsrmpA@...%3E > <http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3ccangvg8qtk57frws+kaqtiuz9jsls5qjkxxjxttq9eh2-gsr...@mail.gma

RE: FileNotFoundException in appcache shuffle files

2014-10-28 Thread Shao, Saisai
Thanks, Jerry From: nobigdealst...@gmail.com [mailto:nobigdealst...@gmail.com] On Behalf Of Ryan Williams Sent: Wednesday, October 29, 2014 1:31 PM To: user Subject: FileNotFoundException in appcache shuffle files My job is failing with the following error: 14/10/29 02:59:14 WARN

FileNotFoundException in appcache shuffle files

2014-10-28 Thread Ryan Williams
My job is failing with the following error: 14/10/29 02:59:14 WARN scheduler.TaskSetManager: Lost task 1543.0 in stage 3.0 (TID 6266, demeter-csmau08-19.demeter.hpc.mssm.edu): java.io.FileNotFoundException: /data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-lo

RE: Shuffle files

2014-10-20 Thread Shao, Saisai
Cc: Sunny Khatri; Lisonbee, Todd; u...@spark.incubator.apache.org Subject: Re: Shuffle files My observation is opposite. When my job runs under default spark.shuffle.manager, I don't see this exception. However, when it runs with SORT based, I start seeing this error? How would that be pos

Re: Shuffle files

2014-10-20 Thread Chen Song
1560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html >>> >>> Thanks, >>> >>> Todd >>> >>> -Original Message- >>> From: SK [mailto:skrishna...@gmail.com] >>> Sent: Tuesday, October 7, 2014 2:12

Re: Shuffle files

2014-10-07 Thread Andrew Ash
>> Thanks, >> >> Todd >> >> -Original Message- >> From: SK [mailto:skrishna...@gmail.com] >> Sent: Tuesday, October 7, 2014 2:12 PM >> To: u...@spark.incubator.apache.org >> Subject: Re: Shuffle files >> >> - We set ulimit to 50. But I

Re: Shuffle files

2014-10-07 Thread Sunny Khatri
files-quot-exception-on-reduceByKey-td2462.html > > Thanks, > > Todd > > -Original Message- > From: SK [mailto:skrishna...@gmail.com] > Sent: Tuesday, October 7, 2014 2:12 PM > To: u...@spark.incubator.apache.org > Subject: Re: Shuffle files > > - We set ulimit to

RE: Shuffle files

2014-10-07 Thread Lisonbee, Todd
es-quot-exception-on-reduceByKey-td2462.html Thanks, Todd -Original Message- From: SK [mailto:skrishna...@gmail.com] Sent: Tuesday, October 7, 2014 2:12 PM To: u...@spark.incubator.apache.org Subject: Re: Shuffle files - We set ulimit to 50. But I still get the same "too many o

Storing shuffle files on a Tachyon

2014-10-07 Thread Soumya Simanta
Is it possible to store spark shuffle files on Tachyon?

Re: Shuffle files

2014-10-07 Thread SK
...context: http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-files-tp15185p15869.html

Re: Shuffle files

2014-09-25 Thread Andrew Ash
Hi SK, For the problem with lots of shuffle files and the "too many open files" exception there are a couple options: 1. The linux kernel has a limit on the number of open files at once. This is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or /etc/sysct
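To make Andrew's first point concrete, a small shell sketch; the limits.conf lines are example values, and on most distributions persistent per-user limits live in /etc/security/limits.conf rather than sysctl.conf:

```shell
# Show the current soft limit on open file descriptors for this shell.
ulimit -n

# Raise the soft limit as far as the hard limit allows (shell-local).
ulimit -S -n "$(ulimit -H -n)"
ulimit -n

# For a permanent change, add lines like these (example values) to
# /etc/security/limits.conf and log in again:
#   sparkuser  soft  nofile  65536
#   sparkuser  hard  nofile  65536
```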

Shuffle files

2014-09-25 Thread SK
...FileNotFoundException: /tmp/spark-local-20140925215712-0319/12/shuffle_0_99_93138 (Too many open files) basically I think a lot of shuffle files are being created. 1) The tasks eventually fail and the job just hangs (after taking very long, more than an hour). If I read these 30 files in a for loop, th

Re: Spilled shuffle files not being cleared

2014-06-13 Thread Michael Chang
ks > > Saisai > > > > *From:* Michael Chang [mailto:m...@tellapart.com] > *Sent:* Friday, June 13, 2014 10:15 AM > *To:* user@spark.apache.org > *Subject:* Re: Spilled shuffle files not being cleared > > > > Bump > > > > On Mon, Jun 9, 2014 at 3:22 PM, Mich

RE: Spilled shuffle files not being cleared

2014-06-12 Thread Shao, Saisai
spark.cleaner.referenceTracking), and it is enabled by default. Thanks Saisai From: Michael Chang [mailto:m...@tellapart.com] Sent: Friday, June 13, 2014 10:15 AM To: user@spark.apache.org Subject: Re: Spilled shuffle files not being cleared Bump On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang mailto:m

Re: Spilled shuffle files not being cleared

2014-06-12 Thread Michael Chang
Bump On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang wrote: > Hi all, > > I'm seeing exceptions that look like the below in Spark 0.9.1. It looks > like I'm running out of inodes on my machines (I have around 300k each in a > 12 machine cluster). I took a quick look and I'm seeing some shuffle

Spilled shuffle files not being cleared

2014-06-09 Thread Michael Chang
Hi all, I'm seeing exceptions that look like the below in Spark 0.9.1. It looks like I'm running out of inodes on my machines (I have around 300k each in a 12 machine cluster). I took a quick look and I'm seeing some shuffle spill files that are around even around 12 minutes after they are creat

Re: Shuffle Files

2014-03-04 Thread Aniket Mokashi
PM, Usman Ghani wrote: > Where on the filesystem does spark write the shuffle files? > -- "...:::Aniket:::... Quetzalco@tl"

Shuffle Files

2014-03-03 Thread Usman Ghani
Where on the filesystem does spark write the shuffle files?
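For reference, shuffle output lands under the executor's configured local directories. A sketch of the relevant setting (paths are examples; on YARN, yarn.nodemanager.local-dirs takes precedence over this value):

```
# spark-defaults.conf — shuffle and spill files are written beneath
# these directories (default: /tmp), in per-application subdirs such as
#   /mnt/disk1/spark/spark-<uuid>/blockmgr-<uuid>/...
spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark
```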