> "One of the offerings from the service we use is EBS migration, which
> basically means if a host is about to get evicted, a new host is created
> and the EBS volume is attached to it. When Spark assigns a new executor
> to the newly created instance, it basically can recover all the shuffle
> files that are already persisted in the migrated EBS volume."
> Have you looked at why you are having these shuffles? What is the cause
> of these large transformations ending up in shuffle?
>
> Also on your point:
> "...then ideally we should expect that when an executor is killed/OOM'd
> and a new executor is spawned on the same host, the new executor registers
> the shuffle files to itself. Is that so?"
>
> What guarantee is there that the new executor with inherited shuffle files
> will succeed?
>
> Also OOM is often associated with some form of skewed data.
"When Spark assigns a new executor to the newly created instance, it
basically can recover all the shuffle files that are already persisted in
the migrated EBS volume"
Is this how it works? Do executors recover / re-register the shuffle files
that they found?
So far I have not come across
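For reference, the settings most relevant to this question are the external
shuffle service and, in Spark 3.1+, block migration on decommission. A sketch
with illustrative values (whether these help depends on the cluster manager
and deployment):

    import org.apache.spark.SparkConf

    // Serve shuffle files from a per-node service so they outlive individual
    // executors; with decommissioning enabled, Spark 3.1+ can also migrate
    // shuffle blocks off a node that is about to go away.
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.decommission.enabled", "true")
      .set("spark.storage.decommission.enabled", "true")
      .set("spark.storage.decommission.shuffleBlocks.enabled", "true")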
ok thanks. guess i am simply misremembering that i saw the shuffle files
getting re-used across jobs (actions). it was probably across stages for
the same job.
in structured streaming this is a pretty big deal. if you join a streaming
dataframe with a large static dataframe each microbatch
Spark can reuse shuffle stages within the same job (action), not across jobs.
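For the stream-static join concern above, one workaround is to keep the static
side from being re-read every microbatch and, when it is small enough, to
broadcast it so the join needs no shuffle at all. A minimal sketch with an
illustrative path and a hypothetical "key" column (not code from this thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

    // Cache the static side once so each microbatch reuses it instead of
    // re-reading it; the path and the "key" column are hypothetical.
    val staticDf = spark.read.parquet("/data/static_table").cache()
    staticDf.count() // materialize the cache up front

    val streamDf = spark.readStream.format("rate").load()
      .withColumnRenamed("value", "key")

    // broadcast() is only a hint; if the static side is too large, Spark
    // still falls back to a shuffle-based join each microbatch.
    val joined = streamDf.join(broadcast(staticDf), Seq("key"))
    val query = joined.writeStream.format("console").start()
    query.awaitTermination()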
From: Koert Kuipers
Sent: Saturday, July 16, 2022 6:43 PM
To: user
Subject: [EXTERNAL] spark re-use shuffle files not happening
i have seen many jobs where spark re-uses shuffle files (and skips a stage
of a job), which is an awesome feature given how expensive shuffles are,
and i generally now assume this will happen.
however i feel like i am going a little crazy today. i did the simplest
test in spark 3.3.0, basically i
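The message is cut off here; a hypothetical reconstruction of that kind of test
(not the actual code from the thread) might look like this in Spark 3.3:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .master("local[4]").appName("shuffle-reuse-test").getOrCreate()

    val agg = spark.range(0, 1000000)
      .selectExpr("id % 100 as key", "id as value")
      .groupBy("key").count()

    agg.count() // job 1: the shuffle runs
    agg.count() // job 2: the DataFrame is re-planned, so the shuffle runs again (no skipped stage)

    // Persisting the result is one way to avoid recomputing across actions:
    agg.cache()
    agg.count() // materializes the cache
    agg.count() // answered from the cache; the shuffle stage is not re-executed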
Hello,
I have answered it on Stack Overflow.
Best Regards,
Attila
On Wed, May 12, 2021 at 4:57 PM Chris Thomas
wrote:
Hi,
I am pretty confident I have observed Spark configured with the Shuffle Service
continuing to fetch shuffle files on a node in the event of executor failure,
rather than recompute the shuffle files as happens without the Shuffle Service.
Can anyone confirm this?
(I have a SO question
You can also look at the shuffle file cleanup tricks we do inside of the
ALS algorithm in Spark.
On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote:
have you looked at
http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html
and the post mentioned there
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
also try compressing the output
https://spark.apache.o
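The last link is truncated; the shuffle compression settings being referred to
are roughly the following. Note that spark.shuffle.compress and
spark.shuffle.spill.compress already default to true in recent Spark releases,
so in practice this is mostly about picking the codec (values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")       // compress map output files
      .set("spark.shuffle.spill.compress", "true") // compress data spilled during shuffles
      .set("spark.io.compression.codec", "lz4")    // codec used for shuffle output and spills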
Got it. I had understood the issue in a different way.
On Thu, Feb 22, 2018 at 9:19 PM Keith Chapman
wrote:
My issue is that there is not enough pressure on GC, hence GC is not
kicking in fast enough to delete the shuffle files of previous iterations.
Regards,
Keith.
http://keith-chapman.com
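A workaround sketch for the situation described: since the ContextCleaner only
removes shuffle files after the corresponding RDD and shuffle objects are
garbage-collected on the driver, an iterative job with a huge driver heap can
drop references and occasionally force a GC (illustrative code, not from this
thread). Newer releases also expose spark.cleaner.periodicGC.interval, which
triggers such a GC periodically.

    import org.apache.spark.SparkContext

    def runIterations(sc: SparkContext): Unit = {
      var current = sc.parallelize(0 until 1000000).map(i => (i % 1000, 1L))
      for (iter <- 1 to 50) {
        val next = current.reduceByKey(_ + _) // each iteration creates a new shuffle
        next.count()
        current = next // the previous iteration's RDD becomes unreachable here
        if (iter % 10 == 0) System.gc() // give ContextCleaner a chance to drop old shuffle files
      }
    }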
On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud
wrote:
> It would be very difficult to tell without knowing
I run out of disk space on the /tmp
directory. On further investigation I was able to figure out that the
reason for this is that the shuffle files are still around: because I have
a very large heap, GC has not happened and hence the shuffle files are not
deleted. I was able to confirm this by lowering the heap size.
When the RDD using them goes out of scope.
On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar
wrote:
Thanks Mark! follow up question, do you know when shuffle files are usually
un-referenced?
On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra
wrote:
Shuffle files are cleaned when they are no longer referenced. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala
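In other words, cleanup is driven by driver-side reachability. A small
illustration (the input path is hypothetical):

    import org.apache.spark.SparkContext

    def wordCountOnce(sc: SparkContext): Long = {
      val shuffled = sc.textFile("hdfs:///tmp/words") // hypothetical input path
        .flatMap(_.split("\\s+"))
        .map(w => (w, 1L))
        .reduceByKey(_ + _) // writes shuffle files
      shuffled.count()
    }
    // Once `shuffled` is unreachable and the driver GC collects it, the
    // ContextCleaner asynchronously removes its shuffle files on the executors.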
On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <
ashan...@netflix.com.invalid> wrote:
Hi!
In Spark on YARN, when are shuffle files on local disk removed? (Is it when
the app completes, once all the shuffle files are fetched, or at the end of
the stage?)
Thanks,
Ashwin
Hi,
I'm running into consistent failures during a shuffle read while trying to
do a group-by followed by a count aggregation (using the DataFrame API on
Spark 1.5.2).
The shuffle read (in stage 1) fails with
org.apache.spark.shuffle.FetchFailedException: Failed to send RPC
7719188499899260109 to
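The trace is cut off above. Without knowing the root cause, the knobs commonly
tuned for this kind of shuffle-fetch RPC failure on the 1.5 line are roughly
these (illustrative values, not a guaranteed fix):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.network.timeout", "600s")        // more headroom for slow or GC-bound executors
      .set("spark.shuffle.io.maxRetries", "10")    // retry transient fetch failures
      .set("spark.shuffle.io.retryWait", "30s")    // back off between retries
      .set("spark.sql.shuffle.partitions", "2000") // smaller reduce partitions ease memory pressure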
> some google search, but in vain.
> any idea?
Hi all,
We are running a class with a PySpark notebook for data analysis. Some of
the notebooks are fairly long and have a lot of operations. Through the
course of the notebook, the shuffle storage expands considerably and
often exceeds quota (e.g. 1.5GB input expands to 24GB in shuffle
files). Closing
the algorithm starts to create shuffle files
on my disk, to the point that it fills up until there is no space left.
I am using spark-submit to run my program as follows:
spark-submit --driver-memory 14G --class
com.heystaks.spark.ml.topicmodelling.LDAExample
./target/scala-2.10/lda-assembly-1.
> used; you can tell by seeing skipped stages in the job UI. They are
> periodically cleaned up based on available space of the configured
> spark.local.dirs paths.
From: Thomas Gerber
Date: Monday, June 29, 2015 at 10:12 PM
To: user
Subject: Shuffle files lifecycle
Hello,
It is my understanding that shuffles are written on disk and that they act
as checkpoints.
I wonder if this is true only within a job, or across jobs. Please note
that I use the words job and stage carefully here.
1. can a shuffle created during JobN be used to skip many stages from
JobN+1? Or is the lifecycle of the shuffle files bound to the job that
created them?
2. when are shuffle files actually deleted? Is it TTL based or is it
cleaned when the job is over?
3. we have a very long batch application, and as it goes on, the number of
total tasks for each job gets larger and
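For question 1, a sketch of the behavior the reply above describes (skipped
stages showing up in the UI), at the RDD level, where reuse is tied to the
shuffle dependency staying referenced rather than to a single job:

    import org.apache.spark.SparkContext

    def demo(sc: SparkContext): Unit = {
      val shuffled = sc.parallelize(0 until 1000000)
        .map(i => (i % 100, 1L))
        .reduceByKey(_ + _)

      shuffled.count() // JobN: runs the shuffle map stage
      shuffled.count() // JobN+1: the map stage shows as "skipped" in the UI, reusing the shuffle files
    }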
Hi TD,
That little experiment helped a bit. This time we did not see any
exceptions for about 16 hours but eventually it did throw the same
exceptions as before. The cleaning of the shuffle files also stopped much
before these exceptions happened - about 7-1/2 hours after startup.
I am not quite
What was the state of your streaming application? Was it falling behind
with a large increasing scheduling delay?
TD
On Thu, Apr 23, 2015 at 11:31 AM, N B wrote:
Thanks for the response, Conor. I tried with those settings and for a while
it seemed like it was cleaning up shuffle files after itself. However,
exactly 5 hours later it started throwing exceptions and eventually
stopped working again. A sample stack trace is below. What is curious about
5
Hi,
We set the spark.cleaner.ttl to some reasonable time and also
set spark.streaming.unpersist=true.
Those together cleaned up the shuffle files for us.
-Conor
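For reference, that combination amounts to something like the following
(values illustrative; spark.cleaner.ttl belongs to the 1.x line being
discussed and was removed in later releases):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cleaner.ttl", "3600")         // seconds; periodically forget metadata (and shuffles) older than this
      .set("spark.streaming.unpersist", "true") // let Spark Streaming unpersist generated RDDs automatically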
On Tue, Apr 21, 2015 at 8:18 AM, N B wrote:
We already do have a cron job in place to clean just the shuffle files.
However, what I would really like to know is whether there is a "proper"
way of telling Spark to clean up these files once it's done with them?
Thanks
NB
On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele
wrote:
ame "spark-*-*-*" -prune -exec rm -rf {} \+
On 20 April 2015 at 23:12, N B wrote:
Hi all,
I had posed this query as part of a different thread but did not get a
response there. So creating a new thread hoping to catch someone's
attention.
We are experiencing this issue of shuffle files being left behind and not
being cleaned up by Spark. Since this is a Spark stre
lost in the UI). If I don't coalesce, I pretty immediately get Java heap
space exceptions that kill the job altogether.
Putting in the timeouts didn't seem to help the case where I am
coalescing. Also, I don't see any differences between 'disk only' and
'memory and disk' storage levels - both of them are having the same
problems. I notice large shuffle files (30-40gb) that only seem to spill a
few hundred mb.
On Mon, Feb 23, 2015 at 4:28 PM, Anders Arpteg wrote:
> Sounds ver
it's looking better...
On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet wrote:
I'm looking @ my yarn container logs for some of the executors which appear
to be failing (with the missing shuffle files). I see exceptions that say
"client.TransportClientFactor: Found inactive connection to host/ip:port,
closing it."
Right after that I see "shuffle.
over 1.3TB of memory
allocated for the application. I was thinking perhaps it was possible that
a single executor was getting a single or a couple large partitions but
shouldn't the disk persistence kick in at that point?
On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg wrote:
For large jobs, the following error message is shown that seems to indicate
that shuffle files for some reason are missing. It's a rather large job
with many partitions. If the data size is reduced, the problem disappears.
I'm running a build from Spark master post 1.2 (build at 2015-
r-Errno-2-No-such-file-or-directory-tmp-spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da-tmp3i3xno-td21076.html
Thank you,
Lucio
ser/201410.mbox/
-Original Message-
From: Shao, Saisai [saisai.s...@intel.com]
Sent: Wednesday, October 29, 2014 01:46 AM Eastern Standard Time
To: Ryan Williams
Cc: user
Subject: RE: FileNotFoundException in appcache shuffle files
Hi Ryan,
This is an issue from so
h7kf8/adam.108?dl=0
> [2]
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCANGvG8qtK57frWS+kaqTiUZ9jSLs5qJKXXjXTTQ9eh2-GsrmpA@...%3E
.
Thanks
Jerry
From: nobigdealst...@gmail.com [mailto:nobigdealst...@gmail.com] On Behalf Of
Ryan Williams
Sent: Wednesday, October 29, 2014 1:31 PM
To: user
Subject: FileNotFoundException in appcache shuffle files
My job is failing with the following error:
14/10/29 02:59:14 WARN scheduler.TaskSetManager: Lost task 1543.0 in stage
3.0 (TID 6266, demeter-csmau08-19.demeter.hpc.mssm.edu):
java.io.FileNotFoundException:
/data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-lo
Cc: Sunny Khatri; Lisonbee, Todd; u...@spark.incubator.apache.org
Subject: Re: Shuffle files
My observation is the opposite. When my job runs under the default
spark.shuffle.manager, I don't see this exception. However, when it runs with
the SORT-based manager, I start seeing this error. How would that be possible?
http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
Thanks,
Todd
-Original Message-
From: SK [mailto:skrishna...@gmail.com]
Sent: Tuesday, October 7, 2014 2:12 PM
To: u...@spark.incubator.apache.org
Subject: Re: Shuffle files
- We set ulimit to 50. But I still get the same "too many o
Is it possible to store Spark shuffle files on Tachyon?
Hi SK,
For the problem with lots of shuffle files and the "too many open files"
exception there are a couple options:
1. The linux kernel has a limit on the number of open files at once. This
is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or
/etc/sysct
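The second option is cut off above. One guess at the usual Spark-side
complement in that era is to produce fewer shuffle files in the first place,
e.g. (an assumption, not the truncated text; settings from the 1.x line):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.consolidateFiles", "true") // hash shuffle: share output files across map tasks
      .set("spark.shuffle.manager", "sort")          // sort shuffle: one data file (plus index) per map task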
java.io.FileNotFoundException:
/tmp/spark-local-20140925215712-0319/12/shuffle_0_99_93138 (Too many open
files)
basically I think a lot of shuffle files are being created.
1) The tasks eventually fail and the job just hangs (after taking very long,
more than an hour). If I read these 30 files in a for loop, th
spark.cleaner.referenceTracking),
and it is enabled by default.
Thanks
Saisai
From: Michael Chang [mailto:m...@tellapart.com]
Sent: Friday, June 13, 2014 10:15 AM
To: user@spark.apache.org
Subject: Re: Spilled shuffle files not being cleared
Bump
On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang wrote:
Hi all,
I'm seeing exceptions that look like the below in Spark 0.9.1. It looks
like I'm running out of inodes on my machines (I have around 300k each in a
12 machine cluster). I took a quick look and I'm seeing some shuffle spill
files that are still around even 12 minutes after they are created.
PM, Usman Ghani wrote:
> Where on the filesystem does spark write the shuffle files?
>
--
"...:::Aniket:::... Quetzalco@tl"
Where on the filesystem does spark write the shuffle files?
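Roughly: under the scratch directories given by spark.local.dir (overridden by
the YARN or Mesos local dirs on those cluster managers), in per-application
spark-* / blockmgr-* subdirectories. A sketch (path illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp") // comma-separated scratch dirs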