-----Original Message-----
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Wednesday, March 25, 2015 7:40 PM
To: Saisai Shao; Kannan Rajah
Cc: dev@spark.apache.org
Subject: Re: Understanding shuffle file name conflicts

Hi Jerry & Josh,

It has been a while since I last looked into the Spark core shuffle code, so
I may be wrong here. But the shuffle ID is created along with the
ShuffleDependency, which is part of the RDD DAG. So if we submit multiple
jobs over the same RDD DAG, I think the shuffle IDs in these jobs should be
the same.
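
To make that point concrete, here is a minimal Scala sketch of how the ID
assignment works in Spark 1.x; the "Sketch" names are illustrative stand-ins,
but the mechanism mirrors SparkContext.newShuffleId() and the shuffleId field
of ShuffleDependency:

    import java.util.concurrent.atomic.AtomicInteger

    // Simplified stand-in for SparkContext: one counter hands out IDs.
    object SparkContextSketch {
      private val nextShuffleId = new AtomicInteger(0)
      def newShuffleId(): Int = nextShuffleId.getAndIncrement()
    }

    // Simplified stand-in for ShuffleDependency: the ID is fixed when the
    // dependency (part of the RDD DAG) is constructed, not when a job runs,
    // so jobs submitted over the same DAG all see the same shuffle IDs.
    class ShuffleDependencySketch {
      val shuffleId: Int = SparkContextSketch.newShuffleId()
    }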

DiskBlockManager doesn't need to know the app ID; all it needs to do is
create a folder with a unique name (UUID based) and then put all the shuffle
files into it.

You can see this in the DiskBlockManager code: it creates a bunch of unique
folders when initialized, and these folders are app specific.
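
A hedged sketch of that initialization step (simplified; the real logic lives
in DiskBlockManager.scala, and the "blockmgr-" prefix is an assumption here):

    import java.io.{File, IOException}
    import java.util.UUID

    // Create one uniquely named folder under a root local dir. The random
    // UUID in the name is what keeps concurrent applications that share
    // the same root directory from colliding.
    def createUniqueLocalDir(rootDir: String): File = {
      val dir = new File(rootDir, "blockmgr-" + UUID.randomUUID.toString)
      if (!dir.mkdirs() && !dir.isDirectory) {
        throw new IOException("Failed to create directory " + dir)
      }
      dir
    }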

Josh & Saisai,

When I say I am using a hardcoded location for shuffle files, I mean that I
am not using the DiskBlockManager.getFile API, because that uses the
directories created locally on the node. For my use case, I need to look at
creating those shuffle files on HDFS instead.

I will take a closer look …
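
For context, this is roughly what DiskBlockManager.getFile does in Spark 1.x
(a simplified sketch, not the actual implementation): the file's placement is
a pure function of its name and the node-local directory list, which is why
it cannot point at HDFS.

    import java.io.File

    def getFile(localDirs: Array[File], subDirsPerLocalDir: Int,
                filename: String): File = {
      val hash = filename.hashCode & Integer.MAX_VALUE  // non-negative hash
      val dirId = hash % localDirs.length               // pick a local root dir
      val subDirId = (hash / localDirs.length) % subDirsPerLocalDir // then a subdir
      new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
    }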

Yes, as Josh said, when an application is started, Spark creates a unique
application-wide folder for its temporary files. Jobs in this application
will have unique shuffle IDs with unique file names, so shuffle stages within
an app will not run into name conflicts. Also, shuffle files between
applications will not conflict, since each application has its own unique
folder.
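
The per-file uniqueness comes from the block naming scheme; here is a sketch
mirroring Spark's ShuffleBlockId (the case class name is illustrative):

    // Each shuffle output block is identified by the (shuffleId, mapId,
    // reduceId) triple, so names cannot collide within one application.
    case class ShuffleBlockIdSketch(shuffleId: Int, mapId: Int, reduceId: Int) {
      def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
    }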

Which version of Spark are you using? What do you mean when you say that you
used a hardcoded location for shuffle files?

If you look at the current DiskBlockManager code, it looks like it will
create a per-application subdirectory in each of the local root directories.
Here's the call that creates those directories …
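
Roughly, the effect is the following (a hypothetical sketch; the real
directory names and creation calls in DiskBlockManager differ, this only
illustrates the per-application layout described above):

    import java.io.File

    // One subdirectory per application inside every configured local root
    // dir, so two applications never write shuffle files to the same place.
    def createAppDirs(rootDirs: Seq[File], appDirName: String): Seq[File] =
      rootDirs.map { root =>
        val appDir = new File(root, appDirName)
        appDir.mkdirs()
        appDir
      }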

Saisai,

This is not the case when I use spark-submit to run 2 jobs, one after
another. The shuffle ID remains the same.
--
Kannan

On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao wrote:

> Hi Kannan,
>
> As I know, the shuffle ID in ShuffleDependency will be increased, so even
> if you run the same job twice, …

Hi Kannan,

As I know, the shuffle ID in ShuffleDependency will be increased, so even if
you run the same job twice, the shuffle dependency as well as the shuffle ID
will be different. The shuffle file name, which combines
(shuffleId + mapId + reduceId), will therefore change, so there is no name
conflict even if …
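
For example, in a single spark-shell session (an illustrative snippet; each
reduceByKey call builds a fresh ShuffleDependency, so the ID advances):

    val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x))
    pairs.reduceByKey(_ + _).count()  // first ShuffleDependency, shuffle ID 0
    pairs.reduceByKey(_ + _).count()  // second ShuffleDependency, shuffle ID 1

Note that the counter lives in the SparkContext and starts from zero in every
application, so two separate spark-submit runs of the same job will each
start at shuffle ID 0, which matches what Kannan observed.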