files. I suppose the
>> comment
>> is a little better in FileSegmentManagedBuffer:
>>
>>
>> https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62
>>
>> On Tue, Apr 14,
DiskStore.getBytes uses memory-mapped files if the length is more than a
configured limit. This code path is used during the map-side shuffle in
ExternalSorter. I want to know if it's possible for the length to exceed the
limit in the case of shuffle. The reason I ask is that in the case of Hadoop,
each map
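For reference, a minimal sketch of the kind of threshold check involved, assuming
the limit is the memory-map threshold setting (spark.storage.memoryMapThreshold);
the method name, the 2 MB default, and the error handling below are illustrative,
not the exact DiskStore internals:

import java.io.{File, IOException, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

// Illustrative sketch: segments smaller than the memory-map threshold are
// read into a regular heap buffer, larger ones are memory-mapped.
def readSegment(file: File, offset: Long, length: Long,
                memoryMapThreshold: Long = 2L * 1024 * 1024): ByteBuffer = {
  val channel = new RandomAccessFile(file, "r").getChannel
  try {
    if (length < memoryMapThreshold) {
      val buf = ByteBuffer.allocate(length.toInt)
      channel.position(offset)
      while (buf.hasRemaining) {
        if (channel.read(buf) == -1) {
          throw new IOException(s"Reached EOF before reading $length bytes")
        }
      }
      buf.flip()
      buf
    } else {
      channel.map(MapMode.READ_ONLY, offset, length)
    }
  } finally {
    channel.close()
  }
}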
>> Jerry
>>
>>
>>
>> -Original Message-
>> From: Cheng Lian [mailto:lian.cs@gmail.com]
>> Sent: Wednesday, March 25, 2015 7:40 PM
>> To: Saisai Shao; Kannan Rajah
>> Cc: dev@spark.apache.org
>> Subject: Re: Unders
not globally unique,
>> their full paths should be unique due to these unique per-application
>> subdirectories. Have you observed an instance where this isn't the case?
>>
>> - Josh
>>
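To illustrate why full paths stay unique even when file names repeat, here is a
rough sketch of DiskBlockManager-style path construction; the class name,
subdirectory count, and hashing scheme are simplifications for illustration, not
Spark's exact layout:

import java.io.File
import java.util.UUID

// Rough illustration: each executor gets its own randomly named root
// directory, and block files are hashed into a fixed number of
// subdirectories beneath it.
class SimpleDiskBlockManager(baseDir: File, numSubDirs: Int = 64) {
  // Unique per executor (and hence per application run) root directory.
  private val rootDir = new File(baseDir, s"blockmgr-${UUID.randomUUID()}")
  rootDir.mkdirs()

  def getFile(blockName: String): File = {
    val subDir =
      new File(rootDir, "%02x".format((blockName.hashCode & Int.MaxValue) % numSubDirs))
    subDir.mkdirs()
    new File(subDir, blockName)
  }
}

// Two runs create different blockmgr-* roots, so even identical block names
// such as "shuffle_0_0_0.data" resolve to different absolute paths.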
>> On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah
>> wrote:
you run the same job twice, the shuffle dependency as well as the shuffle id
> is different, so the shuffle file name, which is composed of
> (shuffleId+mapId+reduceId), will be different, so as far as I know there's
> no name conflict even in the same directory.
>
> Thanks
> Jerry
>
>
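A tiny sketch of the naming scheme Jerry describes; the case class below is a
simplified stand-in for Spark's shuffle block id, for illustration only:

// Simplified stand-in: the block name is built from the shuffle id, map id,
// and reduce id, so two jobs that register different shuffle ids can never
// produce the same block name.
case class ShuffleBlock(shuffleId: Int, mapId: Int, reduceId: Int) {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
}

// Example: a first job registers shuffleId = 0, a re-run registers
// shuffleId = 1, so identical (mapId, reduceId) pairs still yield
// distinct names:
// ShuffleBlock(0, 3, 7).name == "shuffle_0_3_7"
// ShuffleBlock(1, 3, 7).name == "shuffle_1_3_7"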
I am working on SPARK-1529. I ran into an issue with my change, where the
same shuffle file was being reused across 2 jobs. Please note this only
happens when I use a hard-coded location for the shuffle files, say
"/tmp". It does not happen with the normal code path that uses
DiskBlockManager to pick the directories.
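For contrast, a brief illustration of the failure mode with a single shared flat
directory; the path and block name are hypothetical:

import java.io.File

// Illustration only: with a hard-coded flat directory, the absolute path for
// a given block name is identical across runs, so a stale file left by an
// earlier job can be picked up by a later one.
val flatDirFile = new File("/tmp", "shuffle_0_3_7.data")  // same path every run
// Contrast with a per-run root such as /tmp/blockmgr-<uuid>/..., where the
// random root makes the absolute path differ between runs (see the sketch
// further up in the thread).
println(flatDirFile.getAbsolutePath)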
According to the Hive documentation, "sort by" is supposed to order the results
within each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
setting I am using to restrict the number of reducers is not having any
effect.
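For what it's worth, here is a hedged sketch of how I would exercise this from a
Spark 1.x HiveContext; the sc variable and the table name "src" are placeholders,
and the assumption that spark.sql.shuffle.partitions (rather than Hive's
mapred.reduce.tasks) governs the reducer count here is something to verify, not a
confirmed fix:

import org.apache.spark.sql.hive.HiveContext

// Hedged sketch (Spark 1.x HiveContext API). In Spark SQL the number of
// post-shuffle partitions is controlled by spark.sql.shuffle.partitions,
// so a Hive-style mapred.reduce.tasks setting may silently have no effect.
val hiveCtx = new HiveContext(sc)  // sc: an existing SparkContext
hiveCtx.sql("SET spark.sql.shuffle.partitions=1")

// SORT BY orders rows within each partition; with a single partition the
// result should come back fully sorted. ORDER BY gives a global sort
// regardless of the partition count.
val sorted = hiveCtx.sql("SELECT key, value FROM src SORT BY key")
sorted.collect().foreach(println)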
Is there a recommended performance test for sort-based shuffle, something
similar to terasort on Hadoop? I couldn't find one in the spark-perf code
base.
https://github.com/databricks/spark-perf
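In case it helps while a proper terasort-style test is missing, here is a minimal
sortByKey micro-benchmark sketch I would start from; the record count, partition
count, and key scheme are arbitrary placeholders, not tuned values:

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sort-based-shuffle exercise: generate random (key, value) pairs,
// sort by key, and force evaluation with count().
object SimpleSortBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleSortBenchmark"))
    val numRecords = 10000000
    val numPartitions = 200

    val data = sc.parallelize(0 until numRecords, numPartitions)
      .map(i => (Random.nextLong(), i))

    val start = System.nanoTime()
    val sortedCount = data.sortByKey(numPartitions = numPartitions).count()
    val elapsedSec = (System.nanoTime() - start) / 1e9
    println(s"Sorted $sortedCount records in $elapsedSec seconds")
    sc.stop()
  }
}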
--
Kannan