Re: Using memory mapped file for shuffle

2015-04-14 Thread Kannan Rajah
…files. I suppose the comment is a little better in FileSegmentManagedBuffer: https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62 On Tue, Apr 14, …

Using memory mapped file for shuffle

2015-04-13 Thread Kannan Rajah
DiskStore.getBytes uses memory-mapped files if the length is more than a configured limit. This code path is used during the map-side shuffle in ExternalSorter. I want to know if it's possible for the length to exceed the limit in the case of shuffle. The reason I ask is that in the case of Hadoop, each map…
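A minimal sketch of the mmap-vs-read decision the excerpt describes — this is not Spark's actual code, just an illustration assuming a size threshold in the spirit of `spark.storage.memoryMapThreshold`:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SegmentReader {
    // Hypothetical threshold, mirroring the idea of spark.storage.memoryMapThreshold.
    static final long MEMORY_MAP_THRESHOLD = 2L * 1024 * 1024;

    // Read [offset, offset + length) from a file: memory-map large segments,
    // do a plain heap read for small ones.
    static ByteBuffer readSegment(Path file, long offset, long length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            if (length >= MEMORY_MAP_THRESHOLD) {
                // Zero-copy: let the OS page the segment in on demand.
                return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            }
            // Small segment: a plain read avoids mmap setup/teardown overhead.
            ByteBuffer buf = ByteBuffer.allocate((int) length);
            ch.position(offset);
            while (buf.hasRemaining()) {
                if (ch.read(buf) == -1) break; // unexpected EOF
            }
            buf.flip();
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("seg", ".bin");
        Files.write(tmp, "hello-shuffle".getBytes());
        ByteBuffer b = readSegment(tmp, 6, 7); // bytes 6..12 of the file
        byte[] out = new byte[b.remaining()];
        b.get(out);
        System.out.println(new String(out)); // prints "shuffle"
        Files.delete(tmp);
    }
}
```

The question in the thread is whether a shuffle segment's `length` can ever cross that threshold, which would flip the code onto the mmap branch.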

Re: Understanding shuffle file name conflicts

2015-03-26 Thread Kannan Rajah
…Jerry -----Original Message----- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Wednesday, March 25, 2015 7:40 PM To: Saisai Shao; Kannan Rajah Cc: dev@spark.apache.org Subject: Re: Understanding shuffle file name conflicts …

Re: Understanding shuffle file name conflicts

2015-03-25 Thread Kannan Rajah
…not globally unique, their full paths should be unique due to these unique per-application subdirectories. Have you observed an instance where this isn't the case? - Josh On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah wrote: …
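Josh's point can be sketched as follows. This is a simplified illustration, not Spark's `DiskBlockManager` code: the block file name hashes into a subdirectory under an application-unique root, so two applications writing the same block name still get distinct full paths:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BlockPaths {
    // Sketch: hash the block file name into one of N subdirectories under an
    // application-unique local dir. Block names alone may repeat across apps,
    // but the full paths cannot collide because the roots differ.
    static Path blockPath(Path appLocalDir, String blockName, int subDirsPerDir) {
        int hash = blockName.hashCode() & Integer.MAX_VALUE; // force non-negative
        int subDir = hash % subDirsPerDir;
        return appLocalDir.resolve(String.format("%02x", subDir)).resolve(blockName);
    }

    public static void main(String[] args) {
        // Same block name, different per-application roots -> different full paths.
        Path a = blockPath(Paths.get("/tmp/spark-app-1"), "shuffle_0_3_1.data", 64);
        Path b = blockPath(Paths.get("/tmp/spark-app-2"), "shuffle_0_3_1.data", 64);
        System.out.println(a);
        System.out.println(b);
    }
}
```

Hard-coding a shared location like "/tmp" (as in the original experiment) bypasses exactly this per-application isolation, which is why the conflict appeared there.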

Re: Understanding shuffle file name conflicts

2015-03-24 Thread Kannan Rajah
…you run the same job twice, the shuffle dependency as well as the shuffle id is different, so the shuffle file name, which is composed of (shuffleId + mapId + reduceId), will change, so there's no name conflict even in the same directory, as far as I know. Thanks, Jerry …
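The naming scheme Jerry describes can be shown with a small sketch. In Spark, `ShuffleBlockId` renders names in the form `shuffle_<shuffleId>_<mapId>_<reduceId>`, so a fresh shuffle id alone is enough to change every file name between runs:

```java
public class ShuffleName {
    // Mirrors the "shuffle_<shuffleId>_<mapId>_<reduceId>" naming convention
    // of Spark's ShuffleBlockId.
    static String shuffleBlockName(int shuffleId, int mapId, int reduceId) {
        return "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId;
    }

    public static void main(String[] args) {
        // Two runs of the "same" job get different shuffle ids, so the names
        // differ even when the map and reduce partition ids are identical.
        System.out.println(shuffleBlockName(0, 3, 1)); // shuffle_0_3_1
        System.out.println(shuffleBlockName(1, 3, 1)); // shuffle_1_3_1
    }
}
```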

Understanding shuffle file name conflicts

2015-03-24 Thread Kannan Rajah
I am working on SPARK-1529. I ran into an issue with my change, where the same shuffle file was being reused across 2 jobs. Please note this only happens when I use a hard-coded location for shuffle files, say "/tmp". It does not happen with the normal code path that uses DiskBlockManager to pic…

Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

2015-02-18 Thread Kannan Rajah
According to the Hive documentation, "sort by" is supposed to order the results within each reducer. So if we set a single reducer, the results should be fully sorted, right? But this is not happening. Any idea why? It looks like the settings I am using to restrict the number of reducers are not having an effe…
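For reference, a sketch of the kind of session in question (the table and column names here are hypothetical, and `mapred.reduce.tasks` is the usual Hive setting for capping reducers — whether Spark-SQL 1.2.0 honors it is exactly what the thread is asking):

```sql
-- Force a single reducer; with one reducer, SORT BY should coincide
-- with a total ordering, the same as ORDER BY.
SET mapred.reduce.tasks = 1;
SELECT key, value FROM src SORT BY key;
```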

Performance test for sort shuffle

2015-02-02 Thread Kannan Rajah
Is there a recommended performance test for sort-based shuffle? Something similar to terasort on Hadoop. I couldn't find one in the spark-perf code base. https://github.com/databricks/spark-perf -- Kannan