Hi Kannan,

In both MapReduce and Spark, the amount of shuffle data a task produces can exceed the task's memory without risk of OOM.
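To make that concrete, here is a minimal, self-contained sketch of the spill-on-threshold pattern that lets shuffle output grow past the in-memory limit; the class, threshold, and serialization here are illustrative, not Spark's actual ExternalSorter code:

    // Sketch only: the in-memory buffer is bounded, but total output on disk is not.
    import java.io.{File, FileOutputStream, ObjectOutputStream}
    import scala.collection.mutable.ArrayBuffer

    class SpillingBuffer[T](maxInMemory: Int) {
      private val buffer = new ArrayBuffer[T]()
      private val spillFiles = new ArrayBuffer[File]()

      def insert(record: T): Unit = {
        buffer += record
        if (buffer.size >= maxInMemory) spill()   // keep memory bounded
      }

      private def spill(): Unit = {
        val file = File.createTempFile("spill", ".bin")
        val out = new ObjectOutputStream(new FileOutputStream(file))
        try buffer.foreach(out.writeObject) finally out.close()
        spillFiles += file                        // disk holds the overflow
        buffer.clear()
      }

      def numSpills: Int = spillFiles.size
    }

    object SpillDemo extends App {
      val sorter = new SpillingBuffer[String](maxInMemory = 1000)
      (1 to 10000).foreach(i => sorter.insert(s"record-$i"))
      println(s"spilled ${sorter.numSpills} times; never more than 1000 records in memory")
    }

Because each spill flushes the buffer to a file, the total data a task writes is bounded by disk, not by the task's memory.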
-Sandy

On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid <iras...@cloudera.com> wrote:

> That limit doesn't have anything to do with the amount of available
> memory. It's just a tuning parameter: one version is more efficient for
> smaller files, the other is better for bigger files. I suppose the comment
> is a little better in FileSegmentManagedBuffer:
>
> https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62
>
> On Tue, Apr 14, 2015 at 12:01 AM, Kannan Rajah <kra...@maprtech.com> wrote:
>
> > DiskStore.getBytes uses memory-mapped files if the length is more than a
> > configured limit. This code path is used during the map-side shuffle in
> > ExternalSorter. I want to know if it is possible for the length to exceed
> > the limit in the case of shuffle. The reason I ask is that in Hadoop,
> > each map task is supposed to produce only data that can fit within the
> > task's configured max memory; otherwise it will result in an OOM. Is the
> > behavior the same in Spark, or can the size of data generated by a map
> > task exceed what fits in memory?
> >
> > if (length < minMemoryMapBytes) {
> >   val buf = ByteBuffer.allocate(length.toInt)
> >   ....
> > } else {
> >   Some(channel.map(MapMode.READ_ONLY, offset, length))
> > }
> >
> > --
> > Kannan
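For reference, a self-contained sketch of the check quoted above: below the threshold the segment is copied into a heap ByteBuffer, above it the file is memory-mapped so the bytes are paged in lazily by the OS and never have to fit on the JVM heap at once. The helper name and default value are illustrative; minMemoryMapBytes stands in for the configured limit (spark.storage.memoryMapThreshold in Spark), and this is not the real DiskStore implementation.

    import java.io.RandomAccessFile
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel.MapMode

    def readSegment(path: String, offset: Long, length: Long,
                    minMemoryMapBytes: Long = 2L * 1024 * 1024): ByteBuffer = {
      val channel = new RandomAccessFile(path, "r").getChannel
      try {
        if (length < minMemoryMapBytes) {
          // Small segment: copy into a heap buffer; cheaper than an mmap syscall.
          val buf = ByteBuffer.allocate(length.toInt)
          channel.position(offset)
          while (buf.hasRemaining) {
            if (channel.read(buf) == -1) throw new java.io.EOFException()
          }
          buf.flip()
          buf
        } else {
          // Large segment: memory-map it; the mapping stays valid after the
          // channel is closed and is paged in on demand.
          channel.map(MapMode.READ_ONLY, offset, length)
        }
      } finally {
        channel.close()
      }
    }

So the branch is purely a performance trade-off between the fixed cost of mapping and the cost of copying, which is why the threshold is a tuning knob rather than a memory-safety limit.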