Sorry, I meant to say:
and my way to deal with OOMs is almost always simply to increase the number of
partitions. Maybe there is a better way that I am not aware of.
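
For example, something along these lines (the RDD name and the partition count
here are just placeholders, not taken from any real job):

    // increase parallelism before the memory-hungry shuffle so each task
    // handles a smaller slice of the data
    val counts = lines                          // lines: RDD[String] (placeholder)
      .repartition(2000)                        // placeholder partition count
      .map(line => (line.split("\t")(0), 1L))   // key on the first field
      .reduceByKey(_ + _, 2000)                 // also sets reduce-side partitions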

On Sat, Feb 13, 2016 at 11:38 PM, Koert Kuipers <ko...@tresata.com> wrote:

> That's right, it's the reduce operation that makes the in-memory assumption,
> not the map (although I am still skeptical that the map actually streams from
> disk to disk record by record).
>
> In reality, though, my experience is that if Spark cannot fit partitions in
> memory it does not work well; I get OOMs. And my to OOMs is almost always
> simply to increase the number of partitions. Maybe there is a better way that
> I am not aware of.
>
> On Sat, Feb 13, 2016 at 6:32 PM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>>
>> On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> in spark, every partition needs to fit in the memory available to the
>>> core processing it.
>>>
>>
>> That does not agree with my understanding of how it works. I think you could
>> run
>> sc.textFile("input").coalesce(1).map(_.replace("A", "B")).saveAsTextFile("output")
>> on a 1 TB local file and it would work fine. (I just tested this with a 3 GB
>> file and a 1 GB executor.)
>>
>> RDDs are mostly implemented using iterators. For example, map() operates
>> line by line, never pulling the whole partition into memory. coalesce() also
>> just concatenates the iterators of the parent partitions into a new iterator.
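>>
>> To make that concrete, here is a rough sketch of the idea (not the actual
>> Spark source, just its shape): the child iterator wraps the parent iterator,
>> so only one record is materialized at a time.
>>
>>   // roughly what map(f) amounts to for a single partition:
>>   def mapOnePartition[T, U](parent: Iterator[T], f: T => U): Iterator[U] =
>>     parent.map(f)  // Scala iterators are lazy, so nothing is buffered
>>
>>   // the same streaming behaviour, written explicitly with mapPartitions:
>>   val replaced = sc.textFile("input")
>>     .mapPartitions(it => it.map(_.replace("A", "B")))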
>>
>> Some operations, like reduceByKey(), need to have the whole contents of
>> the partition to work. But they typically use ExternalAppendOnlyMap, so
>> they spill to disk instead of filling up the memory.
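>>
>> So, for instance, an aggregation over input larger than the executor heap
>> should still finish, just more slowly because of the spilling (the path and
>> key choice below are only illustrative):
>>
>>   sc.textFile("big_input")
>>     .map(line => (line.take(8), 1L))   // key by an 8-character prefix
>>     .reduceByKey(_ + _)                // backed by ExternalAppendOnlyMap,
>>                                        // which spills to disk under pressure
>>     .saveAsTextFile("prefix_counts")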
>>
>> I know I'm not helping to answer Christopher's issue. Christopher, can
>> you perhaps give us an example that we can easily try in spark-shell to see
>> the same problem?
>>
>
>
