sorry i meant to say: and my way to deal with OOMs is almost always simply to increase the number of partitions. maybe there is a better way that i am not aware of.
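concretely, what i usually end up doing is something like this (the paths, the key/value parsing and the count of 2000 are made up, just to show the shape of it):

// rough sketch: ask reduceByKey for more (and therefore smaller) partitions,
// so the reduce-side state held per partition stays small. 2000 is arbitrary.
val pairs = sc.textFile("hdfs:///some/input").map { line =>
  val fields = line.split("\t")
  (fields(0), fields(1).toLong)
}
pairs
  .reduceByKey(_ + _, 2000)   // numPartitions for the shuffle output
  .saveAsTextFile("hdfs:///some/output")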
On Sat, Feb 13, 2016 at 11:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
> thats right, its the reduce operation that makes the in-memory assumption,
> not the map (although i am still suspicious that the map actually streams
> from disk to disk record by record).
>
> in reality though my experience is that is spark can not fit partitions in
> memory it doesnt work well. i get OOMs. and my to OOMs is almost always
> simply to increase number of partitions. maybe there is a better way that i
> am not aware of.
>
> On Sat, Feb 13, 2016 at 6:32 PM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>>
>> On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> in spark, every partition needs to fit in the memory available to the
>>> core processing it.
>>>
>>
>> That does not agree with my understanding of how it works. I think you
>> could do sc.textFile("input").coalesce(1).map(_.replace("A",
>> "B")).saveAsTextFile("output") on a 1 TB local file and it would work fine.
>> (I just tested this with a 3 GB file with a 1 GB executor.)
>>
>> RDDs are mostly implemented using iterators. For example map() operates
>> line-by-line, never pulling in the whole partition into memory. coalesce()
>> also just concatenates the iterators of the parent partitions into a new
>> iterator.
>>
>> Some operations, like reduceByKey(), need to have the whole contents of
>> the partition to work. But they typically use ExternalAppendOnlyMap, so
>> they spill to disk instead of filling up the memory.
>>
>> I know I'm not helping to answer Christopher's issue. Christopher, can
>> you perhaps give us an example that we can easily try in spark-shell to
>> see the same problem?
>>
>
>
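ps. to make daniel's iterator point a bit more concrete, this is roughly the shape of it (the "input"/"output" paths are from his example; the important part is that the closure only ever sees an Iterator, so nothing forces the whole partition into memory unless you call something like toList on it):

// map on the iterator stays lazy, so records stream through one by one.
val out = sc.textFile("input").mapPartitions { iter =>
  iter.map(_.replace("A", "B"))   // lazy, record by record
}
out.saveAsTextFile("output")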