On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers <ko...@tresata.com> wrote:
> in spark, every partition needs to fit in the memory available to the core
> processing it.

That does not agree with my understanding of how it works. I think you could do sc.textFile("input").coalesce(1).map(_.replace("A", "B")).saveAsTextFile("output") on a 1 TB local file and it would work fine. (I just tested this with a 3 GB file and a 1 GB executor.)

RDDs are mostly implemented using iterators. For example, map() operates line by line, never pulling the whole partition into memory. coalesce() also just concatenates the iterators of the parent partitions into a new iterator. Some operations, like reduceByKey(), do need the whole contents of the partition to do their work, but they typically use ExternalAppendOnlyMap, so they spill to disk instead of filling up memory.

I know I'm not helping to answer Christopher's issue. Christopher, can you perhaps give us an example that we can easily try in spark-shell to reproduce the same problem?
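To make the iterator point concrete, here is a rough sketch in plain Scala of the idea (this is not Spark's actual code, and the file names are just placeholders): mapping over an Iterator is lazy, so each line flows through one at a time and the whole file never sits in memory.

  import scala.io.Source
  import java.io.PrintWriter

  // Rough sketch of the iterator idea, NOT Spark's implementation.
  // getLines() returns a lazy Iterator[String]; map() on an iterator is also
  // lazy, so nothing is read or transformed until the iterator is consumed.
  val lines: Iterator[String] = Source.fromFile("input").getLines()   // placeholder file
  val mapped: Iterator[String] = lines.map(_.replace("A", "B"))       // still lazy

  // Consuming the iterator (like saveAsTextFile would) processes one line at a
  // time: read, transform, write, discard.
  val out = new PrintWriter("output")
  try mapped.foreach(out.println)
  finally out.close()

Spark's map()/coalesce() chain partition iterators in essentially this spirit, which is why partition size is not bounded by memory for these operations.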