Hi guys, I am just trying to parse out values from a CSV; everything is a numeric (Double) value, and the input text CSV file is about 1.3 GB in size.
When I inspect the Java heap space used by SparkSubmit with JVisualVM, I end up eating up 8 GB of memory! Moreover, by inspecting the BlockManager output, the original 1.3 GB input CSV file goes into an RDD[Array[Double]] that is now about 3.6 GB in size according to the BlockManager debug output. This means that just parsing the CSV of doubles into a very primitive container format for Spark almost triples the size of the input... what is going on here?

It gets worse, because the TOTAL heap space taken up is over 8 GB as I mentioned. From working through the Java heap dumps and inspecting them with Eclipse Memory Analyzer and JVisualVM, it looks like the parsed-out Double values end up taking up twice that amount of memory, meaning about 7.2 GB, which explains most of the Java heap usage.

My code is very simple; does anybody know why I am eating up so much memory?

  val input: RDD[Array[Double]] =
    sc.textFile(inFile).map(strLine => strLine.split(",").map(_.toDouble)).cache()
  val grouped: RDD[(Double, Iterable[Array[Double]])] = input.groupBy(_.last)

Thank you!
Aris
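
P.S. In case it helps anyone reproduce this, a minimal per-row check would be something like the sketch below. I'm assuming org.apache.spark.util.SizeEstimator is usable from user code for this kind of measurement, and the sample line here is made up; my real rows have many more columns.

  import org.apache.spark.util.SizeEstimator

  // Made-up sample line just to illustrate; real rows are much wider
  val line = "1.0,2.0,3.0,4.0,5.0"
  val parsed: Array[Double] = line.split(",").map(_.toDouble)

  // Compare the raw text size of the line against the estimated in-memory size
  // of the parsed Array[Double] (primitive values plus array object overhead)
  val textBytes = line.getBytes("UTF-8").length
  println(s"text bytes:      $textBytes")
  println(s"estimated bytes: ${SizeEstimator.estimate(parsed)}")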