Hi guys,

I am trying to just parse out values from a CSV; every value is a numeric
(Double), and the input CSV text file is about 1.3 GB in size.

When I inspect the Java heap space used by SparkSubmit with JVisualVM,
I end up eating up 8 GB of memory! Moreover, according to the BlockManager
debug output, the original 1.3 GB input CSV file turns into an
RDD[Array[Double]] that is about 3.6 GB in size.

This means that just parsing the CSV of doubles into a very primitive
container format for Spark almost triples the size of the input... what is
going on here? It gets worse: the TOTAL heap space taken up is over 8 GB,
as I mentioned.

From trying to work through the Java heap dump and inspecting it with
Eclipse Memory Analyzer and JVisualVM, it looks like the parsed-out Double
values end up taking up twice that amount of memory... meaning about 7.2 GB,
which explains most of the Java heap space.
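For reference, a rough way to reproduce the per-row blow-up outside of a
full Spark job is Spark's SizeEstimator; the 10-column row below is made up
purely for illustration (my real rows are different), so treat this as a
sketch rather than my actual data:

    import org.apache.spark.util.SizeEstimator

    // Hypothetical 10-column row, only for illustration
    val line = "1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.1"
    val parsed: Array[Double] = line.split(",").map(_.toDouble)

    // Compare the raw text size with the estimated JVM size of the parsed array
    val textBytes = line.getBytes("UTF-8").length
    val arrayBytes = SizeEstimator.estimate(parsed)
    println(s"text: $textBytes bytes, parsed array: $arrayBytes bytes")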

My code is very simple; does anybody know why I am eating up so much memory?

    import org.apache.spark.rdd.RDD

    // parse each comma-separated line into an Array[Double] and cache it
    val input: RDD[Array[Double]] = sc.textFile(inFile).map(strLine =>
        strLine.split(",").map(_.toDouble)
    ).cache()
    // group the rows by their last column
    val grouped: RDD[(Double, Iterable[Array[Double]])] =
        input.groupBy(_.last)
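
One thing I am wondering about (just a guess on my part, I haven't verified
that it helps) is whether caching the parsed RDD in serialized form, instead
of as one deserialized Array[Double] object per line, would cut the overhead;
something like:

    import org.apache.spark.storage.StorageLevel

    // same parsing as above, but kept in memory as serialized bytes
    val inputSer = sc.textFile(inFile)
      .map(_.split(",").map(_.toDouble))
      .persist(StorageLevel.MEMORY_ONLY_SER)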


Thank you!
Aris
