I am trying to parse a large number of large JSON files.

Initially, I was doing this:

sc.textFile(path).map(line => parseJson(line)).count()

For each file (800-900 MB), it took roughly 1 minute to finish.

I then changed the code to:

val rawData = sc.textFile(path)
rawData.cache()
rawData.count()  // first action: reads the file and populates the cache

rawData.map(line => parseJson(line)).count()  // parses the cached lines

With this version, the first count action takes about 2 seconds per file/task,
and the parsing takes another 2-4 seconds.

How can the time change so much, from 1 minute to 4-6 seconds?

The problem is that I do not have enough memory to cache everything. I am using
the Jackson JSON parser that ships with Spark.


Please share your advice on this.

Thank you!
