I am trying to parse a large number of big JSON files. At first, I was doing it like this:
    sc.textFile(path).map(line => parseJson(line)).count()

Each file (800-900 MB) took roughly 1 minute to finish. I then changed the code to:

    val rawData = sc.textFile(path)
    rawData.cache()
    rawData.count()
    rawData.map(line => parseJson(line)).count()

Now the first count action takes about 2 seconds per file/task, and the parsing takes another 2-4 seconds. How can the time change so much, from 1 minute to 4-6 seconds? The problem is that I do not have enough memory to cache everything. I am using the Jackson JSON parser that ships with Spark. Please share your advice on this. Thank you!
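For reference, here is a minimal self-contained sketch of the two variants with timing around each action, in case someone wants to reproduce the comparison. It is only a sketch under assumptions: the object name `ParseTiming` and the `timed` helper are made up for illustration, `parseJson` is a hypothetical stand-in for my helper that calls Jackson's `ObjectMapper.readTree`, I am assuming the `com.fasterxml.jackson` packages are the ones on the classpath, and the input path is passed as the first argument with the master supplied by `spark-submit`.

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.{SparkConf, SparkContext}

object ParseTiming {
  // Hypothetical stand-in for the parseJson helper from the post.
  // The mapper lives in an object, so each JVM creates it once
  // instead of once per record.
  private val mapper = new ObjectMapper()
  def parseJson(line: String): JsonNode = mapper.readTree(line)

  // Illustrative helper: runs a block and returns its result
  // together with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val path = args(0) // assumption: input path is the first argument
    val sc = new SparkContext(new SparkConf().setAppName("ParseTiming"))

    // Variant 1: read and parse in a single pass, nothing cached.
    val (n1, t1) = timed(sc.textFile(path).map(line => parseJson(line)).count())

    // Variant 2: cache the raw lines, materialize them with count(),
    // then parse from the cached RDD.
    val rawData = sc.textFile(path)
    rawData.cache()
    val (n2, t2) = timed(rawData.count())
    val (n3, t3) = timed(rawData.map(line => parseJson(line)).count())

    println(s"single pass:    $n1 records in $t1 ms")
    println(s"cache + count:  $n2 records in $t2 ms")
    println(s"parse (cached): $n3 records in $t3 ms")

    sc.stop()
  }
}
```

Note that the second variant runs after the first has already read the files once, so OS-level file caching may affect the numbers; running each variant in a separate job would give a fairer comparison.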
