On 18 Oct 2016, at 10:58, Chetan Khatri
<[email protected]> wrote:
Dear Xi Shen,
Thank you for getting back to my question.
The approach I am following is as below:
I have an MSSQL server as the enterprise data lake.
1. Run Java jobs that generate JSON files; every file is almost 6 GB.
Correct, Spark needs every JSON object on a separate line, so I ran
sed -e 's/}/}\n/g' -s old-file.json > new-file.json
to get every JSON element onto a separate line.
2. Uploaded the files to an S3 bucket and read them from there using the
sqlContext.read.json() function, which is where I get the above error.
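As an aside, a sed rewrite like the one above splits on every `}`, including braces that are nested or embedded in string values. A more robust alternative (a sketch, not what was actually run here) is to let a real JSON parser find the object boundaries, e.g. with Python's stdlib `json.JSONDecoder.raw_decode`:

```python
import json

def to_json_lines(blob: str) -> str:
    """Split a string of concatenated JSON objects into one object per line.

    Uses a real JSON parser, so a '}' nested inside an object or embedded
    in a string value does not cause a bogus split.
    """
    decoder = json.JSONDecoder()
    lines = []
    i, n = 0, len(blob)
    while i < n:
        # skip whitespace between objects
        while i < n and blob[i].isspace():
            i += 1
        if i >= n:
            break
        obj, end = decoder.raw_decode(blob, i)
        # re-serialise compactly so each object occupies exactly one line
        lines.append(json.dumps(obj, separators=(",", ":")))
        i = end
    return "\n".join(lines)

# two concatenated objects; the second contains a '}' inside a string
print(to_json_lines('{"a": {"b": 1}}{"c": "}"}'))
```

For multi-gigabyte files you would stream the input in chunks rather than load it all into memory, but the boundary-finding idea is the same.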
Note: if I run on small files whose JSON elements have almost the same
structure, I do not get this error.
Current approach:
* Splitting the large JSON (6 GB) into 1 GB chunks, then processing those.
Note: cluster size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.
I see what you are trying to do here: one JSON object per line, then splitting by
line so that you can parallelise JSON processing while holding many JSON
objects in a single S3 file. This is a devious little trick. It just doesn't
work once the JSON file grows past 2^31 bytes, because the code that splits by
line breaks.
You could write your own input splitter which actually does basic JSON parsing,
splitting by looking for the final } of a JSON clause (harder than you
think, as you need to track how many {} clauses you have entered and must not
count an escaped "{" inside a string).
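The brace-counting logic described above can be sketched like this (a minimal illustration of the idea, not the linked Hadoop InputFormat):

```python
def find_object_ends(data: str):
    """Yield the index just past the closing '}' of each top-level JSON object.

    Tracks brace depth, and ignores braces that appear inside string
    literals, including after backslash escapes -- the two pitfalls
    mentioned above.
    """
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(data):
        if in_string:
            if escaped:
                escaped = False          # char after a backslash: skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False        # unescaped quote closes the string
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield i + 1              # a top-level object just closed

# the '}' inside the string value must not terminate the first object
sample = '{"a": "close } brace"}{"b": 2}'
print(list(find_object_ends(sample)))
```

A real record reader would additionally have to cope with a split starting mid-object, which is where most of the complexity in the linked implementations lives.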
A quick Google search shows some existing implementations that may be a good starting point:
https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce