On 18 Oct 2016, at 10:58, Chetan Khatri
<[email protected]> wrote:
Dear Xi Shen,
Thank you for getting back to my question.
The approach I am following is as below:
I have an MSSQL server as the enterprise data lake.
1. Run Java jobs that generate JSON files; every file is almost 6 GB.
Correct, Spark needs every JSON object on a separate line, so I ran
sed -e 's/}/}\n/g' -s old-file.json > new-file.json
to get every JSON element onto a separate line.
2. Uploaded the files to an S3 bucket and read them from there using the
sqlContext.read.json() function, which is where I get the above error.
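As an aside, a sed rewrite like the one above splits on every `}`, including braces that are nested or embedded in string values. A more robust alternative (a sketch, not what was actually run here) is to let a real JSON parser find the object boundaries, e.g. with Python's stdlib `json.JSONDecoder.raw_decode`:

```python
import json

def to_json_lines(blob: str) -> str:
    """Split a string of concatenated JSON objects into one object per line.

    Uses a real JSON parser, so a '}' nested inside an object or embedded
    in a string value does not cause a bogus split.
    """
    decoder = json.JSONDecoder()
    lines = []
    i, n = 0, len(blob)
    while i < n:
        # skip whitespace between objects
        while i < n and blob[i].isspace():
            i += 1
        if i >= n:
            break
        obj, end = decoder.raw_decode(blob, i)
        # re-serialise compactly so each object occupies exactly one line
        lines.append(json.dumps(obj, separators=(",", ":")))
        i = end
    return "\n".join(lines)

# two concatenated objects; the second contains a '}' inside a string
print(to_json_lines('{"a": {"b": 1}}{"c": "}"}'))
```

For multi-gigabyte files you would stream the input in chunks rather than load it all into memory, but the boundary-finding idea is the same.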
Note: if I run on small files whose JSON elements have almost the same
structure, I do not get this error.
Current approach:
* Splitting the large JSON (6 GB) into 1 GB chunks, then processing those.
Note: cluster size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.
I see what you are trying to do here: one JSON object per line, then splitting by
line so that you can parallelise JSON processing while holding many JSON
objects in a single S3 file. This is a devious little trick. It just doesn't
work once the JSON file grows past 2^31 bytes, because the code that splits by
line breaks.
You could write your own input splitter which actually does basic JSON parsing,
splitting by looking for the final } of a JSON clause (harder than you
think, as you need to track how many {} clauses you have entered and must not
count an escaped "{" inside a string).
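The brace-counting logic described above can be sketched like this (a minimal illustration of the idea, not the linked Hadoop InputFormat):

```python
def find_object_ends(data: str):
    """Yield the index just past the closing '}' of each top-level JSON object.

    Tracks brace depth, and ignores braces that appear inside string
    literals, including after backslash escapes -- the two pitfalls
    mentioned above.
    """
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(data):
        if in_string:
            if escaped:
                escaped = False          # char after a backslash: skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False        # unescaped quote closes the string
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield i + 1              # a top-level object just closed

# the '}' inside the string value must not terminate the first object
sample = '{"a": "close } brace"}{"b": 2}'
print(list(find_object_ends(sample)))
```

A real record reader would additionally have to cope with a split starting mid-object, which is where most of the complexity in the linked implementations lives.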
A quick Google search shows some existing implementations that may be a good starting point:
https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce