This is not necessarily the case. If you look at the Hadoop FileInputFormat 
architecture, you can even split large multi-line JSON files without issues. 
I would need to have a look at it, but one large file does not mean one 
executor, independent of the underlying format.
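
As a fallback that does not depend on a custom input format, the file could 
be converted once into line-delimited JSON, which the Spark JSON data source 
can read and split as usual. Below is a minimal sketch of that conversion in 
plain Python, assuming the layout from the original question (one top-level 
object whose values are themselves objects). The names iter_entries and 
to_json_lines and the "id" column are made up for illustration, and the 
parser is lenient about commas. Only one record is held in memory at a time.

```python
import io
import json


def iter_entries(stream, chunk_size=1 << 16):
    """Yield (key, value) pairs of one top-level JSON object, read in chunks."""
    decoder = json.JSONDecoder()
    buf, pos, eof = "", 0, False
    whitespace = " \t\r\n"

    def more():
        # Drop the consumed prefix and append the next chunk.
        nonlocal buf, pos, eof
        chunk = stream.read(chunk_size)
        eof = not chunk
        buf = buf[pos:] + chunk
        pos = 0

    def skip(chars):
        # Advance past any of `chars`, refilling the buffer when it runs dry.
        nonlocal pos
        while True:
            while pos < len(buf) and buf[pos] in chars:
                pos += 1
            if pos < len(buf) or eof:
                return
            more()

    def decode():
        # Decode one JSON value at `pos`; on a truncated value, read more and retry.
        nonlocal pos
        while True:
            try:
                value, end = decoder.raw_decode(buf, pos)
                # Accept only if more input follows (or at EOF), so a value
                # cut off at the chunk boundary is never taken as complete.
                if end < len(buf) or eof:
                    pos = end
                    return value
            except json.JSONDecodeError:
                if eof:
                    raise
            more()

    skip(whitespace)
    if pos >= len(buf) or buf[pos] != "{":
        raise ValueError("expected a top-level JSON object")
    pos += 1
    while True:
        skip(whitespace + ",")
        if pos >= len(buf):
            raise ValueError("unexpected end of input")
        if buf[pos] == "}":
            return
        key = decode()
        skip(whitespace)
        if pos >= len(buf) or buf[pos] != ":":
            raise ValueError("expected ':' after key")
        pos += 1
        skip(whitespace)
        yield key, decode()


def to_json_lines(src, dst):
    """Write one JSON object per line, copying the outer key into an 'id' field."""
    count = 0
    for key, record in iter_entries(src):
        dst.write(json.dumps({"id": key, **record}) + "\n")
        count += 1
    return count
```

For a real file, to_json_lines(open("big.json"), open("records.jsonl", "w")) 
would produce the one-object-per-line file (both paths are placeholders). 
The conversion itself runs on a single machine, so it trades a one-off 
sequential pass for a file that can afterwards be processed in parallel.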

> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> There is a good link for this here, 
> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
> 
> If there are a lot of small files, then it would work reasonably well in a 
> distributed manner, but I am worried if it is a single large file.
> 
> In this case, this would only run in a single executor, which I think will 
> end up with an OutOfMemoryError.
> 
> The Spark JSON data source does not support multi-line JSON as input due to 
> the limitations of TextInputFormat and LineRecordReader.
> 
> You may have to just extract the values after reading it with textFile.
> 
> 
> 
> 2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:
>> Hi, there
>> 
>> Spark has provided a JSON document processing feature for a long time. In 
>> most examples I see, each line of the sample file is a JSON object. That is 
>> the easiest case. But how can we process a JSON document that does not 
>> conform to this standard format (one JSON object per line)? Here is the 
>> document I am working on. 
>> 
>> First of all, it is one single big JSON object spread over multiple lines. 
>> The real file can be as large as 20+ GB. That one JSON object contains many 
>> name/value pairs. Each name is some kind of id value. Each value is the 
>> actual JSON object that I would like to become part of a DataFrame. 
>> Is there any way to do that? I appreciate any input. 
>> 
>> 
>> {
>>     "id1": {
>>     "Title":"title1",
>>     "Author":"Tom",
>>     "Source":{
>>         "Date":"20160506",
>>         "Type":"URL"
>>     },
>>     "Data":" blah blah"},
>> 
>>     "id2": {
>>     "Title":"title2",
>>     "Author":"John",
>>     "Source":{
>>         "Date":"20150923",
>>         "Type":"URL"
>>     },
>>     "Data":"  blah blah "},
>> 
>>     "id3": {
>>     "Title":"title3",
>>     "Author":"John",
>>     "Source":{
>>         "Date":"20150902",
>>         "Type":"URL"
>>     },
>>     "Data":" blah blah "}
>> }
> 
