The link uses the wholeTextFiles() API, which treats each file as a single record.
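
A minimal sketch of that approach, assuming the layout from the original question (one top-level JSON object whose values are the per-ID records). The helper name `records_from_file` and the added `id` field are my own choices for illustration, not anything from the linked post. Note that it parses the whole file in memory, so it only suits files that fit in one executor.

```python
import json


def records_from_file(path_and_text):
    """Turn one (path, content) pair into one JSON string per top-level entry.

    The (path, content) shape matches what wholeTextFiles() produces.
    """
    _path, text = path_and_text
    doc = json.loads(text)  # the whole file is parsed in memory here
    # Copy each ID into its record so it survives as a column (hypothetical name "id").
    return [json.dumps(dict(body, id=key)) for key, body in doc.items()]


# Small stand-in for one record of sc.wholeTextFiles(...); the path is made up.
SAMPLE_PAIR = (
    "sample.json",
    '{"id1": {"Title": "title1", "Author": "Tom"},'
    ' "id2": {"Title": "title2", "Author": "John"}}',
)
rows = records_from_file(SAMPLE_PAIR)
```

With a SparkContext `sc` and a SQLContext `sqlContext` in scope, something like `sqlContext.read.json(sc.wholeTextFiles(path).flatMap(records_from_file))` should then give one row per ID, where `path` is a placeholder for the input location.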

2016-07-07 15:42 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:

> That is not necessarily the case. If you look at the Hadoop
> FileInputFormat architecture, you can even split large multi-line JSON
> files without issues. I would need to have a look at it, but one large file
> does not mean one executor, regardless of the underlying format.
>
> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> There is a good link for this here,
> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>
> If there are a lot of small files, it would work reasonably well in a
> distributed manner, but I am worried about the case of a single large file.
>
> In that case, it would run in only a single executor, which I think will
> end up with an OutOfMemoryError.
>
> The Spark JSON data source does not support multi-line JSON as input due to
> the limitations of TextInputFormat and LineRecordReader.
>
> You may have to just extract the values after reading the file with textFile.
>
>
> 2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:
>
>> Hi, there
>>
>> Spark has provided a JSON document processing feature for a long time. In
>> most examples I see, each line of the sample file is a JSON object. That is
>> the easiest case. But how can we process a JSON document that does not
>> conform to this standard format (one JSON object per line)? Here is the
>> document I am working on.
>>
>> First of all, it is one single big JSON object spread over multiple lines.
>> The real file can be as large as 20+ GB. That single JSON object contains
>> many name/value pairs. Each name is some kind of ID value, and each value is
>> the actual JSON object that I would like to become part of a DataFrame.
>> Is there any way to do that? I appreciate any input.
>>
>>
>> {
>> "id1": {
>> "Title":"title1",
>> "Author":"Tom",
>> "Source":{
>> "Date":"20160506",
>> "Type":"URL"
>> },
>> "Data":" blah blah"},
>>
>> "id2": {
>> "Title":"title2",
>> "Author":"John",
>> "Source":{
>> "Date":"20150923",
>> "Type":"URL"
>> },
>> "Data":" blah blah "},
>>
>> "id3": {
>> "Title":"title3",
>> "Author":"John",
>> "Source":{
>> "Date":"20150902",
>> "Type":"URL"
>> },
>> "Data":" blah blah "}
>> }
>>
>>
>
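
For the single-large-file case discussed above, one option outside Spark is to stream the top-level entries and rewrite them as one JSON object per line, which the JSON data source can then read. The following is only a rough sketch using the Python standard library; `iter_top_level_items` is a hypothetical helper, and it assumes every top-level value is itself a JSON object, as in the sample (a bare top-level number split across chunks could be misparsed). Each individual value still has to fit in memory.

```python
import io
import json
import re

_SKIP = re.compile(r"[ \t\r\n,]*")  # whitespace and item separators (lenient)
_WS = re.compile(r"[ \t\r\n]*")
_DECODER = json.JSONDecoder()


def _try_parse_item(buf, pos):
    """Parse '"key": value' at buf[pos:]; return (key, value, end) or None."""
    try:
        key, pos = _DECODER.raw_decode(buf, pos)
        pos = _WS.match(buf, pos).end()
        if buf[pos] != ":":
            raise ValueError("expected ':' after key %r" % (key,))
        pos = _WS.match(buf, pos + 1).end()
        value, pos = _DECODER.raw_decode(buf, pos)
    except (json.JSONDecodeError, IndexError):
        return None  # not enough buffered text yet (or malformed input)
    return key, value, pos


def iter_top_level_items(stream, chunk_size=1 << 16):
    """Yield (key, value) pairs of one big JSON object, reading in chunks."""
    buf, pos, started = "", 0, False
    while True:
        pos = _SKIP.match(buf, pos).end()
        if not started:
            if pos < len(buf):
                if buf[pos] != "{":
                    raise ValueError("expected a top-level JSON object")
                started, pos = True, pos + 1
                continue
        else:
            if pos < len(buf) and buf[pos] == "}":
                return  # end of the top-level object
            item = _try_parse_item(buf, pos)
            if item is not None:
                key, value, pos = item
                yield key, value
                continue
        # Nothing parseable in the buffer: drop consumed text and read more.
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("unexpected end of input")
        buf, pos = buf[pos:] + chunk, 0


# Tiny stand-in for the real file; a small chunk size exercises the refill path.
SAMPLE = """{
"id1": {"Title": "title1", "Source": {"Date": "20160506", "Type": "URL"}},

"id2": {"Title": "title2", "Source": {"Date": "20150923", "Type": "URL"}}
}"""
items = list(iter_top_level_items(io.StringIO(SAMPLE), chunk_size=5))
```

Writing `json.dumps(dict(value, id=key))` for each pair, one per line, would produce a file in the one-object-per-line format that the usual examples expect.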
