This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSON files without issues. I would need to take a closer look at it, but one large file does not necessarily mean one executor, independent of the underlying format.
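For illustration, a minimal sketch (hypothetical path, tried from spark-shell) of the first half of that point: a single large, uncompressed text file is already divided into many input splits by FileInputFormat, so it can be read by more than one executor. What TextInputFormat cannot do is align those splits with multi-line JSON record boundaries; that would need a custom FileInputFormat whose RecordReader scans forward to the start of the next record, the same way LineRecordReader scans to the next newline.

  // hypothetical 20+ GB input file
  val rdd = sc.textFile("/data/big.json", minPartitions = 200)
  println(rdd.getNumPartitions)  // typically far more than 1 for a file this size
  // each partition corresponds to one input split and can be processed by a
  // different executor; only the record boundaries here are still line-based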
> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> There is a good link for this here:
> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>
> If there are a lot of small files, then it would work pretty well in a
> distributed manner, but I am worried if it is a single large file.
>
> In that case, this would only work in a single executor, which I think will
> end up with an OutOfMemoryException.
>
> The Spark JSON data source does not support multi-line JSON as input due to
> the limitation of TextInputFormat and LineRecordReader.
>
> You may have to just extract the values after reading it with textFile.
>
>
> 2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:
>> Hi there,
>>
>> Spark has provided a JSON document processing feature for a long time. In
>> most examples I see, each line is a JSON object in the sample file. That is
>> the easiest case. But how can we process a JSON document which does not
>> conform to this standard format (one line per JSON object)? Here is the
>> document I am working on.
>>
>> First of all, it is multiple lines for one single big JSON object. The real
>> file can be as large as 20+ GB. Within that one single JSON object, it
>> contains many name/value pairs. The name is some kind of id value. The
>> value is the actual JSON object that I would like to be part of the
>> dataframe. Is there any way to do that? Appreciate any input.
>>
>>
>> {
>>   "id1": {
>>     "Title": "title1",
>>     "Author": "Tom",
>>     "Source": {
>>       "Date": "20160506",
>>       "Type": "URL"
>>     },
>>     "Data": "blah blah"},
>>
>>   "id2": {
>>     "Title": "title2",
>>     "Author": "John",
>>     "Source": {
>>       "Date": "20150923",
>>       "Type": "URL"
>>     },
>>     "Data": "blah blah"},
>>
>>   "id3": {
>>     "Title": "title3",
>>     "Author": "John",
>>     "Source": {
>>       "Date": "20150902",
>>       "Type": "URL"
>>     },
>>     "Data": "blah blah"}
>> }
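If pre-processing the file once outside Spark is acceptable, one way to follow the "extract the values" suggestion above is a streaming pass with Jackson that rewrites the single big object into line-delimited JSON, which the normal JSON data source can then read in parallel. A rough sketch, assuming hypothetical paths /data/big.json and /data/big.jsonl and Jackson on the classpath; memory stays bounded because only one record is materialized at a time:

  import com.fasterxml.jackson.core.JsonToken
  import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
  import com.fasterxml.jackson.databind.node.ObjectNode
  import java.io.{File, PrintWriter}

  val mapper = new ObjectMapper()
  val parser = mapper.getFactory.createParser(new File("/data/big.json"))
  val out    = new PrintWriter("/data/big.jsonl")

  // the root token is the one big object; every field name is an id ("id1", ...)
  // and every field value is the record we want as one output line
  assert(parser.nextToken() == JsonToken.START_OBJECT)
  while (parser.nextToken() == JsonToken.FIELD_NAME) {
    val id = parser.getCurrentName
    parser.nextToken()                                // move to the value object
    val record: JsonNode = mapper.readTree(parser)    // reads exactly one record
    record.asInstanceOf[ObjectNode].put("id", id)     // keep the id as a column
    out.println(mapper.writeValueAsString(record))    // one JSON object per line
  }
  out.close()
  parser.close()

  // the converted file is ordinary line-delimited JSON, so the existing data
  // source can read it in parallel afterwards:
  // val df = sqlContext.read.json("/data/big.jsonl")

The conversion itself is single-threaded, but since it only streams it can handle a 20+ GB file without loading it into memory; after that, the problem reduces to the ordinary one-JSON-object-per-line case.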