Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Joe Halliwell
I've raised the JSON-related ticket at https://issues.apache.org/jira/browse/SPARK-7366. @Ewan I think it would be great to support multiline CSV records too. The motivation is very similar, but my instinct is that little or nothing of the implementation could be usefully shared, so it's better as a separate ticket.

Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Ewan Higgs
FWIW, CSV has the same problem, which renders it immune to naive partitioning. Consider the following RFC 4180 compliant record: 1,2," all,of,these,are,just,one,field ",4,5 Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this near…
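Ewan's point is easy to demonstrate. A small Python sketch (illustrative only, not Spark code): a naive comma split tears the quoted field apart, while an RFC 4180 aware parser keeps it as a single field.

```python
import csv
import io

# An RFC 4180 record whose quoted field contains commas (and a newline).
record = '1,2,"all,of,these,are,\njust,one,field",4,5\n'

# Naive split on commas ignores quoting and tears the record apart.
naive = record.strip().split(",")
print(len(naive))  # 11 pieces -- wrong

# A compliant CSV parser keeps the quoted field intact.
rows = list(csv.reader(io.StringIO(record)))
print(len(rows[0]))  # 5 fields -- the quoted field survived as one
```

The same logic applies to partitioning: a splitter that cuts the file at an arbitrary newline may land inside a quoted field and corrupt two records at once.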

Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Joe Halliwell
@reynold, I'll raise a JIRA today. @oliver, let's discuss on the ticket? I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads. If possible, I think we should handle pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing out.
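Braces and colons inside string literals are exactly what breaks naive splitters. A Python sketch (using a valid variant of the pathological record, since the quoted one is not itself well-formed JSON): a bare brace counter terminates far too early, while a string-aware scan finds the true end of the record.

```python
import json

# A valid JSON record (adapted from the pathological example above)
# where braces and colons also appear inside strings.
record = '{"}": "{", ":": {"{": "}"}}'

# Naive approach: count braces with no notion of string state.
# The "}" inside the first key drops the count to zero at index 2.
depth = 0
for i, ch in enumerate(record):
    depth += (ch == "{") - (ch == "}")
    if depth == 0:
        print(i)  # 2 -- far short of the real record end
        break

def record_end(s: str) -> int:
    """Return the index of the closing brace of the first JSON object in s,
    skipping braces that occur inside string literals (and escapes)."""
    depth, in_str, escape = 0, False, False
    for i, ch in enumerate(s):
        if in_str:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1

end = record_end(record)
print(end == len(record) - 1)    # True: spans the whole record
print(json.loads(record[:end + 1]))  # parses cleanly
```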

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
@joe, I'd be glad to help if you need. On Mon, May 4, 2015 at 20:06, Matei Zaharia wrote: > I don't know whether this is common, but we might also allow another > separator for JSON objects, such as two blank lines. > > Matei

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei > On May 4, 2015, at 2:28 PM, Reynold Xin wrote: > > Joe - I think that's a legit and useful thing to do. Do you want to give it > a shot?
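The blank-line separator idea is easy to prototype. A small Python sketch (illustrative only): because conforming JSON strings cannot contain raw newline control characters, a blank line can never occur inside a record, so a splitter can find record boundaries without parsing any JSON at all.

```python
import json
import re

# Two pretty-printed records separated by a blank line.
raw = '{\n  "id": 1\n}\n\n{\n  "id": 2\n}\n'

# Splitting on blank lines is safe: a conforming JSON encoder can never
# emit a blank line inside a record (raw newlines are illegal in strings).
chunks = re.split(r"\n\s*\n", raw.strip())
records = [json.loads(chunk) for chunk in chunks]
print(records)  # [{'id': 1}, {'id': 2}]
```

The trade-off is that the input is no longer plain JSON, so producers would have to opt into the convention.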

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
Joe - I think that's a legit and useful thing to do. Do you want to give it a shot?

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Paul Brown
It's not JSON, per se, but data formats like smile ( http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide support for markers that can't be confused with content and also provide reasonably similar ergonomics. — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
I was wondering if it's possible to use existing Hive SerDes for this?

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Joe Halliwell
I think Reynold’s argument shows the impossibility of the general case. But a “maximum object depth” hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I’d certainly be interested in…
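One hypothetical way such a hint could be used (a sketch, not an agreed design; the function names and the depth value are invented): a reader dropped at an arbitrary offset treats each { as a candidate record start, and rejects any candidate whose apparent nesting exceeds the hinted maximum or that fails to parse. A candidate that actually began inside a string or deep inside another object will usually violate the bound.

```python
import json

MAX_DEPTH = 2  # hinted maximum nesting of a record (assumption for this sketch)

def try_record(s, i, max_depth=MAX_DEPTH):
    """Try to read one complete JSON object starting at s[i]; return its end
    index, or None if it is not a plausible record under the depth hint."""
    depth, in_str, escape = 0, False, False
    for j in range(i, len(s)):
        ch = s[j]
        if in_str:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
            if depth > max_depth:  # violates the hint: bad candidate
                return None
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    json.loads(s[i:j + 1])
                    return j
                except ValueError:
                    return None
    return None

data = '[{"a": {"b": 1}},\n {"a": {"b": 2}}]'
# A worker assigned the middle of the file scans forward for "{" candidates.
offset = len(data) // 2
starts = [k for k in range(offset, len(data))
          if data[k] == "{" and try_record(data, k) is not None]
print(data[starts[0]: try_record(data, starts[0]) + 1])  # {"a": {"b": 2}}
```

Note the sketch still accepts nested objects as candidates ({"b": 2} above also parses on its own), so adjacent partitions would need to reconcile their boundaries; that is presumably part of the fiddliness.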

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
I took a quick look at that implementation. I'm not sure it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string rather than a real record boundary.
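The objection can be reproduced in a few lines (Python, illustrative data): a { found by scanning forward from an arbitrary split point may sit inside a string literal, so treating it as a record start produces garbage.

```python
import json

data = '[{"note": "use {braces} carefully", "id": 1},\n {"id": 2}]'

# Suppose a partition boundary happens to fall just before the "{"
# that lives inside the string value above.
split = data.index("braces") - 1     # index of the "{" inside the string
candidate = data.index("{", split)   # naive: first "{" at or after the split

try:
    json.loads(data[candidate:])     # '{braces} carefully", ...' is not JSON
    parsed = True
except ValueError:
    parsed = False
print(parsed)  # False: the brace was part of a string, not a record start
```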

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Emre Sevinc
You can check out the following library: https://github.com/alexholmes/json-mapreduce -- Emre Sevinç On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > Is there any way in Spark SQL to load multi-line JSON data efficiently? I > think…

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
I'll try to study that and get back to you. Regards, Olivier.

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
How does the pivotal format decide where to split the files? It seems to me the challenge is deciding that, and off the top of my head the only way to do this is to scan from the beginning and parse the JSON properly, which makes it not possible with large files (doable for whole input with a lot…
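The sequential approach described here can be sketched as a one-pass indexer (Python, illustrative only): scan the whole input once with a string-aware parser, record the span of each top-level record, then hand those spans to parallel workers. Correct, but the indexing pass itself is inherently serial, which is the scalability concern.

```python
import json

def record_offsets(data):
    """One sequential pass over a top-level JSON array; returns (start, end)
    index pairs for each element, tracking string/escape state so braces
    inside strings are ignored."""
    offsets = []
    depth, in_str, escape, start = 0, False, False, None
    for i, ch in enumerate(data):
        if in_str:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                offsets.append((start, i + 1))
    return offsets

data = '[{"a": "}{"},\n {"b": {"c": 2}}]'
spans = record_offsets(data)
# Each span is independently parseable, so this step could run in parallel.
records = [json.loads(data[s:e]) for s, e in spans]
print(records)  # [{'a': '}{'}, {'b': {'c': 2}}]
```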