I've raised the JSON-related ticket at
https://issues.apache.org/jira/browse/SPARK-7366.
@Ewan I think it would be great to support multiline CSV records too.
The motivation is very similar, but my instinct is that little or none
of the implementation could be usefully shared, so it's better as a
separate ticket.
FWIW, CSV has the same problem that defeats naive partitioning.
Consider the following RFC 4180 compliant record:
1,2,"
all,of,these,are,just,one,field
",4,5
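To make the hazard concrete, here's a minimal Python sketch (my own illustration, not tied to any Spark code) contrasting naive newline splitting with an RFC 4180 parser on that record:

```python
import csv
import io

# The RFC 4180 record from above: a quoted field spanning three lines.
raw = '1,2,"\nall,of,these,are,just,one,field\n",4,5\n'

# Naive partitioning treats every newline as a record boundary...
naive = raw.strip().split("\n")
print(len(naive))  # three bogus "records"

# ...while a compliant CSV parser yields one record of five fields.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows), len(rows[0]))
```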
Now, it's probably a terrible idea to give a file system awareness of
actual file types, but couldn't HDFS handle this near
@reynold, I’ll raise a JIRA today. @oliver, let’s discuss on the ticket?
I suspect the algorithm is going to be a bit fiddly and would definitely benefit
from multiple heads. If possible, I think we should handle pathological cases
like {":":":",{"{":"}"}} correctly, rather than bailing out.
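For what it's worth, handling braces and colons inside strings mostly comes down to tracking quote and escape state while counting depth. A rough Python sketch of such a scanner (illustrative only, not a proposed implementation):

```python
import json

def object_spans(text):
    """Yield (start, end) spans of top-level {...} objects, ignoring
    braces that appear inside string literals."""
    depth = 0
    in_string = False
    escaped = False
    start = 0
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # character after a backslash
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield (start, i + 1)

# Braces and colons inside strings don't confuse the scanner.
tricky = '{"a":"}{"} {"b":{"{":"}"}}'
spans = [json.loads(tricky[s:e]) for s, e in object_spans(tricky)]
```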
@joe, I'd be glad to help if you need.
On Mon, May 4, 2015 at 8:06 PM, Matei Zaharia wrote:
> I don't know whether this is common, but we might also allow another
> separator for JSON objects, such as two blank lines.
>
> Matei
>
> > On May 4, 2015, at 2:28 PM, Reynold Xin wrote:
> >
> > Joe - I
I don't know whether this is common, but we might also allow another separator
for JSON objects, such as two blank lines.
Matei
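A separator like that would make splitting trivial. A quick sketch of what reading such a (hypothetical, not yet implemented) format could look like:

```python
import json

def read_blank_line_delimited(text):
    # Assumes the convention floated above: complete JSON objects
    # separated by two blank lines (i.e. three consecutive newlines).
    chunks = [c for c in text.split("\n\n\n") if c.strip()]
    return [json.loads(c) for c in chunks]

sample = '{\n  "a": 1\n}\n\n\n{\n  "b": [2, 3]\n}'
records = read_blank_line_delimited(sample)
```

Each chunk boundary is unambiguous without parsing the JSON itself, which is exactly what a splittable input format needs.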
> On May 4, 2015, at 2:28 PM, Reynold Xin wrote:
>
> Joe - I think that's a legit and useful thing to do. Do you want to give it
> a shot?
>
> On Mon, May 4, 2015 at
Joe - I think that's a legit and useful thing to do. Do you want to give it
a shot?
On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell
wrote:
> I think Reynold’s argument shows the impossibility of the general case.
>
> But a “maximum object depth” hint could enable a new input format to do
> its job both efficiently and correctly in the common case where the input is
> an array of similarly structured objects!
It's not JSON, per se, but data formats like smile (
http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide
support for markers that can't be confused with content and also provide
reasonably similar ergonomics.
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
I was wondering if it's possible to use existing Hive SerDes for this?
On Mon, May 4, 2015 at 8:36 AM, Joe Halliwell wrote:
> I think Reynold’s argument shows the impossibility of the general case.
>
> But a “maximum object depth” hint could enable a new input format to do
> its job both efficiently and correctly in the common case where the input is
> an array of similarly structured objects!
I think Reynold’s argument shows the impossibility of the general case.
But a “maximum object depth” hint could enable a new input format to do its job
both efficiently and correctly in the common case where the input is an array
of similarly structured objects! I’d certainly be interested in
I took a quick look at that implementation. I'm not sure if it actually
handles JSON correctly, because it attempts to find the first { starting
from a random point. However, that random point could be in the middle of a
string, and thus the first { might just be part of a string, rather than a
real object boundary.
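That failure mode is easy to reproduce. A small Python illustration (the data and split offset here are made up for the example):

```python
import json

data = '[{"note": "beware {braces} in strings"}, {"id": 2}]'

# A split point that happens to land inside the first string value.
offset = 12
naive_start = data.index("{", offset)

# The first '{' found is the literal brace inside the string, so
# parsing from there cannot succeed.
try:
    json.loads(data[naive_start:])
    parses = True
except json.JSONDecodeError:
    parses = False
```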
You can check out the following library:
https://github.com/alexholmes/json-mapreduce
--
Emre Sevinç
On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:
> Hi everyone,
> Is there any way in Spark SQL to load multi-line JSON data efficiently, I
> think
I'll try to study that and get back to you.
Regards,
Olivier.
On Mon, May 4, 2015 at 4:05 AM, Reynold Xin wrote:
> How does the pivotal format decide where to split the files? It seems to
> me the challenge is to decide that, and off the top of my head the only way
> to do this is to scan from the beginning and parse the JSON properly
How does the pivotal format decide where to split the files? It seems to
me the challenge is to decide that, and off the top of my head the only way
to do this is to scan from the beginning and parse the JSON properly, which
makes it not possible with large files (doable for whole input with a lot