Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-06 Thread Nicholas Chammas
Nice catch Brad and thanks to Yin and Davies for getting on it so quickly. On Wed, Aug 6, 2014 at 2:45 AM, Davies Liu wrote: > There is a PR to fix this: https://github.com/apache/spark/pull/1802 > > On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller > wrote: > > I concur that printSchema works; it

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Davies Liu
There is a PR to fix this: https://github.com/apache/spark/pull/1802 On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller wrote: > I concur that printSchema works; it just seems to be operations that use the > data where trouble happens. > > Thanks for posting the bug. > > -Brad > > > On Tue, Aug 5, 2014

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
I concur that printSchema works; it just seems to be operations that use the data where trouble happens. Thanks for posting the bug. -Brad On Tue, Aug 5, 2014 at 10:05 PM, Yin Huai wrote: > I tried jsonRDD(...).printSchema() and it worked. Seems the problem is > when we take the data back to

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Yin Huai
I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we take the data back to the Python side, SchemaRDD#javaToPython failed on your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875 to track it. Thanks, Yin On Tue, Aug 5, 2014 at 9:20 PM, Brad Miller
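
Editor's sketch of the distinction Yin describes: working out the schema is one step, and handing the actual rows back to Python is a separate one. This is not Spark code. It uses only the standard `json` module, and `describe` and its type labels are hypothetical names made up for illustration, not Spark's schema notation. The sample records are the ones quoted elsewhere in this thread.

```python
import json


def describe(value):
    """Return a hypothetical type label for a parsed JSON value (not Spark's notation)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        # Collect the distinct element types; an empty list has no known element type.
        inner = sorted({describe(item) for item in value}) or ["unknown"]
        return "array<" + "|".join(inner) + ">"
    if isinstance(value, dict):
        fields = ", ".join(
            "{}: {}".format(key, describe(val)) for key, val in sorted(value.items())
        )
        return "struct<" + fields + ">"
    raise TypeError("unsupported JSON value: {!r}".format(value))


# Nested-array records taken from the thread.
records = ['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']

# Step 1: look only at the shape of the data (roughly what a schema printout needs).
parsed = [json.loads(record) for record in records]
schemas = {describe(row) for row in parsed}

# Step 2: materialize the values themselves (roughly what collect() has to do).
rows = [row["foo"] for row in parsed]

print(schemas)
print(rows)
```

In this stand-in both steps succeed; the point is only that they are separate, so one can work while the other fails, as reported here.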

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I checked out and built master. Note that Maven had a problem building Kafka (in my case, at least); I was unable to fix this easily so I moved on since it seemed unlikely to have any influence on the problem at hand. Master improves functionality (including the example Nicholas just dem

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Michael Armbrust
We try to keep master very stable, but this is where active development happens. YMMV, but a lot of people do run very close to master without incident (myself included). branch-1.0 has been cut for a while and we only merge bug fixes into it (this is more strict for non-alpha components like spar

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
collect() works, too. >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}'])).collect() [Row(foo=[[1, 2, 3], [4, 5, 6]]), Row(foo=[[1, 2, 3], [4, 5, 6]])] Can’t answer your question about branch stability, though. Spark is a very active project, s

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi Nick, Can you check that the call to "collect()" works as well as "printSchema()"? I actually experience that "printSchema()" works fine, but then it crashes on "collect()". In general, should I expect the master (which seems to be on branch-1.1) to be any more/less stable than branch-1.0? W

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
This looks to be fixed in master: >>> from pyspark.sql import SQLContext >>> sqlContext = SQLContext(sc) >>> sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']) ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:315 >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I've built and deployed the current head of branch-1.0, but it seems to have only partly fixed the bug. This code now runs as expected with the indicated output: > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', '{"foo":[4,5,6]}'])) > srdd.printSchema() root |-- foo: ArrayType[I

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Nick: Thanks for both the original JIRA bug report and the link. Michael: This is on the 1.0.1 release. I'll update to master and follow-up if I have any problems. best, -Brad On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust wrote: > Is this on 1.0.1? I'd suggest running this on master or

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Michael Armbrust
Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC which should be coming out this week. Pyspark did not have good support for nested data previously. If you still encounter issues using a more recent version, please file a JIRA. Thanks! On Tue, Aug 5, 2014 at 11:55 AM, Brad M

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
I believe this is a known issue in 1.0.1 that's fixed in 1.0.2. See: SPARK-2376: Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException On Tue, Aug 5, 2014 at 2:55 PM, Brad Miller wrote: > Hi All, > > I am i

trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I would like to use jsonRDD and jsonFile to create a SchemaRDD from some JSON data, but I've run into instability involving the following Java exception: An error occurred while calling o1326.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Ta
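
Editor's sketch of the input shape in question, assuming the usual layout for this kind of loader: a text file with one self-contained JSON object per line. It is plain Python using only the standard library, not the Spark API. `read_json_lines` is a hypothetical helper, and the two records are the flat-array examples quoted elsewhere in this thread.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                rows.append(json.loads(line))
    return rows


# Sample records taken from the thread.
records = [{"foo": [1, 2, 3]}, {"foo": [4, 5, 6]}]

# Write the records to a temporary file, one JSON document per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    path = tmp.name

try:
    rows = read_json_lines(path)
finally:
    os.remove(path)  # clean up the temporary file

print(rows)
```

Checking a data file this way first can help separate malformed input from the library-side failure discussed in the replies.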