We try to keep master very stable, but this is where active development happens. YMMV, but a lot of people do run very close to master without incident (myself included).
branch-1.0 has been cut for a while and we only merge bug fixes into it (this is more strict for non-alpha components like spark core.). For Spark SQL, this branch is pretty far behind as the project is very young and we are fixing bugs / adding features very rapidly compared with Spark core. branch-1.1 was just cut and is being QAed for a release, at this point its likely the same as master, but that will change as features start getting added to master in the coming weeks. On Tue, Aug 5, 2014 at 5:38 PM, Nicholas Chammas <nicholas.cham...@gmail.com > wrote: > collect() works, too. > > >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', > >>> '{"foo":[[1,2,3], [4,5,6]]}'])).collect() > [Row(foo=[[1, 2, 3], [4, 5, 6]]), Row(foo=[[1, 2, 3], [4, 5, 6]])] > > Can’t answer your question about branch stability, though. Spark is a very > active project, so stuff is happening all the time. > > Nick > > > > On Tue, Aug 5, 2014 at 7:20 PM, Brad Miller <bmill...@eecs.berkeley.edu> > wrote: > >> Hi Nick, >> >> Can you check that the call to "collect()" works as well as >> "printSchema()"? I actually experience that "printSchema()" works fine, >> but then it crashes on "collect()". >> >> In general, should I expect the master (which seems to be on branch-1.1) >> to be any more/less stable than branch-1.0? While it would be great to >> have this fixed, it would be good to know if I should expect lots of other >> instability. >> >> best, >> -Brad >> >> >> On Tue, Aug 5, 2014 at 4:15 PM, Nicholas Chammas < >> nicholas.cham...@gmail.com> wrote: >> >>> This looks to be fixed in master: >>> >>> >>> from pyspark.sql import SQLContext>>> sqlContext = SQLContext(sc) >>> >>> sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], >>> >>> [4,5,6]]}' >>> >>> >>> >>> >>> ]) >>> ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:315>>> >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', >>> '{"foo":[[1,2,3], [4,5,6]]}'])) >>> MapPartitionsRDD[14] at mapPartitions at SchemaRDD.scala:408>>> >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', >>> '{"foo":[[1,2,3], [4,5,6]]}'])).printSchema() >>> root >>> |-- foo: array (nullable = true) >>> | |-- element: array (containsNull = false) >>> | | |-- element: integer (containsNull = false) >>> >>> >>> >>> >>> Nick >>> >>> >>> >>> On Tue, Aug 5, 2014 at 7:12 PM, Brad Miller <bmill...@eecs.berkeley.edu> >>> wrote: >>> >>>> Hi All, >>>> >>>> I've built and deployed the current head of branch-1.0, but it seems to >>>> have only partly fixed the bug. >>>> >>>> This code now runs as expected with the indicated output: >>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', >>>> '{"foo":[4,5,6]}'])) >>>> > srdd.printSchema() >>>> root >>>> |-- foo: ArrayType[IntegerType] >>>> > srdd.collect() >>>> [{u'foo': [1, 2, 3]}, {u'foo': [4, 5, 6]}] >>>> >>>> This code still crashes: >>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', >>>> '{"foo":[[1,2,3], [4,5,6]]}'])) >>>> > srdd.printSchema() >>>> root >>>> |-- foo: ArrayType[ArrayType(IntegerType)] >>>> > srdd.collect() >>>> Py4JJavaError: An error occurred while calling o63.collect. >>>> : org.apache.spark.SparkException: Job aborted due to stage failure: >>>> Task 3.0:29 failed 4 times, most recent failure: Exception failure in TID >>>> 67 on host kunitz.research.intel-research.net: >>>> net.razorvine.pickle.PickleException: couldn't introspect javabean: >>>> java.lang.IllegalArgumentException: wrong number of arguments >>>> >>>> I may be able to see if this is fixed in master, but since it's not >>>> fixed in 1.0.3 it seems unlikely to be fixed in master either. I previously >>>> tried master as well, but ran into a build problem that did not occur with >>>> the 1.0 branch. >>>> >>>> Can anybody else verify that the second example still crashes (and is >>>> meant to work)? If so, would it be best to modify JIRA-2376 or start a new >>>> bug? >>>> https://issues.apache.org/jira/browse/SPARK-2376 >>>> >>>> best, >>>> -Brad >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Aug 5, 2014 at 12:10 PM, Brad Miller < >>>> bmill...@eecs.berkeley.edu> wrote: >>>> >>>>> Nick: Thanks for both the original JIRA bug report and the link. >>>>> >>>>> Michael: This is on the 1.0.1 release. I'll update to master and >>>>> follow-up if I have any problems. >>>>> >>>>> best, >>>>> -Brad >>>>> >>>>> >>>>> On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust < >>>>> mich...@databricks.com> wrote: >>>>> >>>>>> Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC >>>>>> which should be coming out this week. Pyspark did not have good support >>>>>> for nested data previously. If you still encounter issues using a more >>>>>> recent version, please file a JIRA. Thanks! >>>>>> >>>>>> >>>>>> On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller < >>>>>> bmill...@eecs.berkeley.edu> wrote: >>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> I am interested to use jsonRDD and jsonFile to create a SchemaRDD >>>>>>> out of some JSON data I have, but I've run into some instability >>>>>>> involving >>>>>>> the following java exception: >>>>>>> >>>>>>> An error occurred while calling o1326.collect. >>>>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: >>>>>>> Task 181.0:29 failed 4 times, most recent failure: Exception failure in >>>>>>> TID >>>>>>> 1664 on host neal.research.intel-research.net: >>>>>>> net.razorvine.pickle.PickleException: couldn't introspect javabean: >>>>>>> java.lang.IllegalArgumentException: wrong number of arguments >>>>>>> >>>>>>> I've pasted code which produces the error as well as the full >>>>>>> traceback below. Note that I don't have any problem when I parse the >>>>>>> JSON >>>>>>> myself and use inferSchema. >>>>>>> >>>>>>> Is anybody able to reproduce this bug? >>>>>>> >>>>>>> -Brad >>>>>>> >>>>>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar", >>>>>>> "baz":[1,2,3]}', '{"foo":"boom", "baz":[1,2,3]}'])) >>>>>>> > srdd.printSchema() >>>>>>> >>>>>>> root >>>>>>> |-- baz: ArrayType[IntegerType] >>>>>>> |-- foo: StringType >>>>>>> >>>>>>> > srdd.collect() >>>>>>> >>>>>>> >>>>>>> --------------------------------------------------------------------------- >>>>>>> Py4JJavaError Traceback (most recent >>>>>>> call last) >>>>>>> <ipython-input-89-ec7e8e8c68c4> in <module>() >>>>>>> ----> 1 srdd.collect() >>>>>>> >>>>>>> /home/spark/spark-1.0.1-bin-hadoop1/python/pyspark/rdd.py in >>>>>>> collect(self) >>>>>>> 581 """ >>>>>>> 582 with _JavaStackTrace(self.context) as st: >>>>>>> --> 583 bytesInJava = self._jrdd.collect().iterator() >>>>>>> 584 return >>>>>>> list(self._collect_iterator_through_file(bytesInJava)) >>>>>>> 585 >>>>>>> >>>>>>> /usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in >>>>>>> __call__(self, *args) >>>>>>> 535 answer = self.gateway_client.send_command(command) >>>>>>> 536 return_value = get_return_value(answer, >>>>>>> self.gateway_client, >>>>>>> --> 537 self.target_id, self.name) >>>>>>> 538 >>>>>>> 539 for temp_arg in temp_args: >>>>>>> >>>>>>> /usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in >>>>>>> get_return_value(answer, gateway_client, target_id, name) >>>>>>> 298 raise Py4JJavaError( >>>>>>> 299 'An error occurred while calling >>>>>>> {0}{1}{2}.\n'. >>>>>>> --> 300 format(target_id, '.', name), value) >>>>>>> 301 else: >>>>>>> 302 raise Py4JError( >>>>>>> >>>>>>> Py4JJavaError: An error occurred while calling o1326.collect. >>>>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: >>>>>>> Task 181.0:29 failed 4 times, most recent failure: Exception failure in >>>>>>> TID >>>>>>> 1664 on host neal.research.intel-research.net: >>>>>>> net.razorvine.pickle.PickleException: couldn't introspect javabean: >>>>>>> java.lang.IllegalArgumentException: wrong number of arguments >>>>>>> net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) >>>>>>> net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) >>>>>>> net.razorvine.pickle.Pickler.save(Pickler.java:125) >>>>>>> net.razorvine.pickle.Pickler.put_map(Pickler.java:322) >>>>>>> net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) >>>>>>> net.razorvine.pickle.Pickler.save(Pickler.java:125) >>>>>>> >>>>>>> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) >>>>>>> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) >>>>>>> net.razorvine.pickle.Pickler.save(Pickler.java:125) >>>>>>> net.razorvine.pickle.Pickler.dump(Pickler.java:95) >>>>>>> net.razorvine.pickle.Pickler.dumps(Pickler.java:80) >>>>>>> >>>>>>> org.apache.spark.sql.SchemaRDD$anonfun$javaToPython$1$anonfun$apply$3.apply(SchemaRDD.scala:385) >>>>>>> >>>>>>> org.apache.spark.sql.SchemaRDD$anonfun$javaToPython$1$anonfun$apply$3.apply(SchemaRDD.scala:385) >>>>>>> scala.collection.Iterator$anon$11.next(Iterator.scala:328) >>>>>>> >>>>>>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:294) >>>>>>> >>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) >>>>>>> >>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread$anonfun$run$1.apply(PythonRDD.scala:175) >>>>>>> >>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread$anonfun$run$1.apply(PythonRDD.scala:175) >>>>>>> >>>>>>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) >>>>>>> >>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) >>>>>>> Driver stacktrace: >>>>>>> at org.apache.spark.scheduler.DAGScheduler.org >>>>>>> $apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1044) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) >>>>>>> at >>>>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) >>>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) >>>>>>> at scala.Option.foreach(Option.scala:236) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) >>>>>>> at >>>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) >>>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) >>>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456) >>>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) >>>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219) >>>>>>> at >>>>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) >>>>>>> at >>>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>>>> at >>>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>>>>> at >>>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>>>> at >>>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >