I think this is due to the JSON file format. The DataFrame JSON reader only accepts files with one complete, valid JSON record per line (the JSON Lines format). A single record spread across multiple lines, as in your example, is invalid input for DataFrame.
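As a workaround, you can collapse each pretty-printed record into a single line before reading it with Spark. A minimal pure-Python sketch (no Spark required; the sample record is taken from your email):

```python
import json

# A pretty-printed record spread over many lines --
# this layout is what sqlContext.read.json cannot parse.
pretty = """{
    "dataunit": {
        "page_view": {
            "nonce": 438058072,
            "person": {"user_id": 5846},
            "page": {"url": "http://mysite.com/blog"}
        }
    },
    "pedigree": {"true_as_of_secs": 1438627992}
}"""

# Parse the whole document, then re-serialize the record
# as a single line (JSON Lines format).
record = json.loads(pretty)
one_line = json.dumps(record, separators=(",", ":"))
assert "\n" not in one_line  # one record per line
print(one_line)
```

Alternatively, if rewriting the files is not an option, in Spark 1.x you can read each multi-line file whole with sc.wholeTextFiles(path), parse the contents with json.loads, and build the DataFrame from the resulting records.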
On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu <[email protected]> wrote:
> Could you create a JIRA to track this bug?
>
> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
> <[email protected]> wrote:
> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
> >
> > I'm trying to read in a large quantity of json data in a couple of files
> > and I receive a scala.MatchError when I do so. Json, Python and stack
> > trace all shown below.
> >
> > Json:
> >
> > {
> >     "dataunit": {
> >         "page_view": {
> >             "nonce": 438058072,
> >             "person": {
> >                 "user_id": 5846
> >             },
> >             "page": {
> >                 "url": "http://mysite.com/blog"
> >             }
> >         }
> >     },
> >     "pedigree": {
> >         "true_as_of_secs": 1438627992
> >     }
> > }
> >
> > Python:
> >
> > import pyspark
> > sc = pyspark.SparkContext()
> > sqlContext = pyspark.SQLContext(sc)
> > pageviews = sqlContext.read.json("[Path to folder containing file with above json]")
> > pageviews.collect()
> >
> > Stack Trace:
> >
> > Py4JJavaError: An error occurred while calling
> > z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> > in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> > 32.0 (TID 133, localhost): scala.MatchError:
> > (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
> >     at org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
> >     at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
> >     at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
> >     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
> >     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
> >     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> >     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> >     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> >     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> >     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
> >     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> >     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
> >     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> >     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
> >     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> >     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> >     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> >     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> >     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> >     at org.apache.spark.scheduler.Task.run(Task.scala:70)
> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >     at java.lang.Thread.run(Thread.java:745)
> >
> > Driver stacktrace:
> >     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> >     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> >     at scala.Option.foreach(Option.scala:236)
> >     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> >     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> >     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> >     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> >
> > It seems like this issue has been resolved in scala per SPARK-3390
> > <https://issues.apache.org/jira/browse/SPARK-3390>; any thoughts on the
> > root cause of this in pyspark?
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Reading-JSON-in-Pyspark-throws-scala-MatchError-tp24911.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >

--
Best Regards

Jeff Zhang
