Hi Durin,

I guess blank lines caused the problem (as Aaron said). Right now, jsonFile does not skip faulty lines. Can you first use sc.textFile to load the file as an RDD[String] and then use filter to drop the blank lines? A code snippet is below.
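If it turns out that some non-blank lines are also malformed JSON, you could make the filter stricter and keep only lines that Jackson can actually parse. I have not tested this, so treat it as a rough sketch (it assumes the com.fasterxml.jackson.databind.ObjectMapper that Spark already pulls in); the simpler blank-line-only version follows right after it:

import com.fasterxml.jackson.databind.ObjectMapper

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
// Keep only non-blank lines that Jackson can parse; creating one mapper
// per partition avoids building an ObjectMapper for every record.
val parseable = raw.mapPartitions { iter =>
  val mapper = new ObjectMapper()
  iter.filter { line =>
    val trimmed = line.trim
    trimmed.nonEmpty && {
      try { mapper.readTree(trimmed); true }
      catch { case _: Exception => false }
    }
  }
}
val table = sqlContext.jsonRDD(parseable)
table.printSchema()

If the data is very dirty, you might also want to count the dropped lines (for example with an accumulator) to see how many records get skipped.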
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.textFile("hdfs://host:9100/user/myuser/data.json").filter(r => r.trim != "")
val table = sqlContext.jsonRDD(rdd)
table.printSchema()

Thanks,

Yin

On Wed, Jun 25, 2014 at 1:08 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Is it possible you have blank lines in your input? Not that this should be
> an error condition, but it may be what's causing it.
>
>
> On Wed, Jun 25, 2014 at 11:57 AM, durin <m...@simon-schaefer.net> wrote:
>
>> Hi Zongheng Yang,
>>
>> thanks for your response. Reading your answer, I did some more tests and
>> realized that analyzing very small parts of the dataset (which is ~130GB
>> in ~4.3M lines) works fine.
>> The error occurs when I analyze larger parts. Using 5% of the whole data,
>> the error is the same as posted before for certain TIDs. However, I get
>> the structure determined so far as a result when using 5%.
>>
>> The Spark WebUI shows the following:
>>
>> Job aborted due to stage failure: Task 6.0:11 failed 4 times, most recent
>> failure: Exception failure in TID 108 on host foo.bar.com:
>> com.fasterxml.jackson.databind.JsonMappingException: No content to map due
>> to end-of-input at [Source: java.io.StringReader@3697781f; line: 1, column: 1]
>> com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
>> com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3029)
>> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)
>> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
>> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
>> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
>> scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
>> org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:823)
>> org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:821)
>> org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
>> org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
>> org.apache.spark.scheduler.Task.run(Task.scala:51)
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> java.lang.Thread.run(Thread.java:662)
>> Driver stacktrace:
>>
>> Is the only possible reason that some of these 4.3 million JSON objects
>> are not valid JSON, or could there be another explanation?
>> And if that is the reason, is there some way to tell the function to just
>> skip faulty lines?
>>
>> Thanks,
>> Durin
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/jsonFile-function-in-SQLContext-does-not-work-tp8273p8278.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>