Hi Durin,

I guess the blank lines caused the problem (as Aaron said). Right now,
jsonFile does not skip faulty lines. Can you first use sc.textFile to load
the file as an RDD[String] and then use filter to drop the blank lines
(code snippet below)?

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load the file as an RDD[String] and drop blank lines before parsing.
val rdd = sc.textFile("hdfs://host:9100/user/myuser/data.json")
  .filter(r => r.trim != "")
val table = sqlContext.jsonRDD(rdd)
table.printSchema()
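
If some non-blank lines turn out to be malformed JSON as well, a stricter
filter that tries to parse each line could drop those too. This is only a
sketch (the per-partition ObjectMapper and the variable names are
illustrative, not what jsonRDD does internally); it assumes jackson-databind
is on the classpath, which it already is for Spark SQL:

import com.fasterxml.jackson.databind.ObjectMapper

// Keep only lines that Jackson can parse; blank and malformed lines are dropped.
val validJson = rdd.mapPartitions { iter =>
  val mapper = new ObjectMapper()  // one mapper per partition, reused across lines
  iter.filter { line =>
    try { mapper.readTree(line); true }
    catch { case _: Exception => false }
  }
}
val cleanTable = sqlContext.jsonRDD(validJson)
cleanTable.printSchema()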

Thanks,

Yin



On Wed, Jun 25, 2014 at 1:08 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Is it possible you have blank lines in your input? Not that this should be
> an error condition, but it may be what's causing it.
>
>
> On Wed, Jun 25, 2014 at 11:57 AM, durin <m...@simon-schaefer.net> wrote:
>
>> Hi Zongheng Yang,
>>
>> thanks for your response. Reading your answer, I did some more tests and
>> realized that analyzing very small parts of the dataset (which is ~130GB
>> in
>> ~4.3M lines) works fine.
>> The error occurs when I analyze larger parts. Using 5% of the whole data,
>> the error is the same as posted before for certain TIDs. However, I get
>> the
>> structure determined so far as a result when using 5%.
>>
>> The Spark WebUI shows the following:
>>
>> Job aborted due to stage failure: Task 6.0:11 failed 4 times, most recent
>> failure: Exception failure in TID 108 on host foo.bar.com:
>> com.fasterxml.jackson.databind.JsonMappingException: No content to map due
>> to end-of-input at [Source: java.io.StringReader@3697781f; line: 1,
>> column:
>> 1]
>>
>> com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
>>
>> com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3029)
>>
>> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)
>>
>> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
>>
>> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
>>
>> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>
>> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
>> scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
>> org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:823)
>> org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:821)
>> org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
>> org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
>> org.apache.spark.scheduler.Task.run(Task.scala:51)
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> java.lang.Thread.run(Thread.java:662) Driver stacktrace:
>>
>>
>>
>> Is the only possible reason that some of these 4.3 Million JSON-Objects
>> are
>> not valid JSON, or could there be another explanation?
>> And if it is the reason, is there some way to tell the function to just
>> skip
>> faulty lines?
>>
>>
>> Thanks,
>> Durin
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/jsonFile-function-in-SQLContext-does-not-work-tp8273p8278.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
