Hi Konstantin,

Thanks for reporting this. This happens because there are null keys in your
data. In general, Spark should not throw null pointer exceptions, so this
is a bug. I have fixed this here: https://github.com/apache/spark/pull/1288.

For now, you can work around this by special-casing null keys before
passing your key-value pairs to a combine operator (e.g. groupByKey,
reduceByKey). For instance: rdd.map { case (k, v) => if (k == null)
(SPECIAL_VALUE, v) else (k, v) }.
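
Here is a slightly fuller, self-contained sketch of that workaround, in
case it is useful. The sentinel string, the sample data, and the local
master are placeholders; pick a sentinel that can never collide with a
real key:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._ // pair RDD functions in Spark 1.0

  object NullKeyWorkaround {
    // Placeholder sentinel; must never occur as a real key.
    val SPECIAL_VALUE = "__NULL_KEY__"

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("null-key-workaround").setMaster("local[*]"))

      val pairs = sc.parallelize(Seq(("a", 1), (null: String, 2), ("a", 3)))

      val summed = pairs
        // Swap null keys for the sentinel before the combine...
        .map { case (k, v) => if (k == null) (SPECIAL_VALUE, v) else (k, v) }
        .reduceByKey(_ + _)
        // ...and swap the sentinel back to null afterwards.
        .map { case (k, v) => if (k == SPECIAL_VALUE) (null: String, v) else (k, v) }

      summed.collect().foreach(println)
      sc.stop()
    }
  }

Mapping the sentinel back to null after the combine keeps downstream code
unchanged, so you can drop the whole thing once the fix is released.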

Best,
Andrew

2014-07-02 10:22 GMT-07:00 Konstantin Kudryavtsev <kudryavtsev.konstan...@gmail.com>:

> Hi all,
>
> I am hitting a very confusing exception running Spark 1.0 on HDP 2.1.
> While saving an RDD as a text file, I got:
>
>
> 14/07/02 10:11:12 WARN TaskSetManager: Loss was due to java.lang.NullPointerException
> java.lang.NullPointerException
>       at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$getMorePairs(ExternalAppendOnlyMap.scala:254)
>       at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:237)
>       at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:236)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:236)
>       at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:218)
>       at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:162)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>       at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>       at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>       at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>       at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>       at org.apache.spark.scheduler.Task.run(Task.scala:51)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
>
>
> Do you have any idea what this is? How can I debug this issue, or perhaps
> access another log?
>
>
> Thank you,
> Konstantin Kudryavtsev
>
