It's brave to broadcast 8 GB of pickled data; it will take more than 15 GB of
memory in each Python worker (roughly the pickled bytes plus the unpickled
dictionary itself). How much memory do you have on the executors and the
driver? Do you see any other exceptions in the driver or executor logs --
something related to serialization in the JVM, for example?
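
For reference, executor memory can be raised through SparkConf (or the
equivalent spark-submit flags); the driver heap normally has to be set before
the JVM starts, e.g. with --driver-memory on spark-submit or in
spark-defaults.conf. A rough sketch, with placeholder values rather than
recommendations:

    from pyspark import SparkConf, SparkContext

    # Placeholder sizes; adjust to what your cluster actually has. Each
    # executor JVM and each Python worker ends up holding its own copy of
    # the broadcast, so size them accordingly.
    conf = (SparkConf()
            .setAppName("broadcast-memory-sketch")
            .set("spark.executor.memory", "24g"))
    sc = SparkContext(conf=conf)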

On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
> I get this in the driver log:

I would expect this to show up on an executor rather than in the driver log --
or did you call first() or take() on the RDD?
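
Just to make sure we are talking about the same pattern, here is a minimal
sketch of what I am assuming the code looks like (all names invented, not your
actual script):

    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-dict-sketch")

    # Hypothetical stand-in for the real 8 GB dictionary.
    lookup = {i: str(i) for i in range(1000)}
    bc = sc.broadcast(lookup)

    rdd = sc.parallelize(range(100), 4)
    mapped = rdd.map(lambda k: bc.value.get(k, "missing"))

    # first()/take() pull results back through the driver, which is why a
    # failure while feeding the broadcast to the Python workers can show
    # up in the driver log as well as on the executors.
    print(mapped.take(5))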

> java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>
> and on one of the executors' stderr:
>
> 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>     split_index = read_int(infile)
>   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>     raise EOFError
> EOFError
>
>         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> Caused by: java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>         ... 4 more
>
>
> What I find odd is that when I create the broadcast object, the logs don't
> show any significant amount of memory being allocated in any of the block
> managers -- yet the dictionary is large, about 8 GB pickled on disk.
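
A quick way to cross-check that figure is to pickle the dictionary yourself
before broadcasting; just note that this builds a second serialized copy in
driver memory, so it only makes sense if there is headroom (a sketch, with
'lookup' standing in for your dictionary):

    import cPickle as pickle  # Python 2, as used by the Spark 1.2 workers

    # 'lookup' stands in for the dictionary being broadcast.
    pickled = pickle.dumps(lookup, protocol=2)
    print("pickled size: %.2f GB" % (len(pickled) / 1024.0 ** 3))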
>
>
> On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> Could you paste the NPE stack trace here? It would also be good to create a
>> JIRA for it, thanks!
>>
>> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
>>> I'm trying to use a broadcast dictionary inside a map function and am
>>> consistently getting Java null pointer exceptions. This is inside an IPython
>>> session connected to a standalone Spark cluster. I seem to recall being able
>>> to do this before, but at the moment I am at a loss as to what to try next.
>>> Is there a limit to the size of broadcast variables? This one is rather
>>> large (a few GB dict). Thanks!
>>>
>>> Rok
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
