Could you share a short script to reproduce this problem?

On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar <rokros...@gmail.com> wrote:
> I didn't notice other errors -- I also thought such a large broadcast is a
> bad idea, but I tried something similar with a much smaller dictionary and
> encountered the same problem. I'm not familiar enough with Spark internals
> to know whether the trace indicates an issue with the broadcast variables
> or perhaps something different.
>
> The driver and executors have 50 GB of RAM, so memory should be fine.
>
> Thanks,
>
> Rok
>
> On Feb 11, 2015 12:19 AM, "Davies Liu" <dav...@databricks.com> wrote:
>>
>> It's brave to broadcast 8 GB of pickled data; it will take more than 15 GB
>> of memory in each Python worker. How much memory do you have in the
>> executors and the driver? Do you see any other exceptions in the driver or
>> executors -- something related to serialization in the JVM?
>>
>> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
>> > I get this in the driver log:
>>
>> I think this should happen on an executor -- or did you call first() or
>> take() on the RDD?
>>
>> > java.lang.NullPointerException
>> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> >
>> > and in the stderr of one of the executors:
>> >
>> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
>> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>> >     split_index = read_int(infile)
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>> >     raise EOFError
>> > EOFError
>> >
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> > Caused by: java.lang.NullPointerException
>> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> >         ... 4 more
>> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
>> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>> >     split_index = read_int(infile)
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>> >     raise EOFError
>> > EOFError
>> >
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> > Caused by: java.lang.NullPointerException
>> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> >         ... 4 more
>> >
>> > What I find odd is that when I make the broadcast object, the logs don't
>> > show any significant amount of memory being allocated in any of the
>> > block managers -- but the dictionary is large: it's 8 GB pickled on disk.
>> >
>> > On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
>> >
>> >> Could you paste the NPE stack trace here? It would be better to create
>> >> a JIRA for it, thanks!
>> >>
>> >> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
>> >>> I'm trying to use a broadcasted dictionary inside a map function and
>> >>> am consistently getting Java null pointer exceptions. This is inside
>> >>> an IPython session connected to a standalone Spark cluster. I seem to
>> >>> recall being able to do this before, but at the moment I am at a loss
>> >>> as to what to try next. Is there a limit to the size of broadcast
>> >>> variables? This one is rather large (a few GB dict). Thanks!
>> >>>
>> >>> Rok
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
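
A minimal reproduction of the pattern described in this thread might look
something like the sketch below. This is illustrative only, not rok's actual
code: the dictionary contents, sizes, and variable names are made up, and the
dict is deliberately small here -- scale num_keys up to approach the ~8 GB
pickled size reported above.

    # sketch: broadcast a large dict and look it up inside a map (PySpark)
    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-npe-repro")

    num_keys = 10 ** 6                            # increase to grow the broadcast
    big_dict = {i: "x" * 100 for i in range(num_keys)}
    bd = sc.broadcast(big_dict)

    def lookup(k):
        # access the broadcast value inside the task
        return bd.value.get(k)

    rdd = sc.parallelize(range(num_keys), 100)
    print(rdd.map(lookup).take(5))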
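Relatedly, Davies' point that an 8 GB pickled broadcast needs well over 15 GB
per Python worker can be sanity-checked before calling sc.broadcast() by
measuring the pickled size of the candidate value. A rough, illustrative
sketch follows (the sample dict and helper name are made up, and pickling a
truly huge object this way temporarily doubles its memory footprint):

    import pickle

    def pickled_size_gib(obj):
        # size of the pickled byte string, in GiB
        return len(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)) / float(2 ** 30)

    sample_dict = {i: "x" * 100 for i in range(10 ** 5)}  # stand-in for the real dict
    print("pickled size: %.3f GiB" % pickled_size_gib(sample_dict))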