Could you share a short script to reproduce this problem?

On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar <rokros...@gmail.com> wrote:
> I didn't notice other errors -- I also thought such a large broadcast is a
> bad idea, but I tried something similar with a much smaller dictionary and
> encountered the same problem. I'm not familiar enough with Spark internals
> to know whether the trace indicates an issue with the broadcast variables
> or perhaps something different.
>
> The driver and executors have 50 GB of RAM, so memory should be fine.
>
> Thanks,
>
> Rok
>
> On Feb 11, 2015 12:19 AM, "Davies Liu" <dav...@databricks.com> wrote:
>>
>> It's brave to broadcast 8 GB of pickled data; it will take more than 15 GB
>> of memory in each Python worker. How much memory do you have in the
>> executors and the driver? Do you see any other exceptions in the driver or
>> executors -- something related to serialization in the JVM?
>>
>> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
>> > I get this in the driver log:
>>
>> I think this should happen on an executor -- or did you call first() or
>> take() on the RDD?
>>
>> > java.lang.NullPointerException
>> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> >
>> > and in the stderr of one of the executors:
>> >
>> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
>> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>> >     split_index = read_int(infile)
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>> >     raise EOFError
>> > EOFError
>> >
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> > Caused by: java.lang.NullPointerException
>> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> >         ... 4 more
>> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
>> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
>> >     split_index = read_int(infile)
>> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
>> >     raise EOFError
>> > EOFError
>> >
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
>> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>> > Caused by: java.lang.NullPointerException
>> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
>> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
>> >         ... 4 more
>> >
>> > What I find odd is that when I make the broadcast object, the logs don't
>> > show any significant amount of memory being allocated in any of the
>> > block managers -- but the dictionary is large: it's 8 GB pickled on disk.
>> >
>> > On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
>> >
>> >> Could you paste the NPE stack trace here? It would be better to create
>> >> a JIRA for it, thanks!
>> >>
>> >> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
>> >>> I'm trying to use a broadcasted dictionary inside a map function and
>> >>> am consistently getting Java null pointer exceptions. This is inside
>> >>> an IPython session connected to a standalone Spark cluster. I seem to
>> >>> recall being able to do this before, but at the moment I am at a loss
>> >>> as to what to try next. Is there a limit to the size of broadcast
>> >>> variables? This one is rather large (a few GB dict). Thanks!
>> >>>
>> >>> Rok
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
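
A minimal reproduction of the pattern described in this thread might look
something like the sketch below. This is illustrative only, not rok's actual
code: the dictionary contents, sizes, and variable names are made up, and the
dict is deliberately small here -- scale num_keys up to approach the ~8 GB
pickled size reported above.

    # sketch: broadcast a large dict and look it up inside a map (PySpark)
    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-npe-repro")

    num_keys = 10 ** 6                            # increase to grow the broadcast
    big_dict = {i: "x" * 100 for i in range(num_keys)}
    bd = sc.broadcast(big_dict)

    def lookup(k):
        # access the broadcast value inside the task
        return bd.value.get(k)

    rdd = sc.parallelize(range(num_keys), 100)
    print(rdd.map(lookup).take(5))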
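Relatedly, Davies' point that an 8 GB pickled broadcast needs well over 15 GB
per Python worker can be sanity-checked before calling sc.broadcast() by
measuring the pickled size of the candidate value. A rough, illustrative
sketch follows (the sample dict and helper name are made up, and pickling a
truly huge object this way temporarily doubles its memory footprint):

    import pickle

    def pickled_size_gib(obj):
        # size of the pickled byte string, in GiB
        return len(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)) / float(2 ** 30)

    sample_dict = {i: "x" * 100 for i in range(10 ** 5)}  # stand-in for the real dict
    print("pickled size: %.3f GiB" % pickled_size_gib(sample_dict))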