I think the problem was related to the broadcasts being too large -- I've now split it up into many smaller operations but it's still not quite there -- see http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html
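Roughly what I mean by "splitting it up" -- a simplified sketch with made-up names and a tiny stand-in dictionary (the real one is ~8 GB pickled), not the actual job:

from pyspark import SparkContext

sc = SparkContext(appName="chunked-broadcast-sketch")

# stand-in for the real lookup table, which is ~8 GB once pickled
big_dict = {i: str(i) for i in range(1000)}

def chunk_dict(d, n_chunks):
    """Split a dict into roughly n_chunks smaller dicts."""
    items = list(d.items())
    step = max(1, len(items) // n_chunks)
    return [dict(items[i:i + step]) for i in range(0, len(items), step)]

rdd = sc.parallelize(range(1000)).map(lambda k: (k, None))

# broadcast one small chunk at a time instead of the whole dict at once
for chunk in chunk_dict(big_dict, 16):
    bc = sc.broadcast(chunk)
    # bind bc as a default argument so each lambda keeps its own broadcast
    rdd = rdd.map(lambda kv, bc=bc: (kv[0], bc.value.get(kv[0], kv[1])))

# the chained maps build up a long lineage, so materialize at the end
rdd.cache()
print(rdd.count())

Each individual broadcast is small now; the chained maps are the part that's still not quite right -- that's what the thread linked above is about.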
Thanks,

Rok

On Wed, Feb 11, 2015, 19:59 Davies Liu <dav...@databricks.com> wrote:
> Could you share a short script to reproduce this problem?
>
> On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar <rokros...@gmail.com> wrote:
> > I didn't notice other errors -- I also thought such a large broadcast is a
> > bad idea, but I tried something similar with a much smaller dictionary and
> > encountered the same problem. I'm not familiar enough with Spark internals
> > to know whether the trace indicates an issue with the broadcast variables
> > or perhaps something different.
> >
> > The driver and executors have 50 GB of RAM, so memory should be fine.
> >
> > Thanks,
> >
> > Rok
> >
> > On Feb 11, 2015 12:19 AM, "Davies Liu" <dav...@databricks.com> wrote:
> >>
> >> It's brave to broadcast 8 GB of pickled data -- it will take more than
> >> 15 GB of memory in each Python worker. How much memory do you have in
> >> the executors and the driver?
> >> Do you see any other exceptions in the driver or executors? Something
> >> related to serialization in the JVM.
> >>
> >> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
> >> > I get this in the driver log:
> >>
> >> I think this should happen on an executor -- or did you call first() or
> >> take() on the RDD?
> >>
> >> > java.lang.NullPointerException
> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> >> >
> >> > and on one of the executors' stderr:
> >> >
> >> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> >> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
> >> >     split_index = read_int(infile)
> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
> >> >     raise EOFError
> >> > EOFError
> >> >
> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
> >> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
> >> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> >> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> >> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> >> > Caused by: java.lang.NullPointerException
> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >> >         ... 4 more
> >> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> >> > [...identical traceback and stack trace for a second crashed worker snipped...]
> >> >
> >> > What I find odd is that when I make the broadcast object, the logs don't
> >> > show any significant amount of memory being allocated in any of the block
> >> > managers -- but the dictionary is large, it's 8 GB pickled on disk.
> >> >
> >> > On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
> >> >
> >> >> Could you paste the NPE stack trace here? It would be better to create
> >> >> a JIRA for it, thanks!
> >> >>
> >> >> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
> >> >>> I'm trying to use a broadcasted dictionary inside a map function and am
> >> >>> consistently getting Java null pointer exceptions. This is inside an
> >> >>> IPython session connected to a standalone Spark cluster. I seem to
> >> >>> recall being able to do this before, but at the moment I am at a loss
> >> >>> as to what to try next. Is there a limit to the size of broadcast
> >> >>> variables? This one is rather large (a few GB dict). Thanks!
> >> >>>
> >> >>> Rok
> >> >>>
> >> >>> --
> >> >>> View this message in context:
> >> >>> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
> >> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
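P.S. In case anyone wants to reproduce this, the failing pattern in the original question is essentially just a large broadcast dict looked up inside a map. A simplified sketch (made-up names, and a much smaller stand-in dict -- the real one is ~8 GB pickled):

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-dict-npe")

# small stand-in; the real dictionary is several GB once pickled
lookup = {i: i * 2 for i in range(10 ** 6)}
bc = sc.broadcast(lookup)

keys = sc.parallelize(range(10 ** 6), 200)
# the NPE showed up once an action like take() forced this map to run
mapped = keys.map(lambda k: bc.value.get(k, -1))
print(mapped.take(5))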