I think the problem was related to the broadcasts being too large -- I've
now split the work up into many smaller operations (roughly along the lines
of the sketch below), but it's still not quite there -- see
http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html
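
A hedged sketch of what I mean by "splitting it up" -- the names, sizes, and
the chunk-by-hash scheme are made up for illustration, not the real job:

from pyspark import SparkContext

sc = SparkContext(appName="chunked-broadcast-sketch")

# hypothetical stand-in for the original single multi-GB dictionary
big_dict = {i: i * 2 for i in range(10 ** 6)}
rdd = sc.parallelize(range(10 ** 6))

def chunks(d, n_chunks=100):
    """Split one large dict into n_chunks smaller dicts by hashing the keys."""
    out = [dict() for _ in range(n_chunks)]
    for k, v in d.items():
        out[hash(k) % n_chunks][k] = v
    return out

# iteratively re-map the RDD, one small broadcast at a time, instead of
# shipping the whole dictionary in a single broadcast
result = rdd.map(lambda k: (k, None))
for piece in chunks(big_dict):
    bc = sc.broadcast(piece)
    # bc=bc captures the current broadcast in the lambda; keys not covered
    # by this piece keep whatever value they already have
    result = result.map(lambda kv, bc=bc: (kv[0], bc.value.get(kv[0], kv[1])))

print(result.take(5))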

Thanks,

Rok

On Wed, Feb 11, 2015, 19:59 Davies Liu <dav...@databricks.com> wrote:

> Could you share a short script to reproduce this problem?
>
> On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar <rokros...@gmail.com> wrote:
> > I didn't notice other errors -- I also thought such a large broadcast was a
> > bad idea, but I tried something similar with a much smaller dictionary and
> > encountered the same problem. I'm not familiar enough with Spark internals
> > to know whether the trace indicates an issue with the broadcast variables or
> > perhaps something different.
> >
> > The driver and executors have 50 GB of RAM, so memory should be fine; the
> > context is configured roughly as sketched below.
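> >
> > A rough sketch of the configuration (from memory; the master URL is a
> > placeholder, and spark.driver.memory only takes effect if it is set before
> > the driver JVM starts, e.g. via pyspark/spark-submit flags):
> >
> > from pyspark import SparkConf, SparkContext
> >
> > conf = (SparkConf()
> >         .setMaster("spark://<master-url>:7077")
> >         .set("spark.driver.memory", "50g")
> >         .set("spark.executor.memory", "50g"))
> > sc = SparkContext(conf=conf)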
> >
> > Thanks,
> >
> > Rok
> >
> > On Feb 11, 2015 12:19 AM, "Davies Liu" <dav...@databricks.com> wrote:
> >>
> >> It's brave to broadcast 8 GB of pickled data; it will take more than 15 GB of
> >> memory in each Python worker. How much memory do you have in the
> >> executors and the driver?
> >> Do you see any other exceptions in the driver and executors? Something
> >> related to serialization in the JVM, for example. A rough way to gauge the
> >> in-memory vs. pickled size is sketched below.
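> >>
> >> A rough, hedged way to compare the pickled size with the in-memory footprint
> >> on a scaled-down sample (the sample dict here is hypothetical; real numbers
> >> will differ):
> >>
> >> import pickle
> >> import sys
> >>
> >> # hypothetical scaled-down stand-in for the broadcast dictionary
> >> sample = {i: str(i) * 10 for i in range(100000)}
> >>
> >> pickled_size = len(pickle.dumps(sample, protocol=2))
> >>
> >> # very rough in-memory estimate: container plus keys and values
> >> # (ignores shared objects and other overheads)
> >> in_memory = sys.getsizeof(sample) + sum(
> >>     sys.getsizeof(k) + sys.getsizeof(v) for k, v in sample.items())
> >>
> >> print("pickled: %d bytes, approx in memory: %d bytes" % (pickled_size, in_memory))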
> >>
> >> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar <rokros...@gmail.com> wrote:
> >> > I get this in the driver log:
> >>
> >> I think this should happen on an executor -- unless you called first() or
> >> take() on the RDD?
> >>
> >> > java.lang.NullPointerException
> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> >> >
> >> > and this in one of the executors' stderr:
> >> >
> >> > 15/02/10 23:10:35 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> >> > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 57, in main
> >> >     split_index = read_int(infile)
> >> >   File "/cluster/home/roskarr/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 511, in read_int
> >> >     raise EOFError
> >> > EOFError
> >> >
> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
> >> >         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
> >> >         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
> >> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> >> >         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> >> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> >> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> >> > Caused by: java.lang.NullPointerException
> >> >         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
> >> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >> >         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> >         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> >> >         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> >> >         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
> >> >         ... 4 more
> >> >
> >> >
> >> > What I find odd is that when I create the broadcast object, the logs don't
> >> > show any significant amount of memory being allocated in any of the block
> >> > managers -- but the dictionary is large: it's 8 GB pickled on disk.
> >> >
> >> >
> >> > On Feb 10, 2015, at 10:01 PM, Davies Liu <dav...@databricks.com> wrote:
> >> >
> >> >> Could you paste the NPE stack trace here? It would be better to create a
> >> >> JIRA for it, thanks!
> >> >>
> >> >> On Tue, Feb 10, 2015 at 10:42 AM, rok <rokros...@gmail.com> wrote:
> >> >>> I'm trying to use a broadcast dictionary inside a map function and am
> >> >>> consistently getting Java null pointer exceptions. This is inside an IPython
> >> >>> session connected to a standalone Spark cluster. I seem to recall being able
> >> >>> to do this before, but at the moment I am at a loss as to what to try next.
> >> >>> Is there a limit to the size of broadcast variables? This one is rather
> >> >>> large (a few GB dict); a rough sketch of what I'm doing is below. Thanks!
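> >> >>>
> >> >>> A minimal sketch of the pattern (simplified; the dict contents and names
> >> >>> here are made up for illustration, the real dict is a few GB):
> >> >>>
> >> >>> from pyspark import SparkContext
> >> >>>
> >> >>> sc = SparkContext(appName="broadcast-dict-sketch")
> >> >>>
> >> >>> # hypothetical stand-in for the real multi-GB dictionary
> >> >>> lookup = {i: str(i) for i in range(1000)}
> >> >>> bc = sc.broadcast(lookup)
> >> >>>
> >> >>> rdd = sc.parallelize(range(1000))
> >> >>> # use the broadcast value inside a map; this is the step that fails for me
> >> >>> mapped = rdd.map(lambda k: bc.value.get(k))
> >> >>> print(mapped.take(5))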
> >> >>>
> >> >>> Rok
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >
>
