Yeah, I see that now. I think it fails immediately because the map operation tries to clean and verify the serializability of the closure up front.
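For example, a rough sketch like this (made-up Holder class, not your code) fails at the map() call itself, with no action involved, because the closure captures something that can't be serialized:

// Sketch only: map() cleans the closure and checks that it can be serialized
// before any job runs, so the failure shows up immediately.
class Holder(val x: Int)              // deliberately not Serializable
val h = new Holder(10)
val rdd = sc.parallelize(1 to 10)
rdd.map(i => i + h.x)                 // a "Task not serializable" error here,
                                      // without ever calling count() or collect()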
I'm not quite sure what's going on, but I think it's some odd interaction between how you build up the list, what its resulting in-memory representation looks like, and how the closure cleaner works, which can't be perfect. The shell adds yet another layer of complications. For example, slightly more canonical approaches work fine:

import scala.collection.mutable.MutableList
val lst = MutableList[(String,String,Double)]()
(0 to 10000).foreach(i => lst :+ ("10", "10", i.toDouble))

or just

val lst = (0 to 10000).map(i => ("10", "10", i.toDouble))

If you just need this to work, those may be better alternatives anyway. You could also check whether it reproduces outside the shell, since I suspect the shell is a factor.

The error isn't from Spark per se; Java serialization is telling you that some object's default serialization graph is very deep. It looks as if your code plus the closure cleaner ends up pulling in a huge linked list and serializing it node by node, the direct and unhelpful way. If you can work out exactly why it happens, you could open a JIRA, although arguably this is something that would be nice to have work rather than a Spark bug. You might also look at existing JIRAs about closures and the shell; this may match known behavior.
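To illustrate what a very deep default-serialization graph means outside Spark, here is a rough sketch with a made-up Node class (nothing from your code). Plain Java serialization recurses once per link, which is the same writeObject0 / defaultWriteFields pattern you see in your trace:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A naive singly linked list; default serialization recurses into `next` per node.
case class Node(value: Int, next: Node)
val head = (0 until 100000).foldLeft(null: Node)((rest, i) => Node(i, rest))

val out = new ObjectOutputStream(new ByteArrayOutputStream())
out.writeObject(head)   // java.lang.StackOverflowError once the chain is deep enough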
On Sun, Aug 30, 2015 at 8:08 PM, Ashish Shrowty <ashish.shro...@gmail.com> wrote:
> Sean .. does the code below work for you in the Spark shell? Ted got the
> same error -
>
> val a=10
> val lst = MutableList[(String,String,Double)]()
> Range(0,10000).foreach(i=>lst+=(("10","10",i:Double)))
> sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>
> -Ashish
>
> On Sun, Aug 30, 2015 at 2:52 PM Sean Owen <so...@cloudera.com> wrote:
>>
>> I'm not sure how to reproduce it? This code does not produce an error in
>> master.
>>
>> On Sun, Aug 30, 2015 at 7:26 PM, Ashish Shrowty <ashish.shro...@gmail.com> wrote:
>> > Do you think I should create a JIRA?
>> >
>> > On Sun, Aug 30, 2015 at 12:56 PM Ted Yu <yuzhih...@gmail.com> wrote:
>> >>
>> >> I got StackOverflowError as well :-(
>> >>
>> >> On Sun, Aug 30, 2015 at 9:47 AM, Ashish Shrowty <ashish.shro...@gmail.com> wrote:
>> >>>
>> >>> Yep .. I tried that too earlier. Doesn't make a difference. Are you
>> >>> able to replicate on your side?
>> >>>
>> >>> On Sun, Aug 30, 2015 at 12:08 PM Ted Yu <yuzhih...@gmail.com> wrote:
>> >>>>
>> >>>> I see.
>> >>>>
>> >>>> What about using the following in place of variable a ?
>> >>>>
>> >>>> http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
>> >>>>
>> >>>> Cheers
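(For concreteness, the broadcast-variable suggestion above would look roughly like this. This is a sketch that assumes the lst built earlier in the thread; whether it avoids this particular error is a separate question.)

val a = 10
val aBc = sc.broadcast(a)                          // ship the value to executors once, explicitly
sc.makeRDD(lst).map(i => if (aBc.value == 10) 1 else 0)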
>> >>>> On Sun, Aug 30, 2015 at 8:54 AM, Ashish Shrowty <ashish.shro...@gmail.com> wrote:
>> >>>>>
>> >>>>> @Sean - Agree that there is no action, but I still get the
>> >>>>> StackOverflowError, it's very weird
>> >>>>>
>> >>>>> @Ted - Variable a is just an int - val a = 10 ... The error happens
>> >>>>> when I try to pass a variable into the closure. The example you have
>> >>>>> above works fine since there is no variable being passed into the
>> >>>>> closure from the shell.
>> >>>>>
>> >>>>> -Ashish
>> >>>>>
>> >>>>> On Sun, Aug 30, 2015 at 9:55 AM Ted Yu <yuzhih...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Using Spark shell :
>> >>>>>>
>> >>>>>> scala> import scala.collection.mutable.MutableList
>> >>>>>> import scala.collection.mutable.MutableList
>> >>>>>>
>> >>>>>> scala> val lst = MutableList[(String,String,Double)]()
>> >>>>>> lst: scala.collection.mutable.MutableList[(String, String, Double)] = MutableList()
>> >>>>>>
>> >>>>>> scala> Range(0,10000).foreach(i=>lst+=(("10","10",i:Double)))
>> >>>>>>
>> >>>>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>> <console>:27: error: not found: value a
>> >>>>>>        val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>>                                           ^
>> >>>>>>
>> >>>>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(i._1==10) 1 else 0)
>> >>>>>> rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:27
>> >>>>>>
>> >>>>>> scala> rdd.count()
>> >>>>>> ...
>> >>>>>> 15/08/30 06:53:40 INFO DAGScheduler: Job 0 finished: count at <console>:30, took 0.478350 s
>> >>>>>> res1: Long = 10000
>> >>>>>>
>> >>>>>> Ashish:
>> >>>>>> Please refine your example to mimic more closely what your code
>> >>>>>> actually did.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> On Sun, Aug 30, 2015 at 12:24 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>>>>>>
>> >>>>>>> That can't cause any error, since there is no action in your first
>> >>>>>>> snippet. Even calling count on the result doesn't cause an error. You
>> >>>>>>> must be executing something different.
>> >>>>>>>
>> >>>>>>> On Sun, Aug 30, 2015 at 4:21 AM, ashrowty <ashish.shro...@gmail.com> wrote:
>> >>>>>>> > I am running the Spark shell (1.2.1) in local mode and I have a simple
>> >>>>>>> > RDD[(String,String,Double)] with about 10,000 objects in it. I get a
>> >>>>>>> > StackOverflowError each time I try to run the following code (the code
>> >>>>>>> > itself is just representative of other logic where I need to pass in a
>> >>>>>>> > variable). I tried broadcasting the variable too, but no luck ..
>> >>>>>>> > missing something basic here -
>> >>>>>>> >
>> >>>>>>> > val rdd = sc.makeRDD(List(<Data read from file>))
>> >>>>>>> > val a=10
>> >>>>>>> > rdd.map(r => if (a==10) 1 else 0)
>> >>>>>>> >
>> >>>>>>> > This throws -
>> >>>>>>> >
>> >>>>>>> > java.lang.StackOverflowError
>> >>>>>>> >         at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:318)
>> >>>>>>> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
>> >>>>>>> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> >>>>>>> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> >>>>>>> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> >>>>>>> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>> >>>>>>> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> >>>>>>> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> >>>>>>> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> >>>>>>> >         ...
>> >>>>>>> >
>> >>>>>>> > More experiments .. this works -
>> >>>>>>> >
>> >>>>>>> > val lst = Range(0,10000).map(i=>("10","10",i:Double)).toList
>> >>>>>>> > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>>> >
>> >>>>>>> > But below doesn't and throws the StackOverflowError -
>> >>>>>>> >
>> >>>>>>> > val lst = MutableList[(String,String,Double)]()
>> >>>>>>> > Range(0,10000).foreach(i=>lst+=(("10","10",i:Double)))
>> >>>>>>> > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>>> >
>> >>>>>>> > Any help appreciated!
>> >>>>>>> >
>> >>>>>>> > Thanks,
>> >>>>>>> > Ashish
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-and-StackOverFlowError-tp24508.html
>> >>>>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org