I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing). But to do that, it needs to remember all the operations you're running periodically, including those in foreachRDD.
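A minimal sketch of the driver-recovery pattern described above, assuming the standard `StreamingContext.getOrCreate` checkpoint API; the checkpoint path, host, port, and batch interval are placeholders, and it needs a running Spark deployment, so it is illustration rather than a runnable test:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs:///tmp/checkpoint" // placeholder path

  // All DStream setup, including foreachRDD calls, must happen inside this
  // factory function so the operations end up in the checkpoint and can be
  // replayed after a driver crash.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("recoverable")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    lines.foreachRDD { rdd => println(rdd.count()) }    // recorded in the checkpoint
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext; on restart it rebuilds the
    // DStream graph (including the foreachRDD function) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```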
Matei

> On Jan 27, 2015, at 6:15 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
> Hi,
>
> I want to do something like
>
>     dstream.foreachRDD(rdd => if (someCondition) ssc.stop())
>
> so in particular the function does not touch any element in the RDD and runs
> completely within the driver. However, this fails with a
> NotSerializableException because $outer is not serializable etc. The DStream
> code says:
>
>     def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
>       // because the DStream is reachable from the outer object here, and because
>       // DStreams can't be serialized with closures, we can't proactively check
>       // it for serializability and so we pass the optional false to SparkContext.clean
>       new ForEachDStream(this, context.sparkContext.clean(foreachFunc, false)).register()
>     }
>
> To be honest, I don't understand the comment. Why must that function be
> serializable even when there is no RDD action involved?
>
> Thanks
> Tobias
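The `$outer` problem Tobias describes can be reproduced outside Spark with plain JVM serialization. This is a sketch, not Spark's actual `ClosureCleaner` logic: `Outer` and `ClosureDemo` are hypothetical names, and `Outer` stands in for a non-serializable driver-side object such as a `StreamingContext`. A function literal that reads a field of its enclosing class captures `this` (the compiler's `$outer` reference), so serializing it drags the whole enclosing object along:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for a non-serializable driver-side object (e.g. StreamingContext).
class Outer {
  val field = 42
  // Referencing `field` means this closure captures `this` ($outer) --
  // just like a foreachRDD closure defined in a class that touches ssc.
  def makeClosure: () => Int = () => field
}

object ClosureDemo {
  // A closure that captures nothing serializes without trouble.
  val standalone: () => Int = () => 42

  // True if `obj` survives Java serialization, false on NotSerializableException.
  def trySerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(trySerialize(new Outer().makeClosure)) // false: $outer is not serializable
    println(trySerialize(standalone))              // true
  }
}
```

This also suggests the usual workaround for the `ssc.stop()` case: keep the closure from capturing the enclosing object, e.g. by pulling the condition check into a local function or a top-level object so nothing non-serializable is referenced.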