I reached a similar conclusion about checkpointing. It requires your entire computation to be serializable, even all of the 'local' bits, which makes sense. In my case I don't use checkpointing, and it is fine to restart the driver on failure without trying to recover its state.
What I haven't investigated is whether you can enable checkpointing for the state in updateStateByKey separately from this mechanism, which is exactly your question. What happens if you set a checkpoint dir, but do *not* use StreamingContext.getOrCreate, but *do* call DStream.checkpoint?

On Mon, Feb 23, 2015 at 8:47 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> On Wed, Jan 28, 2015 at 7:01 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> PS see https://issues.apache.org/jira/browse/SPARK-4196 for an example.
>
> OK, I am only realizing now the huge effect this has on my project. Even the
> simplest
>     stream.foreachRDD( ... logger.info(...) ... )
> pieces need to be rewritten when checkpointing is enabled, which is not so
> good. However, it should be possible, I think.
>
> But I think this makes it impossible to use Spark SQL together with Spark
> Streaming, as every
>     stream.foreachRDD( ... sqlc.sql("SELECT ...") ... )
> will fail with
>     java.io.NotSerializableException:
>     scala.util.parsing.combinator.Parsers$$anon$3
> etc.
>
> Actually, I don't want checkpointing/failure recovery; I only want to use
> updateStateByKey. Is there maybe any workaround to get updateStateByKey
> working while keeping my foreachRDD functions as they are?
>
> Thanks
> Tobias

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
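[A minimal sketch of the experiment proposed above, against the Spark Streaming 1.x API: set a checkpoint directory so updateStateByKey can keep its state RDDs, but construct the StreamingContext directly rather than via StreamingContext.getOrCreate, so the driver never attempts to restore itself from the checkpoint. The app name, checkpoint path, socket source, and the foreachRDD body are hypothetical placeholders; whether the closures must be serializable in this configuration is exactly the open question in the thread.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StateWithoutRecovery {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("state-without-recovery")  // hypothetical name

    // Built directly -- deliberately NOT StreamingContext.getOrCreate, so a
    // restarted driver starts fresh instead of recovering from the checkpoint.
    val ssc = new StreamingContext(conf, Seconds(10))

    // updateStateByKey still requires a checkpoint directory for its state RDDs.
    ssc.checkpoint("/tmp/state-checkpoint")  // hypothetical path

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
        // Running count per key, carried across batches via the checkpointed state.
        Some(values.sum + state.getOrElse(0))
      }

    // The question in the thread: with this setup, does this closure (and any
    // logger or SQLContext it captures) still have to be serializable?
    counts.foreachRDD { rdd => rdd.take(10).foreach(println) }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This only tests the DStream.checkpoint side; it says nothing about driver-failure recovery, which is the part that getOrCreate (and full closure serialization) exists for.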