I reached a similar conclusion about checkpointing. It requires your entire computation to be serializable, even all of the 'local' bits, which makes sense. In my case I don't use checkpointing, and it is fine to restart the driver on failure without trying to recover its state.
What I haven't investigated is whether you can enable checkpointing for the state in updateStateByKey separately from this mechanism, which is exactly your question. What happens if you set a checkpoint dir, but do *not* use StreamingContext.getOrCreate, but *do* call DStream.checkpoint?

On Mon, Feb 23, 2015 at 8:47 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> On Wed, Jan 28, 2015 at 7:01 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> PS see https://issues.apache.org/jira/browse/SPARK-4196 for an example.
>
> OK, I am only realizing now the huge effect this has on my project. Even the
> simplest
>     stream.foreachRDD( ... logger.info(...) ... )
> pieces need to be rewritten when checkpointing is enabled, which is not so
> good. However, it should be possible, I think.
>
> But I think this makes it impossible to use Spark SQL together with Spark
> Streaming, as every
>     stream.foreachRDD( ... sqlc.sql("SELECT ...") ... )
> will fail with
>     java.io.NotSerializableException:
>     scala.util.parsing.combinator.Parsers$$anon$3
> etc.
>
> Actually, I don't want checkpointing/failure recovery; I only want to use
> updateStateByKey. Is there maybe any workaround to get updateStateByKey
> working while keeping my foreachRDD functions as they are?
>
> Thanks
> Tobias

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
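[A minimal sketch of the experiment proposed above, against the Spark Streaming 1.x API: set a checkpoint directory so updateStateByKey can keep its state RDDs, but construct the StreamingContext directly rather than via StreamingContext.getOrCreate, so the driver never attempts to restore itself from the checkpoint. The app name, checkpoint path, socket source, and the foreachRDD body are hypothetical placeholders; whether the closures must be serializable in this configuration is exactly the open question in the thread.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StateWithoutRecovery {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("state-without-recovery")  // hypothetical name

    // Built directly -- deliberately NOT StreamingContext.getOrCreate, so a
    // restarted driver starts fresh instead of recovering from the checkpoint.
    val ssc = new StreamingContext(conf, Seconds(10))

    // updateStateByKey still requires a checkpoint directory for its state RDDs.
    ssc.checkpoint("/tmp/state-checkpoint")  // hypothetical path

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
        // Running count per key, carried across batches via the checkpointed state.
        Some(values.sum + state.getOrElse(0))
      }

    // The question in the thread: with this setup, does this closure (and any
    // logger or SQLContext it captures) still have to be serializable?
    counts.foreachRDD { rdd => rdd.take(10).foreach(println) }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This only tests the DStream.checkpoint side; it says nothing about driver-failure recovery, which is the part that getOrCreate (and full closure serialization) exists for.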