I am not sure what you mean by "data checkpoint continuously increases, leading to the recovery process taking time". Do you mean that in HDFS you are seeing RDD checkpoint files being continuously written but never deleted?
On Tue, Sep 23, 2014 at 2:40 AM, RodrigoB <rodrigo.boav...@aspect.com> wrote:
> Hi all,
>
> I've just started to take Spark Streaming recovery more seriously as things
> get more serious on the project roll-out. We need to ensure full recovery on
> all Spark levels - driver, receiver and worker.
>
> I started to do some tests today and became concerned with the current
> findings.
>
> I have an RDD in memory that gets updated through the updateStateByKey
> function, which is fed by an actor stream. Checkpointing is done at the
> default interval - 10 secs.
>
> Using the recipe in RecoverableNetworkWordCount, I'm recovering that same
> RDD. My initial expectation was that Spark Streaming would be clever enough
> to regularly delete old checkpoints, as TD mentions on the thread below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/checkpoint-and-not-running-out-of-disk-space-td1525.html
>
> Instead, I'm seeing the data checkpoint continuously increase, meaning the
> recovery process takes a huge amount of time to conclude, as the state-based
> RDD gets overwritten as many times as this application has been checkpointed
> since it first started. In fact, the only version I need is the one from the
> latest checkpoint.
>
> I'd rather not have to implement all of the recovery outside of Spark
> Streaming (as a few other challenges, like avoiding IO re-execution and
> event stream recovery, will need to be handled outside), so I really hope to
> have some strong control over this part.
>
> How does RDD data checkpoint cleaning happen? Would updateStateByKey be a
> particular case where there is no cleaning? Would I have to code the
> deletion outside of Spark? Sounds dangerous... I haven't looked at the code
> yet, but if someone already has that knowledge I would greatly appreciate
> some insight.
>
> Note: I'm solely referring to the data checkpoint, not the metadata
> checkpoint.
>
> Many Thanks,
> Rod
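
For concreteness, here is a minimal sketch of the setup being described - a stateful DStream data-checkpointed every 10 seconds, recovered with the StreamingContext.getOrCreate recipe from RecoverableNetworkWordCount. This is not Rod's actual code: the app name, checkpoint path, and socket source are illustrative stand-ins (his setup uses an actor stream).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Builds the streaming graph; runs on first start and is replayed from the
// checkpoint directory on driver recovery.
def createContext(checkpointDir: String): StreamingContext = {
  val conf = new SparkConf().setAppName("StatefulRecoverySketch")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)  // directory for metadata + data checkpoints

  // Illustrative stand-in source; Rod's setup feeds this from an actor stream.
  val events = ssc.socketTextStream("localhost", 9999)
  val pairs = events.map(word => (word, 1))

  // Running count per key, held in the state RDD that gets checkpointed.
  val updateFunc = (values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0))

  val stateStream = pairs.updateStateByKey[Int](updateFunc)
  stateStream.checkpoint(Seconds(10))  // data-checkpoint the state RDD every 10s
  stateStream.print()
  ssc
}

val checkpointDir = "hdfs:///tmp/stateful-checkpoints"  // illustrative path
// Recover the context from the checkpoint if one exists, else create it fresh
// (the RecoverableNetworkWordCount recipe).
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()

Per the thread linked in the quoted message, older data checkpoint files are expected to be deleted as newer ones supersede them; the question here is why, in this updateStateByKey setup, they appear to only accumulate.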