Without knowing more about what's being stored in your checkpoint directory / what the log output is, it's hard to say. But either way, just deleting the checkpoint directory probably isn't sufficient to restart the job...
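If you do want the job to come back up from the checkpoint properly, the usual pattern is to build the context through StreamingContext.getOrCreate rather than wiping the directory. Rough sketch only; the checkpoint path, app name, and batch interval below are placeholders, not values from your setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedRestart {
      // Placeholder HDFS path; point it at whatever your job actually checkpoints to.
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("KafkaDirectJob")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        // Define the Kafka direct stream and the rest of the DAG here.
        // This function only runs on a clean start; on restart the DAG is
        // rebuilt from the checkpoint instead.
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recovers from the checkpoint if one exists, otherwise builds a fresh context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }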
On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:

> OK. But one thing that I observed is that when there is a problem with the
> Kafka stream, the Streaming job does not restart unless I delete the
> checkpoint directory. I guess it tries to retry the failed tasks, and if it's
> not able to recover, it fails again. Sometimes it fails with a
> StackOverflowError.
>
> Why does the Streaming job not restart from the checkpoint directory when the
> job failed earlier because the Kafka brokers got messed up? We have the
> checkpoint directory in our HDFS.
>
> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> I don't think deleting the checkpoint directory is a good way to restart
>> the streaming job. You should stop the Spark context, or at the very least
>> kill the driver process, then restart.
>>
>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>>
>>> Hi Cody,
>>>
>>> Our job is our failsafe, as we don't have control over the Kafka stream as
>>> of now. Can setting rebalance max retries help? We do not have any monitors
>>> set up as of now; we need to set up the monitors.
>>>
>>> My idea is to have some kind of cron job that queries the Streaming
>>> API for monitoring, say every 5 minutes, then sends an email alert and
>>> automatically restarts the Streaming job by deleting the checkpoint
>>> directory. Would that help?
>>>
>>> Thanks!
>>>
>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>>> The direct stream will fail the task if there is a problem with the
>>>> Kafka broker. Spark will retry failed tasks automatically, which should
>>>> handle broker rebalances that happen in a timely fashion.
>>>> spark.task.maxFailures controls the maximum number of retries before
>>>> failing the job. The direct stream isn't any different from any other
>>>> Spark task in that regard.
>>>>
>>>> The question of what kind of monitoring you need is more a question for
>>>> your particular infrastructure and what you're already using for
>>>> monitoring. We put all metrics (application level or system level) into
>>>> Graphite and alert from there.
>>>>
>>>> I will say that if you've regularly got problems with Kafka falling
>>>> over for half an hour, I'd look at fixing that before worrying about Spark
>>>> monitoring...
>>>>
>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How do we recover Kafka Direct automatically when there is a problem
>>>>> with the Kafka brokers? Sometimes our Kafka brokers get messed up and the
>>>>> entire Streaming job blows up, unlike some other consumers which do
>>>>> recover automatically. How can I make sure that Kafka Direct recovers
>>>>> automatically when a broker fails for some time, say 30 minutes? What
>>>>> kind of monitors should be in place to recover the job?
>>>>>
>>>>> Thanks,
>>>>> Swetha
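For what it's worth, a rough sketch of the knobs mentioned above: raising spark.task.maxFailures and creating the direct stream. The broker list, topic, batch interval, and retry count here are made-up placeholders, not anything taken from this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamRetries {
      def main(args: Array[String]): Unit = {
        // Raise the per-task retry limit so a short broker hiccup or rebalance
        // can be ridden out before the whole job fails. 8 is an arbitrary example.
        val conf = new SparkConf()
          .setAppName("KafkaDirectJob")
          .set("spark.task.maxFailures", "8")

        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder broker list and topic name.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val topics = Set("events")

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        // Trivial action so the job has something to run each batch.
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }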