Without knowing more about what's being stored in your checkpoint directory
/ what the log output is, it's hard to say.  But either way, just deleting
the checkpoint directory probably isn't sufficient to restart the job...
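
For reference, here is a rough sketch of the usual checkpoint-recovery pattern,
which is what decides whether a restarted driver picks up the old state; the
checkpoint path and the body of createContext() below are just placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/streaming/checkpoint"  // placeholder path

    def createContext(): StreamingContext = {
      val sparkConf = new SparkConf().setAppName("kafka-direct-example")
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... build the Kafka direct stream and transformations here ...
      ssc
    }

    // If checkpointDir holds valid checkpoint data, the DStream graph is
    // rebuilt from it; otherwise createContext() is called to start fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

Deleting the directory only forces the fresh-start branch; either way you still
have to launch a new driver process to restart the job.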

On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:

> OK. But one thing I observed is that when there is a problem with the
> Kafka stream, the Streaming job does not restart unless I delete the
> checkpoint directory. I guess it tries to retry the failed tasks and, if
> it's not able to recover, it fails again. Sometimes it fails with a
> StackOverflowError.
>
> Why does the Streaming job not restart from the checkpoint directory when
> the job failed earlier because the Kafka brokers got messed up? Our
> checkpoint directory is in HDFS.
>
> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> I don't think deleting the checkpoint directory is a good way to restart
>> the streaming job, you should stop the spark context or at the very least
>> kill the driver process, then restart.
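>>
>> A minimal sketch, assuming ssc is the running StreamingContext: stop it
>> gracefully so in-flight batches finish, then relaunch the driver.
>>
>>     // Stops the StreamingContext and the underlying SparkContext, waiting
>>     // for queued / in-flight batches to complete before shutting down.
>>     ssc.stop(stopSparkContext = true, stopGracefully = true)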
>>
>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Hi Cody,
>>>
>>> Restarting our job is our failsafe, as we don't have control over the
>>> Kafka stream as of now. Can setting rebalance max retries help? We do
>>> not have any monitors set up as of now; we need to set them up.
>>>
>>> My idea is to have some kind of cron job that queries the Streaming API
>>> for monitoring, say every 5 minutes, and then sends an email alert and
>>> automatically restarts the Streaming job by deleting the checkpoint
>>> directory. Would that help?
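>>>
>>> Something like the following is what I have in mind for the core of that
>>> check; the host is a placeholder and the alert/restart steps would be
>>> filled in separately:
>>>
>>>     import scala.io.Source
>>>     import scala.util.{Failure, Success, Try}
>>>
>>>     // Placeholder driver host; 4040 is the default port for the driver UI
>>>     // and its REST API.
>>>     val statusUrl = "http://driver-host:4040/api/v1/applications"
>>>
>>>     Try(Source.fromURL(statusUrl).mkString) match {
>>>       case Success(_) => // driver responded, nothing to do
>>>       case Failure(_) => // send the email alert and trigger the restart here
>>>     }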
>>>
>>>
>>>
>>> Thanks!
>>>
>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> The direct stream will fail the task if there is a problem with the
>>>> Kafka broker.  Spark will retry failed tasks automatically, which should
>>>> handle broker rebalances that happen in a timely fashion.
>>>> spark.task.maxFailures controls the maximum number of retries before
>>>> failing the job.  The direct stream isn't any different from any other
>>>> Spark task in that regard.
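>>>>
>>>> For example, a one-line sketch of bumping it (the default is 4; the
>>>> value here is arbitrary):
>>>>
>>>>     import org.apache.spark.SparkConf
>>>>     val conf = new SparkConf().set("spark.task.maxFailures", "8")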
>>>>
>>>> The question of what kind of monitoring you need is more a question for
>>>> your particular infrastructure and what you're already using for
>>>> monitoring.  We put all metrics (application level or system level) into
>>>> graphite and alert from there.
>>>>
>>>> I will say that if you've regularly got problems with kafka falling
>>>> over for half an hour, I'd look at fixing that before worrying about spark
>>>> monitoring...
>>>>
>>>>
>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How can I recover Kafka Direct automatically when there is a problem
>>>>> with the Kafka brokers? Sometimes our Kafka brokers get messed up and
>>>>> the entire Streaming job blows up, unlike some other consumers which do
>>>>> recover automatically. How can I make sure that Kafka Direct recovers
>>>>> automatically when the brokers fail for some time, say 30 minutes? What
>>>>> kind of monitors should be in place to recover the job?
>>>>>
>>>>> Thanks,
>>>>> Swetha
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
