I think as long as you have adequate monitoring and Kafka retention, the simplest solution is safest - let it crash.

On May 14, 2015 4:00 PM, "badgerpants" <mark.stew...@tapjoy.com> wrote:
> We've been using the new DirectKafkaInputDStream to implement an exactly-once
> processing solution that tracks the provided offset ranges within the same
> transaction that persists our data results. When an exception is thrown
> within the processing loop and the configured number of retries is
> exhausted, the stream skips to the end of the failed range of offsets and
> continues on with the next RDD.
>
> That makes sense, but we're wondering how others would handle recovering from
> failures. In our case the cause of the exception was a temporary outage of a
> needed service. Since the transaction rolled back at the point of failure,
> our offset tracking table retained the correct offsets, so we simply needed
> to restart the Spark process, whereupon it happily picked up at the correct
> point and continued. Short of a restart, do people have any good ideas for
> how we might recover?
>
> FWIW, we've looked at setting the spark.task.maxFailures param to a large
> value and looked for a property that would increase the wait between
> attempts. This might mitigate the issue when the availability problem is
> short-lived, but it wouldn't completely eliminate the need to restart.
>
> Any thoughts or ideas welcome.
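
For readers following along, here is a minimal sketch of the pattern badgerpants describes: reading the offset ranges off each direct-stream RDD and committing them in the same transaction as the batch's results, so a failed batch rolls back both and a restart resumes at the last committed range. It assumes the Spark 1.3-era direct Kafka API; the broker list, JDBC URL, and the results/offsets table schemas are made-up placeholders, not anything from the original post.

import java.sql.DriverManager

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object ExactlyOnceSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("exactly-once-sketch"), Seconds(10))

    // Placeholder broker list and topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // Placeholder JDBC URL for whatever transactional store holds results and offsets.
    val jdbcUrl = "jdbc:postgresql://dbhost/app"

    stream.foreachRDD { rdd =>
      // Each RDD from the direct stream knows the Kafka offset ranges it covers;
      // grab them before applying any transformations.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Collected to the driver only to keep the sketch short.
      val values = rdd.map(_._2).collect()

      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        conn.setAutoCommit(false)

        // Persist the batch's results (assumed table: results(value)).
        val insert = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
        values.foreach { v => insert.setString(1, v); insert.addBatch() }
        insert.executeBatch()

        // Persist the offsets that produced those results in the same transaction
        // (assumed table: offsets(topic, kafka_partition, until_offset), rows pre-seeded).
        val upsert = conn.prepareStatement(
          "UPDATE offsets SET until_offset = ? WHERE topic = ? AND kafka_partition = ?")
        offsetRanges.foreach { o =>
          upsert.setLong(1, o.untilOffset)
          upsert.setString(2, o.topic)
          upsert.setInt(3, o.partition)
          upsert.addBatch()
        }
        upsert.executeBatch()

        // One commit covers data and offsets, so a failure rolls back both together.
        conn.commit()
      } catch {
        case e: Exception =>
          conn.rollback()
          throw e
      } finally {
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Because nothing is committed when a batch fails, the "let it crash" approach above just means restarting the job; it resumes from the offsets stored by the last successful commit.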