Ok, we can try that. Are there any other settings worth trying?
On Thu, Oct 5, 2017 at 20:42, Stas Chizhov <schiz...@gmail.com> wrote:

> I would set it to Integer.MAX_VALUE
>
> 2017-10-05 19:29 GMT+02:00 Dmitriy Vsekhvalnov <dvsekhval...@gmail.com>:
>
> > I see, but producer retries is set to 10 by default.
> >
> > What value would you recommend to survive random broker crashes?
> >
> > On Thu, Oct 5, 2017 at 8:24 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> >
> > > It is a producer config:
> > > https://docs.confluent.io/current/streams/developer-guide.html#non-streams-configuration-parameters
> > >
> > > 2017-10-05 19:12 GMT+02:00 Dmitriy Vsekhvalnov <dvsekhval...@gmail.com>:
> > >
> > > > replication.factor is set to match the source topics (3 in our case).
> > > >
> > > > What do you mean by retries? I don't see a retries property in the
> > > > StreamsConfig class.
> > > >
> > > > On Thu, Oct 5, 2017 at 7:55 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Have you set the replication.factor and retries properties?
> > > > >
> > > > > BR
> > > > >
> > > > > On Thu, Oct 5, 2017 at 18:45, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > we were testing Kafka cluster outages by randomly crashing broker
> > > > > > nodes (1 of 3, for instance) while still keeping the majority of
> > > > > > replicas available.
> > > > > >
> > > > > > From time to time our Kafka Streams app crashes with this exception:
> > > > > >
> > > > > > [ERROR] [StreamThread-1] [org.apache.kafka.streams.processor.internals.StreamThread]
> > > > > > [stream-thread [StreamThread-1] Failed while executing StreamTask 0_1 due to flush state: ]
> > > > > > org.apache.kafka.streams.errors.StreamsException: task [0_1] exception caught when producing
> > > > > >     at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
> > > > > >     at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:422)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
> > > > > >     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
> > > > > > Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for
> > > > > > audit-metrics-collector-store_context.operation-message.bucketStartMs-message.dc-repartition-0:
> > > > > > 30026 ms has passed since batch creation plus linger time
> > > > > >
> > > > > > After that, clearly, only a restart of the app gets it processing again.
> > > > > >
> > > > > > We believe this is correlated with our outage testing, and the
> > > > > > question is: what are the recommended options for making a Kafka
> > > > > > Streams application more resilient to broker crashes?
> > > > > >
> > > > > > Thank you.
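For concreteness, a minimal sketch of wiring both recommendations from this thread through StreamsConfig. The application id, broker list, and class name below are placeholder assumptions, not taken from the thread; retries is passed with the producer prefix because it is a plain producer setting rather than a Streams-level one:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ResilientStreamsProps {

    public static Properties build() {
        Properties props = new Properties();

        // Placeholders -- the real application id and broker list are not in this thread.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "audit-metrics-collector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");

        // Give internal changelog/repartition topics the same replication
        // factor as the source topics (3 here), so losing one broker does
        // not take out the only replica.
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);

        // "retries" is a producer config, so it goes through the producer
        // prefix (i.e. "producer.retries"); Integer.MAX_VALUE, as suggested
        // above, keeps the producer resending through a broker bounce
        // instead of surfacing the error and failing the StreamTask.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), Integer.MAX_VALUE);

        return props;
    }
}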