Hi, I'm running Kafka Streams v2.1.0 with a windowing function and 3 threads per node. The input traffic is about 120K messages/sec. A couple of minutes after each deployment, some threads hit a TimeoutException and go to the DEAD state.
```
2019-09-27 13:04:34,449 ERROR [client-StreamThread-2] o.a.k.s.p.i.AssignedStreamsTasks stream-thread [client-StreamThread-2] Failed to commit stream task 1_35 due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [1_35] Abort sending since an error caught with a previous record (key 174044298638db0 value [B@33423c0 timestamp 1569613689747) to topic XXX-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog due to org.apache.kafka.common.errors.TimeoutException: Expiring 45 record(s) for XXX-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog-35:300102 ms has passed since batch creation
You can increase producer parameter `retries` and `retry.backoff.ms` to avoid this error.
    at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:133)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 45 record(s) for XXX-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog-35:300102 ms has passed since batch creation
```

I changed `linger.ms` to 100 and `request.timeout.ms` to 300000. A couple of questions:

1) The app has a huge consumer lag on both the source topic and the internal repartition topics, ~5M messages and still growing. Could this lag lead to the timeout exception? My understanding is that the app polls more records than it can send out, even though the lag indicates it is still polling too slowly. Is that right?

2) I checked the thread states via `streams.localThreadsMetadata()` and found that many threads keep switching between RUNNING and PARTITIONS_REVOKED, and eventually get stuck in PARTITIONS_REVOKED. I didn't see any error messages in the log related to this. What would be a good approach to trace the cause?

Thanks a lot! Any help is appreciated!
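For reference, here is roughly how I applied the producer overrides (a sketch using plain `Properties`; the `producer.`-prefixed string keys stand in for what `StreamsConfig.producerPrefix("linger.ms")` etc. would produce, and the app id, broker address, and the `retries`/`retry.backoff.ms` values are illustrative, not my real config):

```java
import java.util.Properties;

public class StreamsProducerOverrides {
    // Builds the Streams config with producer-level overrides.
    // Keys are written as literal "producer."-prefixed strings, which is
    // what StreamsConfig.producerPrefix(...) resolves to.
    static Properties buildConfig() {
        Properties props = new Properties();
        props.put("application.id", "my-streams-app");   // hypothetical app id
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
        props.put("num.stream.threads", "3");            // 3 threads per node, as above

        // Overrides I applied after hitting the TimeoutException:
        props.put("producer.linger.ms", "100");
        props.put("producer.request.timeout.ms", "300000");

        // Suggested by the error message itself (values are guesses):
        props.put("producer.retries", "10");
        props.put("producer.retry.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        buildConfig().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```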
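For question 2, the only tracing idea I have so far is to poll `streams.localThreadsMetadata()` on a timer and log every state transition plus how long each thread has sat in its current state, so a thread stuck in PARTITIONS_REVOKED shows up with a timestamp I can correlate against broker/consumer logs. A minimal sketch of that bookkeeping (the Kafka Streams call itself is left out; `observe` would be fed each thread's `threadState()` string on every poll):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThreadStateTracer {
    private final Map<String, String> lastState = new HashMap<>();
    private final Map<String, Integer> pollsInState = new HashMap<>();
    private final List<String> transitions = new ArrayList<>();

    // In the real app, `state` would come from iterating
    // streams.localThreadsMetadata() and reading each ThreadMetadata's state.
    public void observe(String thread, String state) {
        String prev = lastState.put(thread, state);
        if (state.equals(prev)) {
            // Same state as last poll: the thread may be stuck.
            pollsInState.merge(thread, 1, Integer::sum);
        } else {
            pollsInState.put(thread, 1);
            if (prev != null) {
                transitions.add(thread + ": " + prev + " -> " + state);
            }
        }
    }

    // Number of consecutive polls the thread has reported its current state.
    public int pollsInCurrentState(String thread) {
        return pollsInState.getOrDefault(thread, 0);
    }

    public List<String> transitions() {
        return transitions;
    }
}
```

A thread whose `pollsInCurrentState` keeps growing while its state is PARTITIONS_REVOKED is the one to dig into (e.g. check its last committed offsets and the group coordinator logs around the transition time).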