Hi,

I'm running Kafka Streams v2.1.0 with a windowing function and 3 stream
threads per node. The input traffic is about 120K messages/sec. A couple
of minutes after deployment, some threads hit a TimeoutException and go
to the DEAD state.
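For context, the topology is essentially a windowed reduce (that is
where the KSTREAM-REDUCE-STATE-STORE changelog in the error comes from).
Below is a simplified sketch; the topic name, window size, serdes, and
application id are placeholders, only num.stream.threads = 3 matches the
real setup.

import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedReduceSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, byte[]>stream("input-topic")              // placeholder topic
               .groupByKey()
               .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))  // placeholder window
               .reduce((oldValue, newValue) -> newValue);          // backs the ...-REDUCE-STATE-STORE-...-changelog

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");           // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 3);              // 3 threads per node

        new KafkaStreams(builder.build(), props).start();
    }
}

The stack trace from one of the dying threads: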

2019-09-27 13:04:34,449 ERROR [client-StreamThread-2]
o.a.k.s.p.i.AssignedStreamsTasks stream-thread [client-StreamThread-2]
Failed to commit stream task 1_35 due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [1_35] Abort
sending since an error caught with a previous record (key
174044298638db0 value [B@33423c0 timestamp 1569613689747) to topic
XXX-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog due to
org.apache.kafka.common.errors.TimeoutException: Expiring 45 record(s)
for XXX-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog-35:300102 ms
has passed since batch creation
You can increase producer parameter `retries` and `retry.backoff.ms`
to avoid this error.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:133)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring
45 record(s) for
XXX-KSTREAM-REDUCE-STATE-STORE-0000000003-changelog-35:300102 ms has
passed since batch creation

So far I have changed the producer configs linger.ms to 100 and
request.timeout.ms to 300000.
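This is roughly how I applied them (a sketch; the commented-out lines
are the retries / retry.backoff.ms settings the exception message
suggests, which I have not touched yet and whose values here are only
illustrative):

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ProducerOverrides {
    static Properties producerOverrides() {
        Properties props = new Properties();
        // Already applied; producerPrefix() routes these to the internal
        // producers, including the one writing the changelog topic.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 100);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 300000);
        // Suggested by the exception message, not applied yet (illustrative values):
        // props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), Integer.MAX_VALUE);
        // props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 1000L);
        return props;
    }
}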

A couple of questions:
1) The app has a huge consumer lag on both the source topic and the
internal repartition topics, about 5M messages and still growing. Could
that lag lead to this timeout exception? My understanding is that the
app polls more messages than it can send out in time, even though the
lag indicates it is still polling too slowly? The kind of consumer
override I have in mind is sketched right after this question.
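To make concrete what I mean by "polls too many messages", this is the
kind of override I've been wondering about (not applied; the value is
purely illustrative):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ConsumerOverride {
    static Properties consumerOverride() {
        Properties props = new Properties();
        // Cap how many records each poll() hands to a stream thread, so fewer
        // records pile up behind slow producer sends. 500 is just an example
        // value, not something I've tested.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 500);
        return props;
    }
}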

2) I checked the thread states via streams.localThreadsMetadata() and
found that many threads keep switching between RUNNING and
PARTITIONS_REVOKED, and eventually get stuck in PARTITIONS_REVOKED. I
didn't see any related error messages in the logs. What would be a good
approach to trace the cause? The way I'm watching the states is sketched
below, in case the approach itself is part of the problem.
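For reference, this is roughly how I'm watching the states (a sketch;
the logging and interval are simplified):

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.streams.KafkaStreams;

public class StateTracer {
    // Call before streams.start(): the listener logs client-level transitions
    // (e.g. RUNNING -> REBALANCING), and the scheduled task dumps per-thread
    // states such as RUNNING / PARTITIONS_REVOKED every 10 seconds.
    static void trace(KafkaStreams streams) {
        streams.setStateListener((newState, oldState) ->
                System.out.println("client state: " + oldState + " -> " + newState));

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() ->
                streams.localThreadsMetadata().forEach(t ->
                        System.out.println(t.threadName() + ": " + t.threadState()
                                + " active tasks=" + t.activeTasks())),
                0, 10, TimeUnit.SECONDS);
    }
}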

Thanks a lot! Any help is appreciated!
