Hi,

We are running a Kafka Streams application (2.6.1) with standby replicas set to 1.
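
A minimal sketch of that setting, assuming the configuration is built as plain java.util.Properties. The class name, application id and bootstrap address are hypothetical placeholders; only the three config keys are real Kafka Streams settings.

```java
import java.util.Properties;

// Hypothetical helper, for illustration only.
final class StandbyConfigSketch {
    // Builds Kafka Streams properties with one standby replica per stateful task.
    static Properties streamsProps(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);       // placeholder value supplied by the caller
        props.put("bootstrap.servers", bootstrapServers); // placeholder value supplied by the caller
        props.put("num.standby.replicas", 1);             // one standby copy of each state store
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsProps("my-app", "localhost:9092");
        System.out.println(props.get("num.standby.replicas"));
    }
}
```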

Recently one of the instances showed unexpected behaviour. We observed
several DisconnectExceptions and TimeoutExceptions in the logs, caused by
request timeouts on a single stream thread. They were logged by the internal
restore consumer, which a standby task uses to consume store changelog
topics:

Row 247
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425439 UTC
  logger:    org.apache.kafka.clients.NetworkClient
  level:     DEBUG
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
             groupId=null] Disconnecting from node 1596506249 due to request timeout.
  exception: null

Row 248
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425446 UTC
  logger:    org.apache.kafka.clients.NetworkClient
  level:     DEBUG
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
             groupId=null] Disconnecting from node 1802747700 due to request timeout.
  exception: null

Row 249
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425463 UTC
  logger:    org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient
  level:     DEBUG
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
             groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH,
             apiVersion=11, clientId=<applicationName>-StreamThread-8-restore-consumer,
             correlationId=2102822) due to node 1596506249 being disconnected
  exception: null

Row 250
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425472 UTC
  logger:    org.apache.kafka.clients.FetchSessionHandler
  level:     INFO
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
             groupId=null] Error sending fetch request (sessionId=INVALID,
             epoch=INITIAL) to node 1596506249:
  exception: org.apache.kafka.common.errors.DisconnectException: null

After that, the restore consumer was able to retry and reconnect. The logs
above are DEBUG/INFO level because there were no ERROR logs at all.

However, the impact was that some of the active tasks on that instance
stopped processing events for some time: the input message delay
(calculated as CurrentTime - EventTime) spiked. At the same time, we could
not find anything concerning in the application logs (even with DEBUG
enabled) related to the active tasks or the main consumer/producer they
use.
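
To make the delay metric concrete, here is a minimal sketch of the CurrentTime - EventTime calculation. It assumes both values are epoch milliseconds; the class and method names are hypothetical and not taken from the application.

```java
// Hypothetical helper, for illustration only.
final class InputDelaySketch {
    // Input message delay in milliseconds: wall-clock time minus the event's own timestamp.
    static long inputDelayMs(long eventTimeMs, long currentTimeMs) {
        return currentTimeMs - eventTimeMs;
    }

    public static void main(String[] args) {
        long eventTimeMs = System.currentTimeMillis() - 2_000L; // pretend the event is about 2 s old
        System.out.println(inputDelayMs(eventTimeMs, System.currentTimeMillis()));
    }
}
```

A value that keeps growing for records of a given task means that task is not keeping up with its input, which is the symptom described above.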

So the question is: given that standby and active tasks share a thread,
could timeout/disconnect errors in the standby restore consumer, in theory,
also impact the processing latency of the active tasks?

William Hovnanyan
Software Engineer
EMAIL whovnan...@twilio.com
