Hi, We are running a KStreams application (2.6.1) with standby replicas set to 1.
Recently one of the instances showed unexpected behaviour. We observed several DisconnectExceptions and TimeoutExceptions in the logs, caused by request timeouts on a single stream thread. They were logged by the internal restore consumer, which a standby task uses to consume the store changelog topics:

Row 247
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425439 UTC
  logger:    org.apache.kafka.clients.NetworkClient
  level:     DEBUG
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer, groupId=null] Disconnecting from node 1596506249 due to request timeout.
  exception: null

Row 248
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425446 UTC
  logger:    org.apache.kafka.clients.NetworkClient
  level:     DEBUG
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer, groupId=null] Disconnecting from node 1802747700 due to request timeout.
  exception: null

Row 249
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425463 UTC
  logger:    org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient
  level:     DEBUG
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=<applicationName>-StreamThread-8-restore-consumer, correlationId=2102822) due to node 1596506249 being disconnected
  exception: null

Row 250
  thread:    <applicationName>-StreamThread-8
  timestamp: 2021-02-08 16:06:05.425472 UTC
  logger:    org.apache.kafka.clients.FetchSessionHandler
  level:     INFO
  message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1596506249:
  exception: org.apache.kafka.common.errors.DisconnectException: null

After this, the restore consumer was able to retry and connect. These are all DEBUG/INFO-level logs; there were no ERROR logs at all. However, the impact was that some of the active tasks on that instance stopped processing events for some time: the input message delay (calculated as CurrentTime-EventTime) spiked.
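To make the delay metric concrete, here is a minimal sketch of the CurrentTime-EventTime calculation. The class and method names are hypothetical and are not taken from the application; the timestamps in main are illustrative only.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper illustrating the delay metric; not the application's code.
public class InputDelay {

    // Delay = current wall-clock time minus the event's own timestamp, in milliseconds.
    static long delayMs(Instant currentTime, Instant eventTime) {
        return Duration.between(eventTime, currentTime).toMillis();
    }

    public static void main(String[] args) {
        // Illustrative values only: an event stamped 5 seconds before "now".
        Instant eventTime = Instant.parse("2021-02-08T16:06:00Z");
        Instant now = Instant.parse("2021-02-08T16:06:05Z");
        System.out.println("input message delay (ms): " + delayMs(now, eventTime));
    }
}
```

A value that keeps growing for a given task while events continue to arrive means the task is falling behind, which is what we saw during the spike.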
At the same time, we could not find anything concerning in the application logs (even with DEBUG enabled) related to the active tasks or to the main consumer/producer they use. So the question is: given that standby and active tasks share a thread, could timeout/disconnect errors in the standby task's restore consumer, in theory, also affect the processing latency of the active tasks?

William Hovnanyan
Software Engineer
EMAIL whovnan...@twilio.com