Hi Matthias,

Thanks for the reply.

Actually, I think no exception was thrown to the stream thread, as we cannot
see any ERROR logs or a rebalance. It looks like this error is handled at
the consumer level, by catching, logging and retrying the operation.

I can see that the polling for standby tasks is done in a non-blocking way
(Duration.ZERO is passed to .poll). This clears things up: the issue in the
restore consumer (whatever it is) cannot impact anything related to active
tasks in the same thread.

Actually, looking deeper into the DEBUG logs, I can see there was an attempt
by one of the active tasks to commit, which took too long to complete
(StreamProducer took some time to send all the changelog messages to the
brokers). This subsequently caused the spikes in delay, since the next poll
was delayed.
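
As a rough illustration of why a slow commit shows up as a delay spike, here
is a minimal sketch of a single-threaded loop that polls the main consumer,
processes, commits, and then polls the restore consumer with a zero timeout.
It is a toy model only, not Kafka Streams code: the class, method names and
all durations are hypothetical, and a simulated clock stands in for real I/O.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one stream thread's loop; every name here is hypothetical. */
final class StreamThreadLoopModel {

    /** Simulated clock in milliseconds, so the example runs instantly. */
    private long nowMs = 0;

    /** Simulated times at which the main consumer was polled. */
    private final List<Long> mainPollTimesMs = new ArrayList<>();

    /** One iteration: poll main consumer, process, commit, poll restore consumer. */
    void runIteration(long processMs, long commitMs) {
        mainPollTimesMs.add(nowMs); // main consumer is polled here
        nowMs += processMs;         // active tasks process the polled records
        nowMs += commitMs;          // commit waits for the producer to finish sending
        nowMs += 0;                 // zero-timeout standby poll is modeled as costing no time
    }

    /** Largest gap between two consecutive main-consumer polls. */
    long maxPollGapMs() {
        long max = 0;
        for (int i = 1; i < mainPollTimesMs.size(); i++) {
            max = Math.max(max, mainPollTimesMs.get(i) - mainPollTimesMs.get(i - 1));
        }
        return max;
    }

    public static void main(String[] args) {
        StreamThreadLoopModel loop = new StreamThreadLoopModel();
        loop.runIteration(20, 5);     // healthy iteration
        loop.runIteration(20, 3_000); // commit stalls, e.g. on a slow network
        loop.runIteration(20, 5);     // this poll only starts after the stall
        System.out.println("max gap between main polls (ms): " + loop.maxPollGapMs());
    }
}
```

In this model the standby poll never adds to the gap, while the stalled
commit pushes the next main-consumer poll back by its full duration, which
is the same shape as the delay spikes described above.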

This was caused by host-level network issues, which we verified by
observing the number of retransmitted TCP segments.
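
For anyone wanting to run a similar check, one possible way to watch
retransmitted segments on a Linux host is to read the RetransSegs counter
from /proc/net/snmp twice and compare. This is only a sketch under that
assumption (Linux, /proc available); it is not necessarily how the numbers
above were collected, and the class name and the 10 second interval are
made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that reads the cumulative TCP retransmit counter on Linux. */
final class TcpRetransmits {

    /**
     * Extracts RetransSegs from the two "Tcp:" lines of /proc/net/snmp
     * (a header line with field names, then a line with values).
     * Returns -1 if the counter cannot be found.
     */
    static long retransSegs(List<String> lines) {
        String[] header = null;
        for (String line : lines) {
            if (!line.startsWith("Tcp:")) {
                continue;
            }
            String[] fields = line.trim().split("\\s+");
            if (header == null) {
                header = fields; // first "Tcp:" line holds the field names
                continue;
            }
            int idx = Arrays.asList(header).indexOf("RetransSegs");
            return (idx >= 0 && idx < fields.length) ? Long.parseLong(fields[idx]) : -1;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        long before = retransSegs(Files.readAllLines(Paths.get("/proc/net/snmp")));
        Thread.sleep(10_000); // sampling interval chosen arbitrarily
        long after = retransSegs(Files.readAllLines(Paths.get("/proc/net/snmp")));
        System.out.println("TCP segments retransmitted in 10s: " + (after - before));
    }
}
```

The counter is cumulative since boot, so only the difference between two
samples is meaningful.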

Thanks a lot for the help!

Best Regards,
William Hovnanyan
Software Engineer
EMAIL whovnan...@twilio.com


On Tue, Feb 9, 2021 at 11:54 PM Matthias J. Sax <mj...@apache.org> wrote:

> Well, first of all, in current releases if any timeout exception
> happens, the corresponding thread dies. Thus, if a standby task throws,
> it would impact the active tasks of the same thread: the thread dies
> and all active and standby tasks need to be redistributed to the remaining
> threads/instances via a rebalance.
>
> We are actually improving this in the upcoming `2.8.0` release via KIP-572.
>
> Besides this, both consumers are used interleaved, i.e., the thread polls
> the main consumer, processes some records, polls the restore
> consumer, updates standby tasks, and so forth.
>
> Does this answer your question?
>
>
> -Matthias
>
> On 2/9/21 6:50 AM, William Hovnanyan wrote:
> > Hi,
> >
> > We are running a KStreams application (2.6.1) with standby replicas set
> > to 1.
> >
> > Recently one of the instances showed unexpected behaviour. We observed
> > several DisconnectException and TimeoutException entries in the logs due
> > to request timeouts for a single stream thread, logged by the internal
> > restore consumer, which is used by a standby task to consume store
> > changelog topics:
> >
> > Columns: row, thread, timestamp, logger, level, message, exception
> >
> > Row 247
> >   thread:    <applicationName>-StreamThread-8
> >   timestamp: 2021-02-08 16:06:05.425439 UTC
> >   logger:    org.apache.kafka.clients.NetworkClient
> >   level:     DEBUG
> >   message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
> >              groupId=null] Disconnecting from node 1596506249 due to request timeout.
> >   exception: null
> >
> > Row 248
> >   thread:    <applicationName>-StreamThread-8
> >   timestamp: 2021-02-08 16:06:05.425446 UTC
> >   logger:    org.apache.kafka.clients.NetworkClient
> >   level:     DEBUG
> >   message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
> >              groupId=null] Disconnecting from node 1802747700 due to request timeout.
> >   exception: null
> >
> > Row 249
> >   thread:    <applicationName>-StreamThread-8
> >   timestamp: 2021-02-08 16:06:05.425463 UTC
> >   logger:    org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient
> >   level:     DEBUG
> >   message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
> >              groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH,
> >              apiVersion=11, clientId=<applicationName>-StreamThread-8-restore-consumer,
> >              correlationId=2102822) due to node 1596506249 being disconnected
> >   exception: null
> >
> > Row 250
> >   thread:    <applicationName>-StreamThread-8
> >   timestamp: 2021-02-08 16:06:05.425472 UTC
> >   logger:    org.apache.kafka.clients.FetchSessionHandler
> >   level:     INFO
> >   message:   [Consumer clientId=<applicationName>-StreamThread-8-restore-consumer,
> >              groupId=null] Error sending fetch request (sessionId=INVALID,
> >              epoch=INITIAL) to node 1596506249:
> >   exception: org.apache.kafka.common.errors.DisconnectException: null
> >
> > After this, the restore consumer was able to retry and reconnect. These are
> > DEBUG/INFO level logs, as there were no ERROR logs at all.
> >
> > However, the impact was that some of the active tasks in that instance
> > were not processing events for some time, since the input message delay
> > had spiked (calculated as CurrentTime-EventTime). At the same time we
> > were not able to find anything concerning in the application logs (even
> > with DEBUG enabled) related to active tasks and the main
> > consumer/producer used by them.
> >
> > So the question is: given that the standby and active tasks share a
> > thread, if there are timeout/disconnect errors in the standby restore
> > consumer, could that in theory impact the processing latency of active
> > tasks as well?
> >
> > William Hovnanyan
> > Software Engineer
> > EMAIL whovnan...@twilio.com
> >
>
