[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951096#comment-16951096 ]

Tom Lee commented on KAFKA-8950:
--------------------------------

[~rsivaram] plaintext transport here too. Not sure we could provide the full 
thread dump, but for now I'm happy to answer specific questions if you have any 
hunches. Will see what I can figure out w.r.t. the thread dump.

[~wtjames] ah, I think I see what you're saying. So for consumer thread C and 
heartbeat/coordinator thread H (rough code sketch of the interleaving below the 
timeline):

C @ T=0: creates the RequestFuture, adds it to _unsent_ but does not quite get 
to the point where we add the listener
H @ T=1: somehow completes the future from T=0 (e.g. disconnect processing)
C @ T=2: adds the listener, which is immediately invoked on the calling thread 
and attempts to remove the nodesWithPendingFetchRequests entry before it has 
been added
C @ T=3: adds the nodesWithPendingFetchRequests entry, which will never be 
removed because the listener has already fired
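
Roughly, that window would look something like the following - a heavily 
simplified, hypothetical sketch of the ordering, not the real 
Fetcher/ConsumerNetworkClient code (the SimpleFuture class and the local 
names here are just stand-ins):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Heavily simplified, hypothetical sketch of the suspected interleaving.
// Not the real Fetcher/ConsumerNetworkClient code.
public class FetchRaceSketch {

    // Stand-in for a RequestFuture that invokes listeners added after
    // completion immediately on the calling thread.
    static class SimpleFuture {
        private final AtomicBoolean completed = new AtomicBoolean(false);
        private final CopyOnWriteArrayList<Runnable> listeners = new CopyOnWriteArrayList<>();

        void complete() {
            if (completed.compareAndSet(false, true)) {
                for (Runnable l : listeners) l.run();
            }
        }

        void addListener(Runnable listener) {
            listeners.add(listener);
            // Already complete: fire right away on the *calling* thread.
            if (completed.get()) listener.run();
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Integer, Boolean> nodesWithPendingFetchRequests = new ConcurrentHashMap<>();
        int nodeId = 0;

        SimpleFuture future = new SimpleFuture();             // C @ T=0: future created, queued as "unsent"
        future.complete();                                     // H @ T=1: e.g. disconnect processing completes it
        future.addListener(() ->                               // C @ T=2: listener fires immediately, removing
                nodesWithPendingFetchRequests.remove(nodeId)); //          an entry that hasn't been added yet
        nodesWithPendingFetchRequests.put(nodeId, true);       // C @ T=3: entry added, never removed

        // The node now looks permanently "pending", so no further fetches go to it.
        System.out.println("stuck pending entry: " + nodesWithPendingFetchRequests);
    }
}
{code}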

The only thing I'm not seeing now is how the futures could actually get 
completed/failed directly on H ... from what I can see they'd typically be 
enqueued into _pendingCompletion_ by the RequestFutureCompletionHandler and 
processed on the consumer thread rather than being called directly (rough 
sketch of that path below). It would only need to happen once, though. Very 
interesting, and it does seem precarious at the very least.
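
For reference, the deferred-completion path I'm describing would look roughly 
like this - again a simplified, assumed sketch rather than the actual 
ConsumerNetworkClient code: the completion handler gets queued on whichever 
thread sees the response/disconnect, and the futures (and their listeners) 
only fire when the consumer thread drains the queue during poll.

{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified, assumed sketch of the deferred-completion pattern described
// above; not the actual ConsumerNetworkClient implementation.
class DeferredCompletionSketch {

    // Stand-in for the pendingCompletion queue.
    private final ConcurrentLinkedQueue<Runnable> pendingCompletion = new ConcurrentLinkedQueue<>();

    // Response/disconnect handling (possibly on another thread) enqueues the
    // completion handler instead of completing the future directly.
    void onResponseOrDisconnect(Runnable completionHandler) {
        pendingCompletion.add(completionHandler);
    }

    // Consumer thread drains the queue inside poll(); this is where the
    // futures and their listeners would normally fire, hence my surprise at
    // a direct completion on the heartbeat thread.
    void firePendingCompletedRequests() {
        Runnable handler;
        while ((handler = pendingCompletion.poll()) != null) {
            handler.run();
        }
    }
}
{code}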

> KafkaConsumer stops fetching
> ----------------------------
>
>                 Key: KAFKA-8950
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8950
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.3.0
>         Environment: linux
>            Reporter: Will James
>            Priority: Major
>
> We have a KafkaConsumer consuming from a single partition with 
> enable.auto.commit set to true.
> Very occasionally, the consumer goes into a broken state. It returns no 
> records from the broker with every poll, and from most of the Kafka metrics 
> in the consumer it looks like it is fully caught up to the end of the log. 
> We see that we are long polling for the max poll timeout, and that there is 
> zero lag. In addition, we see that the heartbeat rate stays unchanged from 
> before the issue begins (so the consumer stays a part of the consumer group).
> In addition, from looking at the __consumer_offsets topic, it is possible to 
> see that the consumer is committing the same offset on the auto commit 
> interval; however, the offset does not move, and the lag from the broker's 
> perspective continues to increase.
> The issue is only resolved by restarting our application (which restarts the 
> KafkaConsumer instance).
> From a heap dump of an application in this state, I can see that the Fetcher 
> is in a state where it believes there are nodesWithPendingFetchRequests.
> However, I can see the state of the fetch latency sensor, specifically the 
> fetch rate, and the samples were not updated for a long period of time 
> (precisely the amount of time that the problem in our application was 
> occurring, around 50 hours - we have alerting on other metrics but not the 
> fetch rate, so we didn't notice the problem until a customer complained).
> In this example, the consumer was processing around 40 messages per second, 
> with an average size of about 10kb, although most of the other examples of 
> this have happened with higher volume (250 messages / second, around 23kb per 
> message on average).
> I have spent some time investigating the issue on our end, and will continue 
> to do so as time allows; however, I wanted to raise this as an issue because 
> it may be affecting other people.
> Please let me know if you have any questions or need additional information. 
> I doubt I can provide heap dumps unfortunately, but I can provide further 
> information as needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
