[ 
https://issues.apache.org/jira/browse/KAFKA-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841226#comment-16841226
 ] 

Randall Hauch commented on KAFKA-7941:
--------------------------------------

It looks like KAFKA-6608 changed the `consumer.endOffsets(...)` methods in AK 
2.0 to add an optional timeout parameter, and for the existing method to 
default to the value set by the `request.timeout.ms` consumer property, which 
itself defaults to 30 seconds. This added the possibility of a TimeoutException 
on these methods, which didn't have it before AK 2.0.

So, one workaround is to set the `request.timeout.ms` property in the Connect 
worker's configuration, which the worker applies to the consumer it uses for 
offsets and other internal topics. Note that doing this will affect the 
producers of internal components, too, and unless it's overridden with the 
worker's `producer.request.timeout.ms` or `consumer.request.timeout.ms`, it 
will also apply to the producers and consumers used for connectors.
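
To make the layering of those three keys concrete, here is a rough sketch that 
builds the relevant worker settings as `java.util.Properties`. The class name 
`WorkerTimeoutConfig` and the timeout values are purely illustrative 
assumptions, not recommendations; only the three property names come from the 
comment above.

```java
import java.util.Properties;

/** Hypothetical helper, for illustration only; not part of Kafka Connect. */
final class WorkerTimeoutConfig {

    /**
     * Builds worker settings in which the top-level request.timeout.ms is
     * raised for the worker's internal clients, while the prefixed keys pin
     * the connector producers and consumers to a separate value.
     */
    static Properties build(int internalTimeoutMs, int connectorTimeoutMs) {
        Properties props = new Properties();
        // Top-level key: applies to the worker's internal consumer and producer.
        props.setProperty("request.timeout.ms", Integer.toString(internalTimeoutMs));
        // Prefixed keys: override the value for clients used by connectors.
        props.setProperty("producer.request.timeout.ms", Integer.toString(connectorTimeoutMs));
        props.setProperty("consumer.request.timeout.ms", Integer.toString(connectorTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Arbitrary example values: 120 s for internal clients, 30 s for connectors.
        Properties props = build(120_000, 30_000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The printed lines are in the same `key=value` form a worker properties file 
uses. Without the two prefixed keys, the top-level value would also reach the 
connector clients, as noted above.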

> Connect KafkaBasedLog work thread terminates when getting offsets fails 
> because broker is unavailable
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7941
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7941
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Paul Whalen
>            Assignee: Paul Whalen
>            Priority: Minor
>
> My team has run into this Connect bug regularly in the last six months while 
> doing infrastructure maintenance that causes intermittent broker availability 
> issues.  I'm a little surprised it exists given how routinely it affects us, 
> so perhaps someone in the know can point out if our setup is somehow just 
> incorrect.  My team is running 2.0.0 on both the broker and client, though 
> from what I can tell from reading the code, the issue continues to exist 
> through 2.2; at least, I was able to write a failing unit test that I believe 
> reproduces it.
> When a {{KafkaBasedLog}} worker thread in the Connect runtime calls 
> {{readLogToEnd}} and brokers are unavailable, the {{TimeoutException}} from 
> the consumer {{endOffsets}} call is uncaught all the way up to the top level 
> {{catch (Throwable t)}}, effectively killing the thread until restarting 
> Connect.  The result is Connect stops functioning entirely, with no 
> indication except for that log line - tasks still show as running.
> The proposed fix is to simply catch and log the {{TimeoutException}}, 
> allowing the worker thread to retry forever.
> Alternatively, perhaps there is not an expectation that Connect should be 
> able to recover following broker unavailability, though that would be 
> disappointing.  I would at least hope for a louder failure than the 
> single {{ERROR}} log.
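
The catch-and-retry idea proposed in the description above could look roughly 
like the following. This is only a sketch of the pattern, not the actual 
patch: `RetryOnTimeout`, `callUntilSuccess`, and the nested `TimeoutException` 
are hypothetical stand-ins (the nested class mimics 
`org.apache.kafka.common.errors.TimeoutException` so the example compiles 
without Kafka on the classpath), and the fixed backoff is an assumption.

```java
import java.util.function.Supplier;

/** Hypothetical illustration of retrying a call that may time out. */
final class RetryOnTimeout {

    /** Stand-in for Kafka's TimeoutException, so this sketch has no dependencies. */
    static final class TimeoutException extends RuntimeException {
        TimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Invokes the call until it returns normally. A TimeoutException is logged
     * and retried after a fixed backoff; any other exception propagates.
     */
    static <T> T callUntilSuccess(Supplier<T> call, long backoffMs) {
        while (true) {
            try {
                return call.get();
            } catch (TimeoutException e) {
                // Log and retry instead of letting the exception kill the thread.
                System.err.println("Timed out, will retry: " + e.getMessage());
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and give up rather than spin.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated endOffsets-style call that times out twice before succeeding.
        Supplier<Long> flaky = () -> {
            if (++attempts[0] < 3) {
                throw new TimeoutException("broker unavailable");
            }
            return 42L;
        };
        long endOffset = callUntilSuccess(flaky, 10L);
        System.out.println("end offset " + endOffset + " after " + attempts[0] + " attempts");
    }
}
```

Retrying forever matches the proposal as written. A real implementation would 
probably also need to check a stop flag in the loop so the thread can still be 
shut down cleanly.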



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
