[ 
https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Candice Wan updated KAFKA-7802:
-------------------------------
    Description: 
We recently upgraded to 2.1.0. Since then, several times per day, we have observed brokers becoming disconnected while other brokers were trying to fetch replicas from them. This issue takes down the whole cluster, leaving all producers and consumers unable to publish or consume messages. It can be quickly worked around by restarting the problematic broker.

Here is an example of what we're seeing in a broker that was trying to send a fetch request to the problematic one:

2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO 
o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) 
to node 3: java.io.IOException: Connection to 3 was disconnected before the 
response was read.
 2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO 
o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) 
to node 3: java.io.IOException: Connection to 3 was disconnected before the 
response was read.
 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN 
kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 
replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, 
maxBytes=1048576, currentLeaderEpoch=Optional[178])}, 
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, 
epoch=1599941))
 java.io.IOException: Connection to 3 was disconnected before the response was read
   at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
   at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
   at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
   at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
   at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
   at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
   at scala.Option.foreach(Option.scala:257)
   at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
   at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
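
For reference, a quick way to check from another host whether the problematic broker is still answering metadata requests is a small AdminClient probe along the lines of the sketch below (Java client; the bootstrap address and timeouts are placeholders, not our actual configuration):

{code:java}
// Rough diagnostic sketch, not part of our setup: probe whether the suspect
// broker still answers metadata requests. Address and timeouts are placeholders.
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class BrokerProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-3.example.com:9092"); // placeholder
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "10000");

        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() sends a metadata request; if the broker never responds,
            // this get() times out instead of printing the node list.
            System.out.println("Nodes: " + admin.describeCluster().nodes().get(15, TimeUnit.SECONDS));
        }
    }
}
{code}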

 

 

Below are the suspicious logs from the problematic broker at the time the issue happened:

2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has 
failed, removing it from the group
 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
rebalance group talon-instance1 in state PreparingRebalance with old generation 270 (__consumer_offsets-47) (reason: removing member 
consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
 2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group 
Notifications.ASIA1546980352799 has failed, removing it from the group
 2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance 
with old generation 1 (__consumer_offsets-44) (reason: removing member 
consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
 2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
Notifications.ASIA1546980352799 with generation 2 is now empty 
(__consumer_offsets-44)
 2019-01-09 08:04:50.388 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-3-0a4c55c2-9f31-4e7a-b0d7-1f057dceb03d in group talon-instance1 has 
failed, removing it from the group
 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 in group talon-kafka-vision has 
failed, removing it from the group
 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
rebalance group talon-kafka-vision in state PreparingRebalance with old 
generation 9 (__consumer_offsets-26) (reason: removing member 
consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 on heartbeat expiration)
 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
talon-kafka-vision with generation 10 is now empty (__consumer_offsets-26)
 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 in group talon-hades-instance1 
has failed, removing it from the group
 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
rebalance group talon-hades-instance1 in state PreparingRebalance with old 
generation 122 (__consumer_offsets-11) (reason: removing member 
consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 on heartbeat expiration)
 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
talon-hades-instance1 with generation 123 is now empty (__consumer_offsets-11)
 2019-01-09 08:04:50.422 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-4-a527e579-7a14-471b-b19d-ffec50074bb8 in group talon-instance1 has 
failed, removing it from the group
 2019-01-09 08:04:50.434 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-4-0c470e05-5e9a-4cae-a493-9854a6d0c8a7 in group talon-instance1 has 
failed, removing it from the group
 2019-01-09 08:04:50.514 [executor-Heartbeat] INFO 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
consumer-2-155ea6c8-c90f-4af6-b65e-138a151d77d9 in group talon-instance1 has 
failed, removing it from the group
 2019-01-09 08:04:55.297 [executor-Produce] WARN 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write 
empty metadata for group Notifications.ASIA1546980352799: The group is 
rebalancing, so a rejoin is needed.
 2019-01-09 08:04:55.419 [executor-Produce] WARN 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write 
empty metadata for group talon-kafka-vision: The group is rebalancing, so a 
rejoin is needed.
 2019-01-09 08:04:55.420 [executor-Produce] WARN 
k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write 
empty metadata for group talon-hades-instance1: The group is rebalancing, so a 
rejoin is needed.
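
For context, these "heartbeat expiration" evictions are governed by each consumer's session.timeout.ms and heartbeat.interval.ms: the coordinator removes a member when it does not receive a heartbeat within the session timeout, which then triggers the rebalances above. A minimal consumer configuration sketch for reference (Java client; the values shown are the Kafka defaults, not necessarily what these applications use):

{code:java}
// Illustrative settings behind the "heartbeat expiration" removals above.
// Values are Kafka defaults, not the configuration of the affected consumers.
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class HeartbeatConfigSketch {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-3.example.com:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");                      // placeholder
        // The coordinator evicts a member if no heartbeat arrives within session.timeout.ms.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
        // Heartbeats are sent from a background thread every heartbeat.interval.ms.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}
{code}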

 

We also took a thread dump of the problematic broker (attached). We found that all of the kafka-request-handler threads were hanging, waiting on locks, which looks like a resource (lock) leak.
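
The handler states can be summarised from the attached dump with a small helper like the sketch below (the parsing is approximate and the file path is a placeholder for the attached thread_dump.log):

{code:java}
// Rough sketch: count kafka-request-handler threads in the dump by thread state.
// Assumes jstack-style output: a quoted thread-name header line followed by a
// "java.lang.Thread.State:" line. File path is a placeholder.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequestHandlerDumpSummary {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("thread_dump.log"));
        Map<String, Integer> stateCounts = new HashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).startsWith("\"kafka-request-handler")) {
                for (int j = i + 1; j < lines.size() && !lines.get(j).trim().isEmpty(); j++) {
                    String next = lines.get(j).trim();
                    if (next.startsWith("java.lang.Thread.State:")) {
                        String state = next.substring("java.lang.Thread.State:".length()).trim();
                        stateCounts.merge(state, 1, Integer::sum);
                        break;
                    }
                }
            }
        }
        // In our dump, every request handler was waiting on a lock rather than RUNNABLE.
        stateCounts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
{code}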

 

The Java version we are running is 11.0.1.

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---------------------------------------------------------------
>
>                 Key: KAFKA-7802
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7802
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.0
>            Reporter: Candice Wan
>            Priority: Critical
>         Attachments: thread_dump.log
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
