[ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]
Varsha Abhinandan updated KAFKA-8673:
-------------------------------------
    Description: 
We observed a deadlock-like situation in our Kafka Streams application 
when we accidentally shut down all the brokers. The Kafka cluster was brought 
back up about an hour later. 

Observations:
 # Plain Kafka producers and consumers resumed working normally once the 
brokers were back up. 
 # The Kafka Streams applications were stuck in the "rebalancing" state.
 # The Kafka Streams apps have exactly-once semantics enabled.
 # The stack traces showed most of the stream threads sending join group 
requests to the group coordinator.
 # A few stream threads couldn't initiate the join group request because their 
call to 
[org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung]
 was stuck.
 # The join group requests appear to have been parked at the coordinator 
because the expected members hadn't sent their own join group requests.
 # After the timeout, the stream threads that were not stuck sent new 
join group requests.
 # Steps (6) and (7) may be repeating indefinitely.
 # Sample values of the GroupMetadata object on the group coordinator: 
https://issues.apache.org/jira/secure/attachment/12974837/Screen%20Shot%202019-07-11%20at%2012.08.09%20PM.png
 # The client IDs in the notYetJoinedMembers list matched the threads waiting 
for their offsets to be committed. 
{code:java}
[List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, clientHost=/10.136.99.15, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]

vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on condition"
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x00007fc53c047800 nid=0xac waiting on condition [0x00007fc4e68e7000]
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21" #93 daemon prio=5 os_prio=0 tid=0x00007fc53c2b5800 nid=0x9d waiting on condition [0x00007fc4e77f6000]
"metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33" #125 daemon prio=5 os_prio=0 tid=0x00007fe18017c800 nid=0xbc waiting on condition [0x00007fe12e7e8000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 daemon prio=5 os_prio=0 tid=0x00007f27c4225800 nid=0xc4 waiting on condition [0x00007f2772bec000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27" #143 daemon prio=5 os_prio=0 tid=0x00007f27c4365800 nid=0xb9 waiting on condition [0x00007f27736f7000]
{code}
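
The cross-check in observation 10 can be reproduced with a short script. The following is a minimal sketch in Python, not part of the original investigation: the function and variable names are illustrative assumptions, and the two excerpts hold values copied from the dump and jstack output above (two of the five members, each joined onto one line). It strips the "-consumer" suffix from each clientId and compares the result with the names of the stream threads reported as "waiting on condition". It also converts rebalanceTimeoutMs to days.

```python
import re

# Excerpts copied from the report above (two of the five members).
# All names in this script are illustrative assumptions.
PREFIX = "metric-extractor-stream-c1-"
GROUP_METADATA_EXCERPT = (
    "MemberMetadata(memberId=" + PREFIX + "d9ac8890-cd80-4b75-a85a-2ff39ea27961"
    "-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, "
    "clientId=" + PREFIX + "d9ac8890-cd80-4b75-a85a-2ff39ea27961"
    "-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, "
    "rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), "
    "MemberMetadata(memberId=" + PREFIX + "4875282b-1f26-47cd-affd-23ba5f26787a"
    "-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, "
    "clientId=" + PREFIX + "4875282b-1f26-47cd-affd-23ba5f26787a"
    "-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, "
    "rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), )"
)

JSTACK_EXCERPT = "\n".join([
    '"' + PREFIX + '4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon '
    "prio=5 os_prio=0 tid=0x00007fc53c047800 nid=0xac waiting on condition [0x00007fc4e68e7000]",
    '"' + PREFIX + 'd9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 daemon '
    "prio=5 os_prio=0 tid=0x00007f27c4225800 nid=0xc4 waiting on condition [0x00007f2772bec000]",
])

# clientId=<thread name>-consumer; the capture group is the thread name.
CLIENT_ID = re.compile(r"clientId=([^,\s]+)-consumer\b")
# A jstack header line for a stream thread that is waiting on a condition.
WAITING_THREAD = re.compile(
    r'^"([^"]+-StreamThread-\d+)".*waiting on condition', re.MULTILINE
)
REBALANCE_TIMEOUT = re.compile(r"rebalanceTimeoutMs=(\d+)")


def pending_member_threads(dump: str) -> set:
    """Thread names derived from the clientIds of members that have not rejoined."""
    return set(CLIENT_ID.findall(dump))


def waiting_threads(jstack: str) -> set:
    """Names of stream threads that jstack reports as waiting on a condition."""
    return set(WAITING_THREAD.findall(jstack))


def rebalance_timeout_days(dump: str) -> float:
    """First rebalanceTimeoutMs value in the dump, converted to days."""
    millis = int(REBALANCE_TIMEOUT.search(dump).group(1))
    return millis / 86_400_000  # milliseconds per day


if __name__ == "__main__":
    pending = pending_member_threads(GROUP_METADATA_EXCERPT)
    waiting = waiting_threads(JSTACK_EXCERPT)
    print("not yet joined AND waiting:", sorted(pending & waiting))
    print("not yet joined but NOT waiting:", sorted(pending - waiting))
    print("rebalance timeout in days:", round(rebalance_timeout_days(GROUP_METADATA_EXCERPT), 1))
```

For the full data, the same functions could be fed the complete dump and the concatenated jstack files. The rebalanceTimeoutMs value of 2147483647 is the largest signed 32-bit integer, roughly 24.9 days in milliseconds, so a rebalance timeout on the coordinator side seems unlikely to end the wait on its own.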
 


> Kafka stream threads stuck while sending offsets to transaction preventing 
> join group from completing
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8673
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8673
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, streams
>    Affects Versions: 2.2.0
>            Reporter: Varsha Abhinandan
>            Priority: Major
>         Attachments: Screen Shot 2019-07-11 at 12.08.09 PM.png
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
