[ https://issues.apache.org/jira/browse/KAFKA-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660269#comment-13660269 ]

Neha Narkhede edited comment on KAFKA-901 at 5/17/13 3:53 PM:
--------------------------------------------------------------

Changes in the latest patch include -

1. Removed the list of all brokers and included just the alive brokers in the 
update metadata request. So the topic metadata will not include broker 
information for dead brokers.

2. My guess that there was a bug in the update metadata request processing was 
right. The bug doesn't affect the correctness of update metadata, but it does 
delay the communication of new leaders to all brokers. The bug was that we were 
removing a broker going through controlled shutdown from the alive brokers list 
before it was really shut down, so from a client's perspective it takes much 
longer for a new leader to become available. Fixed it to include shutting-down 
brokers in the list of alive brokers (see the sketch after this list).

3. Fixed another bug related to new topic creation. This bug caused the 
controller to not communicate the leaders of newly created topics to all 
brokers, causing metadata requests to fail.

4. Tested this fix with 100s of migration tools sending data to ~400 topics on 
a 7-node cluster. There are ~500 consumers consuming data from this cluster. 
The test continuously bounces the brokers in a rolling-restart fashion. The 
clients notice the new leaders within a few tens of milliseconds in most cases.

5. Also, the queue time for all requests is mostly < 10 ms since metadata 
requests are no longer a bottleneck in the system. The latency of a metadata 
request for ~300 topics has itself dropped from tens of seconds to tens of 
milliseconds.
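
To make points 1-3 a bit more concrete, here is a rough sketch of the broker 
bookkeeping the fix is going for. It is only an illustration in Scala; the 
names (Broker, ControllerContext, liveBrokersUnderlying, shuttingDownBrokerIds, 
brokersForUpdateMetadata) are made up for the sketch and are not the exact 
identifiers in the patch.

    // Illustrative sketch only, not the patch itself.
    case class Broker(id: Int, host: String, port: Int)

    class ControllerContext {
      // Brokers that currently have a live registration in ZooKeeper.
      var liveBrokersUnderlying: Set[Broker] = Set.empty
      // Brokers that have started controlled shutdown but are not gone yet.
      var shuttingDownBrokerIds: Set[Int] = Set.empty

      // For leader election, brokers that are shutting down are skipped ...
      def liveBrokers: Set[Broker] =
        liveBrokersUnderlying.filterNot(b => shuttingDownBrokerIds.contains(b.id))

      // ... but for the update metadata request they still count as alive,
      // since clients can keep talking to them until they really shut down
      // (point 2 above).
      def liveOrShuttingDownBrokers: Set[Broker] = liveBrokersUnderlying
    }

    object ControllerSketch {
      // Point 1: the update metadata request carries only alive (or
      // shutting-down) brokers, so the topic metadata handed to clients never
      // references dead brokers. The same kind of request, sent to all
      // brokers, would also be how the controller tells them about the
      // leaders of newly created topics (point 3).
      def brokersForUpdateMetadata(ctx: ControllerContext): Set[Broker] =
        ctx.liveOrShuttingDownBrokers
    }

The point is that leader election can keep ignoring shutting-down brokers, 
while the metadata handed out to clients keeps advertising them until they are 
actually gone and never advertises dead brokers.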
                
> Kafka server can become unavailable if clients send several metadata requests
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-901
>                 URL: https://issues.apache.org/jira/browse/KAFKA-901
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>            Priority: Blocker
>         Attachments: kafka-901.patch, kafka-901-v2.patch, kafka-901-v4.patch, 
> metadata-request-improvement.patch
>
>
> Currently, if a broker is bounced without controlled shutdown and there are 
> several clients talking to the Kafka cluster, each of the clients realizes 
> the unavailability of leaders for some partitions. This leads to several 
> metadata requests being sent to the Kafka brokers. Since metadata requests 
> are pretty slow, all the I/O threads quickly become busy serving them. This 
> leads to a full request queue, which stalls handling of finished responses 
> since the same network thread handles both requests and responses. In this 
> situation, clients time out on metadata requests and send more metadata 
> requests. This quickly makes the Kafka cluster unavailable.
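
The description boils down to a feedback loop between slow metadata requests 
and a bounded request queue. A toy model of it (plain Scala, not Kafka code; 
the queue size and timings are invented for illustration):

    import java.util.concurrent.ArrayBlockingQueue

    object RequestQueueSketch {
      sealed trait Request
      case object ProduceRequest extends Request
      case object MetadataRequest extends Request // slow to serve in 0.8

      // Bounded queue shared by the network thread and the handler threads.
      val requestQueue = new ArrayBlockingQueue[Request](500)

      // Handler (I/O) thread: if every handler is stuck on slow metadata
      // requests, the queue stops draining.
      def handlerLoop(): Unit =
        while (true) {
          requestQueue.take() match {
            case MetadataRequest => Thread.sleep(10000) // pretend it takes ~10 s
            case ProduceRequest  => ()                  // fast path
          }
        }

      // Network thread: put() blocks once the queue is full, so it also stops
      // sending back finished responses. Clients then time out, retry their
      // metadata requests, and push the cluster further into the hole.
      def networkLoop(incoming: Iterator[Request]): Unit =
        incoming.foreach(requestQueue.put)
    }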

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
