Re: preparing for the 0.8 final release
Hi Jun, any updates on this? Is a final release planned soon? Thanks, Haithem On 13 Sep 2013, at 17:18, Jun Rao wrote: > Hi, Everyone, > > We have been stabilizing the 0.8 branch since the beta1 release. I think we > are getting close to an 0.8 final release. I made an initial list of the > remaining jiras that should be fixed in 0.8. > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22) > > 1. Do people agree with the list? > > 2. If the list is good, could people help contributing/reviewing the > remaining jiras? > > Thanks, > > Jun
[jira] [Commented] (KAFKA-1043) Time-consuming FetchRequest could block other requests in the response queue
[ https://issues.apache.org/jira/browse/KAFKA-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771981#comment-13771981 ] Guozhang Wang commented on KAFKA-1043: -- IMHO the local time spent processing the fetch response is linear in the number of partitions in the request, while the network time spent writing to the socket buffer is not; it depends on whether the data is still in the file cache. Hence, with either the 1) reset-socket-buffer-size or 2) subset-of-topic-partitions-at-a-time approach, we would need to either 1) set the buffer size too small, which is unfair to other requests that do not hit I/O and may cause unnecessary round trips, or 2) fetch too small a subset of topic-partitions at a time, which ends up in the same situation as 1). Capping based on time is better since it provides "fairness", but that seems a little hacky. My reasoning for decoupling the socket processor from network I/O is the following. As we scale up, the principle should be "various clients are isolated from each other". For fetch requests that means "if you request old data from many topic partitions, only your own request should take a long time; other requests should not be impacted". Today a request's lifetime on the server is socket -> network processor -> request handler -> (possible) disk I/O due to flush for produce requests -> socket processor -> network I/O, and one way to enable isolation is to make sure that no hop on this path is single-threaded. Today socket -> network processor goes through the acceptor, network processor -> request handler goes through the request queue, and request handler -> (possible) disk I/O due to flush for produce requests was fixed in KAFKA-615; but socket processor -> network I/O is still coupled, and fixes to issues resulting from this coupling end up catering to the "worst case", which does not follow the "isolation" principle. I agree this is rather complex and would be a long-term thing. > Time-consuming FetchRequest could block other requests in the response queue > --- > > Key: KAFKA-1043 > URL: https://issues.apache.org/jira/browse/KAFKA-1043 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.1 >Reporter: Guozhang Wang >Assignee: Guozhang Wang > Fix For: 0.8, 0.8.1 > > > Since in SocketServer the processor that takes a request is also responsible > for writing the response for that request, each processor owns its own > response queue. If a FetchRequest takes an irregularly long time to write to > the channel buffer, it blocks all other responses in that queue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
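In sketch form, the coupling being discussed looks roughly like the following (illustrative Scala only; the class and method names here are made up and are not the actual kafka.network.SocketServer code). Each processor owns a response queue and is also the thread that writes those responses to the sockets, so one response that is slow to write holds up everything queued behind it:

    import java.util.concurrent.LinkedBlockingQueue

    class Response(val requestId: Int, val payload: Array[Byte])

    class Processor(val id: Int) extends Runnable {
      // one response queue per processor, as described in the issue
      val responseQueue = new LinkedBlockingQueue[Response]()

      override def run(): Unit = {
        while (true) {
          val response = responseQueue.take()
          // if this write is slow (e.g. a huge FetchResponse served from disk),
          // every other response sitting in responseQueue waits behind it
          writeToSocket(response)
        }
      }

      private def writeToSocket(r: Response): Unit = {
        // stand-in for the actual channel write
      }
    }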
Rebalancing failures during upgrade to latest code
The latest consumer changes to read data from Zookeeper during rebalance have made the consumer rebalance code incompatible with older versions (making rolling upgrades without downtime hard). The problem relates to how partitions are ordered. The old code seems to have returned the partitions sorted: ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic produce-indexable-views with consumers: ... the new code instead uses: ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, 2, 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic produce-indexable-views with consumers: ... This causes new consumers and old consumers to claim the same partitions. I realize that this may not be a big deal (although painful for us since it disagrees with our deployment automation) since the code wasn't officially released, but it seems simple enough to sort the partitions if you'd take such a patch. /Sam
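A minimal sketch of the fix being suggested (hypothetical helper, not the actual consumer rebalance code): sort the partition ids read from ZooKeeper before handing them to the assignment algorithm, so consumers running the old and the new code compute the same ordering during a rolling upgrade.

    // Hypothetical helper; ZooKeeper hands back the partitions as an unordered set.
    def partitionsForTopic(partitionIdsFromZk: Set[Int]): Seq[Int] =
      partitionIdsFromZk.toSeq.sorted

    // e.g. the set {0, 5, 10, 14, 1, ...} becomes List(0, 1, 2, ..., 19),
    // matching the sorted ArrayBuffer produced by the old code shown above.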
Re: preparing for the 0.8 final release
Based on the list above, we may be able to clear up the remaining jiras in roughly 3 weeks, so we can plan for a final release in a month or so. We would appreciate contributions and patches to close out the remaining jiras. Thanks, Neha On Thu, Sep 19, 2013 at 4:15 AM, Haithem Jarraya wrote: > Hi Jun, > > any updates on this? Is a final release planned soon? > > Thanks, > > Haithem > > On 13 Sep 2013, at 17:18, Jun Rao wrote: > > > Hi, Everyone, > > > > We have been stabilizing the 0.8 branch since the beta1 release. I think > we > > are getting close to an 0.8 final release. I made an initial list of the > > remaining jiras that should be fixed in 0.8. > > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%20%220.8%22%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22) > > > > 1. Do people agree with the list? > > > > 2. If the list is good, could people help contributing/reviewing the > > remaining jiras? > > > > Thanks, > > > > Jun > >
Re: Rebalancing failures during upgrade to latest code
Agreed. This is a regression and is not easy to reason about. This is a side effect of reading the partitions as a set from zookeeper. Please can you file a JIRA to get this fixed? Feel free to upload a patch as well. Thanks, Neha On Thu, Sep 19, 2013 at 8:17 AM, Sam Meder wrote: > The latest consumer changes to read data from Zookeeper during rebalance > have made the consumer rebalance code incompatible with older versions > (making rolling upgrades without downtime hard). The problem relates to how > partitions are ordered. The old code seems to have returned the partitions > sorted: > > ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, > 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic > produce-indexable-views with consumers: ... > > the new code instead uses: > > ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, > 2, 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic > produce-indexable-views with consumers: ... > > This causes new consumers and old consumers to claim the same partitions. > I realize that this may not be a big deal (although painful for us since it > disagrees with our deployment automation) since the code wasn't officially > released, but it seems simple enough to sort the partitions if you'd take > such a patch. > > /Sam > > > >
Re: Rebalancing failures during upgrade to latest code
Hello Sam, I agree that with this change we should still sort the partition list before handing it to the assignment algorithm. I will try to make a follow-up patch to fix this. Guozhang On Thu, Sep 19, 2013 at 8:17 AM, Sam Meder wrote: > The latest consumer changes to read data from Zookeeper during rebalance > have made the consumer rebalance code incompatible with older versions > (making rolling upgrades without downtime hard). The problem relates to how > partitions are ordered. The old code seems to have returned the partitions > sorted: > > ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, > 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic > produce-indexable-views with consumers: ... > > the new code instead uses: > > ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, > 2, 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic > produce-indexable-views with consumers: ... > > This causes new consumers and old consumers to claim the same partitions. > I realize that this may not be a big deal (although painful for us since it > disagrees with our deployment automation) since the code wasn't officially > released, but it seems simple enough to sort the partitions if you'd take > such a patch. > > /Sam > > > > -- -- Guozhang
[jira] Subscription: outstanding kafka patches
Issue Subscription
Filter: outstanding kafka patches (68 issues)
The list of outstanding kafka patches
Subscriber: kafka-mailing-list

Key Summary
KAFKA-1049 Encoder implementations are required to provide an undocumented constructor. https://issues.apache.org/jira/browse/KAFKA-1049
KAFKA-1042 Fix segment flush logic https://issues.apache.org/jira/browse/KAFKA-1042
KAFKA-1032 Messages sent to the old leader will be lost on broker GC resulted failure https://issues.apache.org/jira/browse/KAFKA-1032
KAFKA-1020 Remove getAllReplicasOnBroker from KafkaController https://issues.apache.org/jira/browse/KAFKA-1020
KAFKA-1012 Implement an Offset Manager and hook offset requests to it https://issues.apache.org/jira/browse/KAFKA-1012
KAFKA-1011 Decompression and re-compression on MirrorMaker could result in messages being dropped in the pipeline https://issues.apache.org/jira/browse/KAFKA-1011
KAFKA-1008 Unmap before resizing https://issues.apache.org/jira/browse/KAFKA-1008
KAFKA-1005 kafka.perf.ConsumerPerformance not shutting down consumer https://issues.apache.org/jira/browse/KAFKA-1005
KAFKA-1004 Handle topic event for trivial whitelist topic filters https://issues.apache.org/jira/browse/KAFKA-1004
KAFKA-998 Producer should not retry on non-recoverable error codes https://issues.apache.org/jira/browse/KAFKA-998
KAFKA-997 Provide a strict verification mode when reading configuration properties https://issues.apache.org/jira/browse/KAFKA-997
KAFKA-996 Capitalize first letter for log entries https://issues.apache.org/jira/browse/KAFKA-996
KAFKA-984 Avoid a full rebalance in cases when a new topic is discovered but container/broker set stay the same https://issues.apache.org/jira/browse/KAFKA-984
KAFKA-982 Logo for Kafka https://issues.apache.org/jira/browse/KAFKA-982
KAFKA-981 Unable to pull Kafka binaries with Ivy https://issues.apache.org/jira/browse/KAFKA-981
KAFKA-976 Order-Preserving Mirror Maker Testcase https://issues.apache.org/jira/browse/KAFKA-976
KAFKA-967 Use key range in ProducerPerformance https://issues.apache.org/jira/browse/KAFKA-967
KAFKA-956 High-level consumer fails to check topic metadata response for errors https://issues.apache.org/jira/browse/KAFKA-956
KAFKA-946 Kafka Hadoop Consumer fails when verifying message checksum https://issues.apache.org/jira/browse/KAFKA-946
KAFKA-917 Expose zk.session.timeout.ms in console consumer https://issues.apache.org/jira/browse/KAFKA-917
KAFKA-885 sbt package builds two kafka jars https://issues.apache.org/jira/browse/KAFKA-885
KAFKA-881 Kafka broker not respecting log.roll.hours https://issues.apache.org/jira/browse/KAFKA-881
KAFKA-873 Consider replacing zkclient with curator (with zkclient-bridge) https://issues.apache.org/jira/browse/KAFKA-873
KAFKA-868 System Test - add test case for rolling controlled shutdown https://issues.apache.org/jira/browse/KAFKA-868
KAFKA-863 System Test - update 0.7 version of kafka-run-class.sh for Migration Tool test cases https://issues.apache.org/jira/browse/KAFKA-863
KAFKA-859 support basic auth protection of mx4j console https://issues.apache.org/jira/browse/KAFKA-859
KAFKA-855 Ant+Ivy build for Kafka https://issues.apache.org/jira/browse/KAFKA-855
KAFKA-854 Upgrade dependencies for 0.8 https://issues.apache.org/jira/browse/KAFKA-854
KAFKA-815 Improve SimpleConsumerShell to take in a max messages config option https://issues.apache.org/jira/browse/KAFKA-815
KAFKA-745 Remove getShutdownReceive() and other kafka specific code from the RequestChannel https://issues.apache.org/jira/browse/KAFKA-745
KAFKA-735 Add looping and JSON output for ConsumerOffsetChecker https://issues.apache.org/jira/browse/KAFKA-735
KAFKA-717 scala 2.10 build support https://issues.apache.org/jira/browse/KAFKA-717
KAFKA-686 0.8 Kafka broker should give a better error message when running against 0.7 zookeeper https://issues.apache.org/jira/browse/KAFKA-686
KAFKA-674 Clean Shutdown Testing - Log segments checksums mismatch https://issues.apache.org/jira/browse/KAFKA-674
KAFKA-652 Create testcases for clean shut-down https://issues.apache.org/jira/browse/KAFKA-652
KAFKA-649 Cleanup log4j logging https://issues.apache.org/jira/browse/KAFKA-649
KAFKA-645 Create a shell script to run System Test with DEBUG details and "tee" console output to a file https://issues.apache.org/jira/browse/KAFKA-645
KAFKA-598 decouple fetch size from max message size https://issues.apache.org/jira/browse/KAFKA-598
KAFKA-583 SimpleCons
[jira] [Commented] (KAFKA-1043) Time-consuming FetchRequest could block other requests in the response queue
[ https://issues.apache.org/jira/browse/KAFKA-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771994#comment-13771994 ] Neha Narkhede commented on KAFKA-1043: -- As Sriram said, we no longer block on a full socket buffer. The problem is really large fetch requests, like those coming from a lagging mirror maker, hogging the network thread by writing as much as possible while the socket buffer is not full. This basically increases the response send time for all other requests whose responses are queued up behind this large fetch request. This causes a downward spiral that takes quite some time to recover from due to the filled-up response queues. We could cap based on size, where we yield the network thread after n MBs are written on the wire, giving the rest of the smaller responses a chance to get written to the socket. This will ensure that one or a few large fetch requests don't penalize several other smaller requests. > Time-consuming FetchRequest could block other requests in the response queue > --- > > Key: KAFKA-1043 > URL: https://issues.apache.org/jira/browse/KAFKA-1043 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.1 >Reporter: Guozhang Wang >Assignee: Guozhang Wang > Fix For: 0.8, 0.8.1 > > > Since in SocketServer the processor that takes a request is also responsible > for writing the response for that request, each processor owns its own > response queue. If a FetchRequest takes an irregularly long time to write to > the channel buffer, it blocks all other responses in that queue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
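A rough sketch of the size-based cap described above (illustrative Scala with made-up types; not the actual SocketServer code): write at most a fixed number of bytes of one response per turn of the network thread, then re-queue the remainder so the smaller responses behind it are not starved.

    import java.nio.channels.GatheringByteChannel
    import java.util.ArrayDeque

    // Illustrative stand-in for a partially written response.
    trait PartialSend {
      def complete: Boolean
      def writeTo(channel: GatheringByteChannel): Int // bytes written by this call
    }

    def writeWithCap(send: PartialSend,
                     channel: GatheringByteChannel,
                     pending: ArrayDeque[PartialSend],
                     maxBytesPerTurn: Long = 1L << 20): Unit = {
      var written = 0L
      while (!send.complete && written < maxBytesPerTurn)
        written += send.writeTo(channel)
      if (!send.complete)
        pending.addLast(send) // yield the network thread; finish this send on a later pass
    }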
[jira] [Updated] (KAFKA-1061) Break-down sendTime to multipleSendTime
[ https://issues.apache.org/jira/browse/KAFKA-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-1061: - Fix Version/s: 0.8.1 > Break-down sendTime to multipleSendTime > --- > > Key: KAFKA-1061 > URL: https://issues.apache.org/jira/browse/KAFKA-1061 > Project: Kafka > Issue Type: Bug >Reporter: Guozhang Wang >Assignee: Guozhang Wang > Fix For: 0.8.1 > > > After KAFKA-1060 is done we would also like to break down the sendTime into each > MultiSend's time and its corresponding send data size. > This is related to KAFKA-1043 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-1060) Break-down sendTime into responseQueueTime and the real sendTime
[ https://issues.apache.org/jira/browse/KAFKA-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-1060: - Fix Version/s: 0.8.1 > Break-down sendTime into responseQueueTime and the real sendTime > > > Key: KAFKA-1060 > URL: https://issues.apache.org/jira/browse/KAFKA-1060 > Project: Kafka > Issue Type: Bug >Reporter: Guozhang Wang >Assignee: Guozhang Wang > Fix For: 0.8.1 > > > Currently the responseSendTime in updateRequestMetrics actually contains two > portions, the responseQueueTime and the real SendTime. We would like to > distinguish these two cases. > This is related to KAFKA-1043 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (KAFKA-1060) Break-down sendTime into responseQueueTime and the real sendTime
Guozhang Wang created KAFKA-1060: Summary: Break-down sendTime into responseQueueTime and the real sendTime Key: KAFKA-1060 URL: https://issues.apache.org/jira/browse/KAFKA-1060 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Assignee: Guozhang Wang Currently the responseSendTime in updateRequestMetrics actually contains two portions, the responseQueueTime and the real SendTime. We would like to distinguish these two cases This is related to KAFKA-1043 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
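In sketch form, the split being proposed could look like the following (hypothetical field and metric names, for illustration only; not the actual RequestChannel/updateRequestMetrics code): timestamp a response when it is placed on the response queue and again when the network thread dequeues it, so the queue wait and the actual socket write are reported as two separate metrics instead of one combined sendTime.

    case class TimedResponse(payload: Array[Byte],
                             enqueueTimeMs: Long,
                             var dequeueTimeMs: Long = -1L,
                             var sendCompleteTimeMs: Long = -1L) {
      def responseQueueTimeMs: Long = dequeueTimeMs - enqueueTimeMs        // time waiting in the response queue
      def responseSendTimeMs: Long  = sendCompleteTimeMs - dequeueTimeMs   // time actually writing to the socket
    }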
[jira] [Created] (KAFKA-1061) Break-down sendTime to multipleSendTime
Guozhang Wang created KAFKA-1061: Summary: Break-down sendTime to multipleSendTime Key: KAFKA-1061 URL: https://issues.apache.org/jira/browse/KAFKA-1061 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Assignee: Guozhang Wang After KAFKA-1060 is done we would also like to break down the sendTime into each MultiSend's time and its corresponding send data size. This is related to KAFKA-1043 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-1060) Break-down sendTime into responseQueueTime and the real sendTime
[ https://issues.apache.org/jira/browse/KAFKA-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-1060: - Description: Currently the responseSendTime in updateRequestMetrics actually contains two portions, the responseQueueTime and the real SendTime. We would like to distinguish these two cases. This is related to KAFKA-1043 was: Currently the responseSendTime in updateRequestMetrics actually contains two portions, the responseQueueTime and the real SendTime. We would like to distinguish these two cases This is related to KAFKA-1043 > Break-down sendTime into responseQueueTime and the real sendTime > > > Key: KAFKA-1060 > URL: https://issues.apache.org/jira/browse/KAFKA-1060 > Project: Kafka > Issue Type: Bug >Reporter: Guozhang Wang >Assignee: Guozhang Wang > > Currently the responseSendTime in updateRequestMetrics actually contains two > portions, the responseQueueTime and the real SendTime. We would like to > distinguish these two cases. > This is related to KAFKA-1043 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (KAFKA-1018) tidy up the POM from what feedback has come from the 0.8 beta and publishing to maven
[ https://issues.apache.org/jira/browse/KAFKA-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772450#comment-13772450 ] Neha Narkhede commented on KAFKA-1018: -- [~joestein] This is marked for the 0.8 final release. Do you think you could help look into this? > tidy up the POM from what feedback has come from the 0.8 beta and publishing > to maven > - > > Key: KAFKA-1018 > URL: https://issues.apache.org/jira/browse/KAFKA-1018 > Project: Kafka > Issue Type: Bug >Reporter: Joe Stein > Fix For: 0.8 > > > from Chris Riccomini > 1. Maven central can't resolve it properly (POM is different from Apache > release). Have to use Apache release repo directly to get things to work. > 2. Exclusions must be manually applied even though they exist in Kafka's POM > already. I think Maven can handle this automatically, if the POM is done > right. > 3. Weird parent block in Kafka POMs that points to org.apache. > 4. Would be nice to publish kafka-test jars as well. > 5. Would be nice to have SNAPSHOT releases off of trunk using a Hudson job. > Our hypothesis regarding the first issue is that it was caused by duplicate > publishing during testing, and it should go away in the future. > Regarding number 2, I have to explicitly exclude the following when depending > on Kafka: > exclude module: 'jms' > exclude module: 'jmxtools' > exclude module: 'jmxri' > I believe these just need to be excluded from the appropriate jars in the > actual SBT build file, to fix this issue. I see JMS is excluded from ZK, but > it's probably being pulled in from somewhere else, anyway. > Regarding number 3, it is indeed listed as something to do on the Apache > publication page (http://www.apache.org/dev/publishing-maven-artifacts.html). > I can't find an example of anyone using it, but it doesn't seem to be doing > any harm. > Also, regarding your intransitive() call, that is disabling ALL dependencies > not just the exclusions, I believe. I think that the "proper" way to do that > would be to do what I've done: exclude("jms", "jmxtools", "jmxri"). > Regardless, fixing number 2, above, should mean that intransitive()/exclude() > are not required. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
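For anyone hitting the exclusion issue described in point 2, a minimal sbt sketch of the manual workaround (the group ids are the coordinates these artifacts are usually published under and are an assumption here, not copied from Kafka's POM; adjust the Kafka artifact name and version to the release you actually depend on):

    // build.sbt sketch, assumed coordinates
    libraryDependencies += ("org.apache.kafka" % "kafka_2.8.0" % "0.8.0-beta1")
      .exclude("javax.jms", "jms")
      .exclude("com.sun.jdmk", "jmxtools")
      .exclude("com.sun.jmx", "jmxri")

Fixing the exclusions in Kafka's own published POM, as the description suggests, would make this per-consumer workaround unnecessary.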
[jira] [Commented] (KAFKA-956) High-level consumer fails to check topic metadata response for errors
[ https://issues.apache.org/jira/browse/KAFKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772452#comment-13772452 ] Neha Narkhede commented on KAFKA-956: - [~smeder] KAFKA-1030 is now checked in and the problem reported in this JIRA should be fixed. Can you confirm that and close this JIRA? > High-level consumer fails to check topic metadata response for errors > - > > Key: KAFKA-956 > URL: https://issues.apache.org/jira/browse/KAFKA-956 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.8 >Reporter: Sam Meder >Assignee: Neha Narkhede >Priority: Blocker > Fix For: 0.8 > > Attachments: consumer_metadata_fetch.patch > > > In our environment we noticed that consumers would sometimes hang when > started too close to starting the Kafka server. I tracked this down and it > seems to be related to some code in rebalance > (ZookeeperConsumerConnector.scala). In particular the following code seems > problematic: > val topicsMetadata = > ClientUtils.fetchTopicMetadata(myTopicThreadIdsMap.keySet, > brokers, > config.clientId, > > config.socketTimeoutMs, > > correlationId.getAndIncrement).topicsMetadata > val partitionsPerTopicMap = new mutable.HashMap[String, Seq[Int]] > topicsMetadata.foreach(m => { > val topic = m.topic > val partitions = m.partitionsMetadata.map(m1 => m1.partitionId) > partitionsPerTopicMap.put(topic, partitions) > }) > The response is never checked for error, so may not actually contain any > partition info! Rebalance goes its merry way, but doesn't know about any > partitions so never assigns them... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
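For context, the kind of check the report asks for looks roughly like this, continuing the snippet quoted in the issue (a hedged sketch only; the actual fix went in via KAFKA-1030 and may differ, and ErrorMapping plus the warn helper are assumed to be available from the surrounding consumer class):

    topicsMetadata.foreach { m =>
      if (m.errorCode == ErrorMapping.NoError)
        partitionsPerTopicMap.put(m.topic, m.partitionsMetadata.map(_.partitionId))
      else
        // don't silently build an empty partition map; surface the error so the
        // metadata fetch can be retried instead of leaving the consumer with no partitions
        warn("Topic metadata for %s returned error code %d, will retry".format(m.topic, m.errorCode))
    }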
[jira] [Commented] (KAFKA-1008) Unmap before resizing
[ https://issues.apache.org/jira/browse/KAFKA-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772459#comment-13772459 ] Neha Narkhede commented on KAFKA-1008: -- ping [~lizziew], [~jkreps]. Could you address Jun's review comments and see if we can resolve this JIRA? This is marked for the 0.8 final release > Unmap before resizing > - > > Key: KAFKA-1008 > URL: https://issues.apache.org/jira/browse/KAFKA-1008 > Project: Kafka > Issue Type: Bug > Components: core, log >Affects Versions: 0.8 > Environment: Windows, Linux, Mac OS >Reporter: Elizabeth Wei >Assignee: Jay Kreps > Labels: patch > Fix For: 0.8 > > Attachments: KAFKA-0.8-1008-v7.patch, KAFKA-1008-v6.patch, > KAFKA-trunk-1008-v7.patch, unmap-v5.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > While I was studying how MappedByteBuffer works, I saw a sharing runtime > exception on Windows. I applied what I learned to generate a patch which uses > an internal open JDK API to solve this problem. > Following Jay's advice, I made a helper method called tryUnmap(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
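For background on the internal API mentioned above, one common way to force-unmap a MappedByteBuffer on the Oracle/OpenJDK runtimes of that era is via sun.nio.ch.DirectBuffer and its sun.misc.Cleaner. The sketch below illustrates the general technique only and is not necessarily what the attached patches do:

    import java.nio.MappedByteBuffer

    def tryUnmap(buffer: MappedByteBuffer): Unit = {
      try {
        // Cleaning releases the file mapping immediately instead of waiting for GC,
        // which avoids the sharing error seen when resizing a mapped index file on Windows.
        val cleaner = buffer.asInstanceOf[sun.nio.ch.DirectBuffer].cleaner()
        if (cleaner != null) cleaner.clean()
      } catch {
        case _: Throwable => () // best effort; fall back to letting GC unmap the buffer eventually
      }
    }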
[jira] [Updated] (KAFKA-956) High-level consumer fails to check topic metadata response for errors
[ https://issues.apache.org/jira/browse/KAFKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Meder updated KAFKA-956: Resolution: Fixed Status: Resolved (was: Patch Available) I can confirm that this issue was fixed by KAFKA-1030 > High-level consumer fails to check topic metadata response for errors > - > > Key: KAFKA-956 > URL: https://issues.apache.org/jira/browse/KAFKA-956 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.8 >Reporter: Sam Meder >Assignee: Neha Narkhede >Priority: Blocker > Fix For: 0.8 > > Attachments: consumer_metadata_fetch.patch > > > In our environment we noticed that consumers would sometimes hang when > started too close to starting the Kafka server. I tracked this down and it > seems to be related to some code in rebalance > (ZookeeperConsumerConnector.scala). In particular the following code seems > problematic: > val topicsMetadata = > ClientUtils.fetchTopicMetadata(myTopicThreadIdsMap.keySet, > brokers, > config.clientId, > > config.socketTimeoutMs, > > correlationId.getAndIncrement).topicsMetadata > val partitionsPerTopicMap = new mutable.HashMap[String, Seq[Int]] > topicsMetadata.foreach(m => { > val topic = m.topic > val partitions = m.partitionsMetadata.map(m1 => m1.partitionId) > partitionsPerTopicMap.put(topic, partitions) > }) > The response is never checked for error, so may not actually contain any > partition info! Rebalance goes its merry way, but doesn't know about any > partitions so never assigns them... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (KAFKA-956) High-level consumer fails to check topic metadata response for errors
[ https://issues.apache.org/jira/browse/KAFKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Meder closed KAFKA-956. --- > High-level consumer fails to check topic metadata response for errors > - > > Key: KAFKA-956 > URL: https://issues.apache.org/jira/browse/KAFKA-956 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.8 >Reporter: Sam Meder >Assignee: Neha Narkhede >Priority: Blocker > Fix For: 0.8 > > Attachments: consumer_metadata_fetch.patch > > > In our environment we noticed that consumers would sometimes hang when > started too close to starting the Kafka server. I tracked this down and it > seems to be related to some code in rebalance > (ZookeeperConsumerConnector.scala). In particular the following code seems > problematic: > val topicsMetadata = > ClientUtils.fetchTopicMetadata(myTopicThreadIdsMap.keySet, > brokers, > config.clientId, > > config.socketTimeoutMs, > > correlationId.getAndIncrement).topicsMetadata > val partitionsPerTopicMap = new mutable.HashMap[String, Seq[Int]] > topicsMetadata.foreach(m => { > val topic = m.topic > val partitions = m.partitionsMetadata.map(m1 => m1.partitionId) > partitionsPerTopicMap.put(topic, partitions) > }) > The response is never checked for error, so may not actually contain any > partition info! Rebalance goes its merry way, but doesn't know about any > partitions so never assigns them... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (KAFKA-1062) Reading topic metadata from zookeeper leads to incompatible ordering of partition list
Sam Meder created KAFKA-1062: Summary: Reading topic metadata from zookeeper leads to incompatible ordering of partition list Key: KAFKA-1062 URL: https://issues.apache.org/jira/browse/KAFKA-1062 Project: Kafka Issue Type: Bug Components: consumer Affects Versions: 0.8 Reporter: Sam Meder Assignee: Neha Narkhede The latest consumer changes to read data from Zookeeper during rebalance have made the consumer rebalance code incompatible with older versions (making rolling upgrades without downtime hard). The problem relates to how partitions are ordered. The old code seems to have returned the partitions sorted: ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic produce-indexable-views with consumers: ... the new code instead uses: ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, 2, 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic produce-indexable-views with consumers: ... This causes new consumers and old consumers to claim the same partitions. I realize that this may not be a big deal (although painful for us since it disagrees with our deployment automation) since the code wasn't officially released, but it seems simple enough to sort the partitions if you'd take such a patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-1062) Reading topic metadata from zookeeper leads to incompatible ordering of partition list
[ https://issues.apache.org/jira/browse/KAFKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Meder updated KAFKA-1062: - Attachment: sorted.patch > Reading topic metadata from zookeeper leads to incompatible ordering of > partition list > -- > > Key: KAFKA-1062 > URL: https://issues.apache.org/jira/browse/KAFKA-1062 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.8 >Reporter: Sam Meder >Assignee: Neha Narkhede > Attachments: sorted.patch > > > The latest consumer changes to read data from Zookeeper during rebalance have > made the consumer rebalance code incompatible with older versions (making > rolling upgrades without downtime hard). The problem relates to how > partitions are ordered. The old code seems to have returned the partitions > sorted: > ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, > 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic > produce-indexable-views with consumers: ... > the new code instead uses: > ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, 2, > 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic produce-indexable-views > with consumers: ... > This causes new consumers and old consumers to claim the same partitions. I > realize that this may not be a big deal (although painful for us since it > disagrees with our deployment automation) since the code wasn't officially > released, but it seems simple enough to sort the partitions if you'd take > such a patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-1062) Reading topic metadata from zookeeper leads to incompatible ordering of partition list
[ https://issues.apache.org/jira/browse/KAFKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Meder updated KAFKA-1062: - Status: Patch Available (was: Open) > Reading topic metadata from zookeeper leads to incompatible ordering of > partition list > -- > > Key: KAFKA-1062 > URL: https://issues.apache.org/jira/browse/KAFKA-1062 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.8 >Reporter: Sam Meder >Assignee: Neha Narkhede > Attachments: sorted.patch > > > The latest consumer changes to read data from Zookeeper during rebalance have > made the consumer rebalance code incompatible with older versions (making > rolling upgrades without downtime hard). The problem relates to how > partitions are ordered. The old code seems to have returned the partitions > sorted: > ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, > 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic > produce-indexable-views with consumers: ... > the new code instead uses: > ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, 2, > 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic produce-indexable-views > with consumers: ... > This causes new consumers and old consumers to claim the same partitions. I > realize that this may not be a big deal (although painful for us since it > disagrees with our deployment automation) since the code wasn't officially > released, but it seems simple enough to sort the partitions if you'd take > such a patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Rebalancing failures during upgrade to latest code
Filed KAFKA-1062, including trivial patch. /Sam On Sep 19, 2013, at 5:52 PM, Neha Narkhede wrote: > Agreed. This is a regression and is not easy to reason about. This is a > side effect of reading the partitions as a set from zookeeper. Please can > you file a JIRA to get this fixed? Feel free to upload a patch as well. > > Thanks, > Neha > > > On Thu, Sep 19, 2013 at 8:17 AM, Sam Meder wrote: > >> The latest consumer changes to read data from Zookeeper during rebalance >> have made the consumer rebalance code incompatible with older versions >> (making rolling upgrades without downtime hard). The problem relates to how >> partitions are ordered. The old code seems to have returned the partitions >> sorted: >> >> ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, >> 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic >> produce-indexable-views with consumers: ... >> >> the new code instead uses: >> >> ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, >> 2, 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic >> produce-indexable-views with consumers: ... >> >> This causes new consumers and old consumers to claim the same partitions. >> I realize that this may not be a big deal (although painful for us since it >> disagrees with our deployment automation) since the code wasn't officially >> released, but it seems simple enough to sort the partitions if you'd take >> such a patch. >> >> /Sam >> >> >> >>
[jira] [Commented] (KAFKA-1062) Reading topic metadata from zookeeper leads to incompatible ordering of partition list
[ https://issues.apache.org/jira/browse/KAFKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772634#comment-13772634 ] Guozhang Wang commented on KAFKA-1062: -- +1. Thanks for the patch. Guozhang > Reading topic metadata from zookeeper leads to incompatible ordering of > partition list > -- > > Key: KAFKA-1062 > URL: https://issues.apache.org/jira/browse/KAFKA-1062 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.8 >Reporter: Sam Meder >Assignee: Neha Narkhede > Attachments: sorted.patch > > > The latest consumer changes to read data from Zookeeper during rebalance have > made the consumer rebalance code incompatible with older versions (making > rolling upgrades without downtime hard). The problem relates to how > partitions are ordered. The old code seems to have returned the partitions > sorted: > ... rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, > 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) for topic > produce-indexable-views with consumers: ... > the new code instead uses: > ... rebalancing the following partitions: List(0, 5, 10, 14, 1, 6, 9, 13, 2, > 17, 12, 7, 3, 18, 16, 11, 8, 19, 4, 15) for topic produce-indexable-views > with consumers: ... > This causes new consumers and old consumers to claim the same partitions. I > realize that this may not be a big deal (although painful for us since it > disagrees with our deployment automation) since the code wasn't officially > released, but it seems simple enough to sort the partitions if you'd take > such a patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira