Consumer Rebalance issue and follow-up on KAFKA-9752
Hi, This is regards to https://issues.apache.org/jira/browse/KAFKA-9752 issue where Consumer rebalance can be stuck after new member timeout with old JoinGroup version. We have taken the fix, but now see a different issue . Earlier the ConsumerGroup was stuck in "PendingRebalance" state , which is not happening now , but now I see members not able to join the group . I see below logs where members are being removed after session timeout. [2020-08-09 09:29:00,558] INFO [GroupCoordinator 5]: Pending member XXX in group YYY has been removed after session timeout expiration. (kafka.coordinator.group.GroupCoordinator) [2020-08-09 09:29:55,856] INFO [GroupCoordinator 5]: Pending member ZZZ in group YYY has been removed after session timeout expiration. (kafka.coordinator.group.GroupCoordinator) As I see the GroupCoridinator code, when new member tries to join for first time, GroupCoridinator also schedule addPendingMemberExpiration (in doUnknownJoinGroup call ) with *SessionTimeOut*… If for some reason , addMemberAndRebalance call takes longer (longer than SessionTimeOut), and members are still in “Pending” state, the above addPendingMemberExpiration can remove the pending member and they cannot join the group. I think that is what is happening. When for new member , Coordinator is already setting a timeout in completeAndScheduleNextExpiration(group, member, *NewMemberJoinTimeoutMs* ) What is the requirement for one more addPendingMemberExpiration task for new members ? Is this a possible bug ?
Kafka Simple Consumer API for 0.9
Hi, Is the Simple Consumer API will change in Kafka 0.9 ? I can see a Consumer Re-design approach for Kafka 0.9 , which I believe will not impact any client written using Simple Consumer API . Is that correct ? Regards, Dibyendu
Re: Kafka/Hadoop consumers and producers
Hi Ken, I am also working on making the Camus fit for Non Avro message for our requirement. I see you mentioned about this patch (https://github.com/linkedin/camus/commit/87917a2aea46da9d21c8f67129f6463af52f7aa8) which supports custom data writer for Camus. But this patch is not pulled into camus-kafka-0.8 branch. Is there any plan for doing the same ? Regards, Dibyendu
[jira] [Commented] (KAFKA-3083) a soft failure in controller may leave a topic partition in an inconsistent state
[ https://issues.apache.org/jira/browse/KAFKA-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974338#comment-15974338 ] Dibyendu Bhattacharya commented on KAFKA-3083: -- Hi [~junrao] We also had this issue in Kafka 0.9.x. Any idea when this can be fixed. > a soft failure in controller may leave a topic partition in an inconsistent > state > - > > Key: KAFKA-3083 > URL: https://issues.apache.org/jira/browse/KAFKA-3083 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.9.0.0 >Reporter: Jun Rao >Assignee: Mayuresh Gharat > Labels: reliability > > The following sequence can happen. > 1. Broker A is the controller and is in the middle of processing a broker > change event. As part of this process, let's say it's about to shrink the isr > of a partition. > 2. Then broker A's session expires and broker B takes over as the new > controller. Broker B sends the initial leaderAndIsr request to all brokers. > 3. Broker A continues by shrinking the isr of the partition in ZK and sends > the new leaderAndIsr request to the broker (say C) that leads the partition. > Broker C will reject this leaderAndIsr since the request comes from a > controller with an older epoch. Now we could be in a situation that Broker C > thinks the isr has all replicas, but the isr stored in ZK is different. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (KAFKA-948) ISR list in LeaderAndISR path not updated for partitions when Broker (which is not leader) is down
Dibyendu Bhattacharya created KAFKA-948: --- Summary: ISR list in LeaderAndISR path not updated for partitions when Broker (which is not leader) is down Key: KAFKA-948 URL: https://issues.apache.org/jira/browse/KAFKA-948 Project: Kafka Issue Type: Bug Components: controller Affects Versions: 0.8 Reporter: Dibyendu Bhattacharya Assignee: Neha Narkhede When the broker which is the leader for a partition is down, the ISR list in the LeaderAndISR path is updated. But if the broker , which is not a leader of the partition is down, the ISR list is not getting updated. This is an issues because ISR list contains the stale entry. This issue I found in kafka-0.8.0-beta1-candidate1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (KAFKA-948) ISR list in LeaderAndISR path not updated for partitions when Broker (which is not leader) is down
[ https://issues.apache.org/jira/browse/KAFKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688871#comment-13688871 ] Dibyendu Bhattacharya commented on KAFKA-948: - I have tested a Fix, which seems to be working fine. For ReplicaStateMachine.scala, the handleStateChange method for offlineReplica case, the currLeaderIsrAndControllerEpoch.leaderAndIsr.isr contains the stale entry. A call to updateLeaderAndIsrCache() before will solve this issue. Shall I go ahead and make a pull request with this fix ? Dibyendu > ISR list in LeaderAndISR path not updated for partitions when Broker (which > is not leader) is down > -- > > Key: KAFKA-948 > URL: https://issues.apache.org/jira/browse/KAFKA-948 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8 > Reporter: Dibyendu Bhattacharya >Assignee: Neha Narkhede > > When the broker which is the leader for a partition is down, the ISR list in > the LeaderAndISR path is updated. But if the broker , which is not a leader > of the partition is down, the ISR list is not getting updated. This is an > issues because ISR list contains the stale entry. > This issue I found in kafka-0.8.0-beta1-candidate1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira