[ https://issues.apache.org/jira/browse/KAFKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058568#comment-14058568 ]
Andrey Stepachev edited comment on KAFKA-1530 at 7/11/14 9:00 AM:
------------------------------------------------------------------

Looks like [~ovgolovin]'s problem with the wrong replica election can be fixed by adding a notion of min-replicas somewhere around this code: https://github.com/apache/kafka/blob/3c4ca854fd2da5e5fcecdaf0856a38a9ebe4763c/core/src/main/scala/kafka/cluster/Partition.scala#L165. We could restrict leader election/re-election to partitions whose ISR has the configured size (see the rough sketch at the end of this message). As for [~renew]'s situation, it is not realistic to lose data when the leader is stopped and one of the replicas becomes the leader, _if_ the required acks are greater than 1. Kafka maintains a 'high watermark' for each partition, and for each request it waits for the required replicas to catch up with the leader before responding to the client. So as long as it is not a correlated failure (losing 2 replicas at once), it will work correctly. If there were 2 replicas in the ISR and 1 replica outside of it, and both ISR replicas die, then it is possible to bring up the third replica, and the new data that existed only on the original ISR replicas will be lost. Just to be sure: Kafka is a 'primary backup' replication system, so it doesn't tolerate correlated failures, as opposed to a quorum system, but it gives higher throughput in return. That's how it stands :)

> howto update continuously
> -------------------------
>
>                 Key: KAFKA-1530
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1530
>             Project: Kafka
>          Issue Type: Wish
>            Reporter: Stanislav Gilmulin
>            Assignee: Guozhang Wang
>            Priority: Minor
>              Labels: operating_manual, performance
>
> Hi,
>
> Could I ask you a question about the Kafka update procedure?
> Is there a way to update the software that doesn't require a service interruption or lead to data loss?
> We can't stop message brokering during the update, as we have a strict SLA.
>
> Best regards
> Stanislav Gilmulin
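Below is a rough, hypothetical sketch of the min-replicas guard described above. It is not the actual Partition.scala code; the names (MinIsrSketch, PartitionState, canAckProduce, canElectLeader) and the minInSyncReplicas setting are made up to illustrate the idea of refusing acks and leader re-election while the ISR is smaller than a configured minimum.

{code:scala}
// Hypothetical sketch only, not actual Kafka code. It illustrates the proposed
// "minimum replicas in ISR" guard around produce acks and leader election.
object MinIsrSketch {

  // Minimal stand-in for the per-partition state kept in Partition.scala.
  case class PartitionState(isr: Set[Int], assignedReplicas: Set[Int])

  // Hypothetical per-topic setting, analogous to the proposed min-replicas config.
  val minInSyncReplicas: Int = 2

  // Treat a produce request with acks > 1 (or acks = -1, i.e. "all") as committable
  // only while the ISR still has the configured minimum size, so a write is never
  // acknowledged when it lives on too few replicas.
  def canAckProduce(state: PartitionState, requiredAcks: Int): Boolean =
    requiredAcks == 0 || requiredAcks == 1 || state.isr.size >= minInSyncReplicas

  // Restrict leader re-election to partitions whose ISR still has the configured
  // size; otherwise electing a lagging replica could silently discard data that
  // clients already consider committed.
  def canElectLeader(state: PartitionState): Boolean =
    state.isr.size >= minInSyncReplicas

  def main(args: Array[String]): Unit = {
    val healthy  = PartitionState(isr = Set(1, 2, 3), assignedReplicas = Set(1, 2, 3))
    val degraded = PartitionState(isr = Set(1), assignedReplicas = Set(1, 2, 3))

    println(canAckProduce(healthy, requiredAcks = -1))  // true: full ISR
    println(canAckProduce(degraded, requiredAcks = -1)) // false: ISR below the minimum
    println(canElectLeader(degraded))                   // false: election restricted
  }
}
{code}

With such a guard, a partition whose ISR has shrunk below the minimum rejects high-acks writes and is skipped during re-election, trading some availability for not losing acknowledged data in the scenario described above.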