[ https://issues.apache.org/jira/browse/KAFKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858064#comment-13858064 ]
Hanish Bansal commented on KAFKA-1193:
--------------------------------------

If I do not perform the 5th step properly (i.e. "If there is only one broker node in isr list then wait for some time and again check isr status of topic. There should be 2 brokers in isr list."), then I see logs like:

{quote}
[2013-12-23 10:25:07,648] DEBUG [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [test-trunk111,1]. Pick the leader from the alive assigned replicas: 1 (kafka.controller.OfflinePartitionLeaderSelector)
[2013-12-23 10:25:07,648] WARN [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [test-trunk111,1]. Elect leader 1 from live brokers 1. There's potential data loss. (kafka.controller.OfflinePartitionLeaderSelector)
[2013-12-23 10:25:07,649] INFO [OfflinePartitionLeaderSelector]: Selected new leader and ISR {"leader":1,"leader_epoch":1,"isr":[1]} for offline partition [test-trunk111,1] (kafka.controller.OfflinePartitionLeaderSelector)
{quote}

In this case, where only one broker is in the ISR list, I experienced 50-60% data loss, whereas in the case where both brokers are in the ISR list I experienced only 2-3% data loss.

> Data loss if broker is killed using kill -9
> -------------------------------------------
>
> Key: KAFKA-1193
> URL: https://issues.apache.org/jira/browse/KAFKA-1193
> Project: Kafka
> Issue Type: Bug
> Components: replication
> Affects Versions: 0.8.0, 0.8.1
> Environment: Centos 6.3
> Reporter: Hanish Bansal
> Assignee: Neha Narkhede
>
> We have a Kafka cluster of 2 nodes (using Kafka 0.8.0).
> Replication Factor: 2
> Number of partitions: 2
>
> Actual Behaviour:
> -------------------------
> Of the two nodes, if the leader node goes down, data loss occurs.
>
> Steps to Reproduce:
> ------------------------------
> 1. Create a 2 node Kafka cluster with replication factor 2.
> 2. Start the Kafka cluster.
> 3. Create a topic, let's say "test-trunk111".
> 4. Restart any one node.
> 5. Check topic status using the kafka-list-topic tool.
>    Topic isr status is:
>    topic: test-trunk111 partition: 0 leader: 0 replicas: 1,0 isr: 0,1
>    topic: test-trunk111 partition: 1 leader: 0 replicas: 0,1 isr: 0,1
>    If there is only one broker node in the isr list, then wait for some time
>    and check the isr status of the topic again. There should be 2 brokers in
>    the isr list.
> 6. Start producing data.
> 7. Kill the leader node (broker-0 in our case) while data is being produced.
> 8. After all data is produced, start the consumer.
> 9. Observe the behaviour. There is data loss.
>    After the leader goes down, topic isr status is:
>    topic: test-trunk111 partition: 0 leader: 1 replicas: 1,0 isr: 1
>    topic: test-trunk111 partition: 1 leader: 1 replicas: 0,1 isr: 1
>
> We have tried the things below to avoid data loss:
> ----------------------------------------------------------------
> 1. Configured "request.required.acks=-1" in the producer configuration,
>    because as mentioned in the documentation
>    http://kafka.apache.org/documentation.html#producerconfigs, setting this
>    value to -1 provides a guarantee that no messages will be lost.
> 2. Increased "message.send.max.retries" from 3 to 10 in the producer
>    configuration.
> 3. Set "controlled.shutdown.enable" to true in the broker configuration.
> 4. Tested with Kafka-0.8.1 after applying the patch KAFKA-1188.patch
>    available on https://issues.apache.org/jira/browse/KAFKA-1188
>
> None of the above helped when the leader node is killed using
> "kill -9 <pid>".
>
> Expected Behaviour:
> ----------------------------
> No data should be lost.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
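
Step 5 of the reproduction depends on confirming that every assigned replica is back in the ISR before producing. The following is only a rough sketch of how that check could be automated. It assumes the status text has exactly the "topic: ... partition: ... leader: ... replicas: ... isr: ..." shape quoted in this report; the function names `parse_status` and `under_replicated` are hypothetical helpers and not part of any Kafka tool.

```python
import re

# Matches one partition entry in the status format quoted in this report.
# Assumption: fields always appear in this order and brokers are integer ids.
LINE_RE = re.compile(
    r"topic:\s*(?P<topic>\S+)\s+partition:\s*(?P<partition>\d+)\s+"
    r"leader:\s*(?P<leader>-?\d+)\s+replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"isr:\s*(?P<isr>[\d,]*)"
)


def parse_status(text):
    """Return one dict per partition entry found in the status text."""
    partitions = []
    for m in LINE_RE.finditer(text):
        partitions.append({
            "topic": m.group("topic"),
            "partition": int(m.group("partition")),
            "leader": int(m.group("leader")),
            "replicas": [int(b) for b in m.group("replicas").split(",") if b],
            "isr": [int(b) for b in m.group("isr").split(",") if b],
        })
    return partitions


def under_replicated(partitions):
    """Return the partitions whose ISR is missing an assigned replica."""
    return [p for p in partitions if set(p["isr"]) != set(p["replicas"])]
```

A test script could poll the status output and only start the producer once `under_replicated(parse_status(output))` returns an empty list, which corresponds to "There should be 2 brokers in isr list" in step 5.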