You probably don't need to set replica.lag.max.messages that high. You can
observe the max lag in JMX and set the value a bit higher than that.
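
Something along the following lines can read that gauge remotely. This is
only a rough sketch: it assumes the broker was started with JMX enabled
(e.g. JMX_PORT=9999) and that the follower lag gauge is exposed as
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica with a
"Value" attribute; the exact MBean name can differ between versions, so
confirm it with jconsole first.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReplicaMaxLagCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker runs with JMX enabled on port 9999; adjust host/port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Follower-side gauge: max lag, in messages, across the partitions
            // this broker fetches. MBean name is an assumption; verify per version.
            ObjectName maxLagBean = new ObjectName(
                    "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica");
            Object maxLag = mbsc.getAttribute(maxLagBean, "Value");
            System.out.println("Max replica lag (messages): " + maxLag);
        } finally {
            connector.close();
        }
    }
}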
Thanks,

Jun

On Fri, Jul 18, 2014 at 11:20 AM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
LEX -) <jwu...@bloomberg.net> wrote:

> We tested ack=-1 with replica.lag.max.messages=1000000000000. In this
> config no message loss was found.
>
> This is the only config we found that satisfies 1. no message loss and 2.
> the service stays available when a single broker is down. Are there other
> configs that can achieve the same, or stronger, consistency while keeping
> this level of availability?
>
> Thanks,
> Jiang
>
> ----- Original Message -----
> From: wangg...@gmail.com
> To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), users@kafka.apache.org
> At: Jul 16 2014 11:34:13
>
> Selecting replicas in ISR based on their fetched messages would be quite
> complicated since it requires the controller to keep track of this
> information in sync. An alternative solution to your issue would be to
> either use ack=-1 or reduce replica.lag.time.max.ms so that followers that
> are not keeping up closely are dropped out of the ISR more quickly.
>
> Guozhang
>
>
> On Wed, Jul 16, 2014 at 5:44 AM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
> LEX -) <jwu...@bloomberg.net> wrote:
>
> > Guozhang,
> >
> > So this is the cause of the message loss in my test where acks=2 and
> > replicas=3: at one moment all 3 replicas, leader L and followers F1 and
> > F2, are in the ISR. A publisher sends a message m to L. F1 fetches m.
> > Both L and F1 acknowledge m, so the send() is successful. Before F2
> > fetches m, L is killed and leader election takes place, and F2 is
> > selected as the new leader. After F2 becomes the leader, it doesn't
> > replicate m from F1, so consumers won't receive the message m.
> >
> > It seems to me that the election here is an unclean leader election that
> > can be avoided. For example, instead of just choosing the first live
> > broker in the ISR as the new leader, choosing the one that has fetched
> > the most messages as the new leader may avoid the message loss in the
> > above scenario. Is this a feasible fix?
> >
> > Thanks,
> > Jiang
> >
> > ----- Original Message -----
> > From: wangg...@gmail.com
> > To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -)
> > At: Jul 15 2014 16:30:56
> >
> > That is true: when a broker becomes a new leader it will stop
> > replicating data from others. However, what you may want to do is tune
> > the following configs so that replicas will not easily drop out of the
> > ISR under high produce load:
> >
> > replica.lag.max.messages
> >
> > replica.lag.time.max.ms
> >
> > You can get their description here:
> >
> > http://kafka.apache.org/documentation.html#brokerconfigs
> >
> > Guozhang
> >
> >
> > On Tue, Jul 15, 2014 at 1:25 PM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
> > LEX -) <jwu...@bloomberg.net> wrote:
> >
> > > When ack=-1 and the number of publisher threads is high, it always
> > > happens that only the leader remains in the ISR, and shutting down the
> > > leader will cause message loss.
> > >
> > > The leader election code shows that the new leader will be the first
> > > alive broker in the ISR list. So it's possible the new leader will be
> > > behind the followers.
> > >
> > > It seems that after a broker becomes a leader, it stops replicating
> > > from others even when it hasn't received all available messages?
> > >
> > > Regards,
> > > Jiang
> > >
> > > ----- Original Message -----
> > > From: wangg...@gmail.com
> > > To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), users@kafka.apache.org
> > > At: Jul 15 2014 16:11:17
> > >
> > > That could be the cause, and it can be verified by changing the acks
> > > to -1 and then checking the data loss ratio.
> > >
> > > Guozhang
> > >
> > >
> > > On Tue, Jul 15, 2014 at 12:49 PM, Jiang Wu (Pricehistory) (BLOOMBERG/
> > > 731 LEX -) <jwu...@bloomberg.net> wrote:
> > >
> > > > Guozhang,
> > > >
> > > > My coworker came up with an explanation: at one moment the leader L
> > > > and two followers F1, F2 are all in the ISR. The producer sends a
> > > > message m1 and receives acks from L and F1. Before the message is
> > > > replicated to F2, L goes down. In the following leader election, F2,
> > > > instead of F1, becomes the leader, and loses m1 somehow.
> > > >
> > > > Could that be the root cause?
> > > >
> > > > Thanks,
> > > > Jiang
> > > >
> > > > From: users@kafka.apache.org At: Jul 15 2014 15:05:25
> > > > To: users@kafka.apache.org
> > > > Subject: Re: message loss for sync producer, acks=2, topic replicas=3
> > > >
> > > > Guozhang,
> > > >
> > > > Please find the config below:
> > > >
> > > > Producer:
> > > >
> > > > props.put("producer.type", "sync");
> > > >
> > > > props.put("request.required.acks", 2);
> > > >
> > > > props.put("serializer.class", "kafka.serializer.StringEncoder");
> > > >
> > > > props.put("partitioner.class", "kafka.producer.DefaultPartitioner");
> > > >
> > > > props.put("message.send.max.retries", "60");
> > > >
> > > > props.put("retry.backoff.ms", "300");
> > > >
> > > > Consumer:
> > > >
> > > > props.put("zookeeper.session.timeout.ms", "400");
> > > >
> > > > props.put("zookeeper.sync.time.ms", "200");
> > > >
> > > > props.put("auto.commit.interval.ms", "1000");
> > > >
> > > > Broker:
> > > > num.network.threads=2
> > > > num.io.threads=8
> > > > socket.send.buffer.bytes=1048576
> > > > socket.receive.buffer.bytes=1048576
> > > > socket.request.max.bytes=104857600
> > > > num.partitions=2
> > > > log.retention.hours=168
> > > > log.retention.bytes=20000000
> > > > log.segment.bytes=536870912
> > > > log.retention.check.interval.ms=60000
> > > > log.cleaner.enable=false
> > > > zookeeper.connection.timeout.ms=1000000
> > > >
> > > > Topic:
> > > > Topic:p1r3 PartitionCount:1 ReplicationFactor:3
> > > > Configs:retention.bytes=10000000000
> > > >
> > > > Thanks,
> > > > Jiang
> > > >
> > > > From: users@kafka.apache.org At: Jul 15 2014 13:59:03
> > > > To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), users@kafka.apache.org
> > > > Subject: Re: message loss for sync producer, acks=2, topic replicas=3
> > > >
> > > > What config property values did you use on producer/consumer/broker?
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Tue, Jul 15, 2014 at 10:32 AM, Jiang Wu (Pricehistory) (BLOOMBERG/
> > > > 731 LEX -) <jwu...@bloomberg.net> wrote:
> > > >
> > > > > Guozhang,
> > > > > I'm testing on 0.8.1.1; just kill pid, no -9.
> > > > > Regards,
> > > > > Jiang
> > > > >
> > > > > From: users@kafka.apache.org At: Jul 15 2014 13:27:50
> > > > > To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), users@kafka.apache.org
> > > > > Subject: Re: message loss for sync producer, acks=2, topic replicas=3
> > > > >
> > > > > Hello Jiang,
> > > > >
> > > > > Which version of Kafka are you using, and did you kill the broker
> > > > > with -9?
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Tue, Jul 15, 2014 at 9:23 AM, Jiang Wu (Pricehistory) (BLOOMBERG/
> > > > > 731 LEX -) <jwu...@bloomberg.net> wrote:
> > > > >
> > > > > > Hi,
> > > > > > I observed some unexpected message loss in a Kafka fault-tolerance
> > > > > > test. In the test, a topic with 3 replicas is created. A sync
> > > > > > producer with acks=2 publishes to the topic. A consumer consumes
> > > > > > from the topic and tracks message ids. During the test, the leader
> > > > > > is killed. Both producer and consumer continue to run for a while.
> > > > > > After the producer stops, the consumer reports whether all
> > > > > > messages were received.
> > > > > >
> > > > > > The test was repeated for multiple rounds; message loss happened
> > > > > > in about 10% of the tests. A typical scenario is as follows:
> > > > > > before the leader is killed, all 3 replicas are in the ISR. After
> > > > > > the leader is killed, one follower becomes the leader, and 2
> > > > > > replicas (including the new leader) are in the ISR. Both the
> > > > > > producer and consumer pause for several seconds during that time,
> > > > > > and then continue. Message loss happens after the leader is
> > > > > > killed.
> > > > > >
> > > > > > Because the new leader is in the ISR before the old leader is
> > > > > > killed, unclean leader election doesn't explain the message loss.
> > > > > >
> > > > > > I'm wondering if anyone else has also observed such message loss?
> > > > > > Is there any known issue that may cause the message loss in the
> > > > > > above scenario?
> > > > > >
> > > > > > Thanks,
> > > > > > Jiang
> > > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> > > --
> > > -- Guozhang
> >
> > --
> > -- Guozhang
>
> --
> -- Guozhang
>
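
To make the setting discussed above concrete, here is a minimal sketch of
the producer side with request.required.acks=-1, using the 0.8 sync
producer's Java API. The broker list and message payload are placeholders;
the topic name p1r3 and the retry settings are taken from the configs
posted earlier in the thread.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AckAllProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; replace with the real hosts.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092");
        props.put("producer.type", "sync");
        // -1 waits for acknowledgement from all replicas currently in the ISR,
        // rather than a fixed count such as 2.
        props.put("request.required.acks", "-1");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("message.send.max.retries", "60");
        props.put("retry.backoff.ms", "300");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        try {
            producer.send(new KeyedMessage<String, String>("p1r3", "test-message"));
        } finally {
            producer.close();
        }
    }
}

Note that -1 ties durability to the current ISR, so the broker-side
replica.lag.max.messages and replica.lag.time.max.ms settings discussed
above still determine how aggressively slow followers are removed from it.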