Hi Guozhang,

Nope, I was not using exactly-once mode. I don't have the client logs with me right now; I will try to replicate the issue and share the other details with you.
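For reference, the app was on the default processing guarantee; exactly-once is opt-in in the 0.11 Streams API. A minimal sketch of what enabling it would have looked like (the class name and broker list are placeholders, not our actual setup):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        // Application id as seen in the broker logs.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "KafkaCache_TEST15");
        // Placeholder brokers; substitute the real host:port list.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092");
        // The default is StreamsConfig.AT_LEAST_ONCE; exactly-once must be set explicitly.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}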
My concern was that it crashed my brokers as well.

-Sameer.

On Sat, Sep 9, 2017 at 1:51 AM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Sameer,
>
> I looked through your code, and here is what I figured: in the 0.11
> release we added the exactly-once feature (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
> ), which uses a transaction log (the internal topic named
> "__transaction_state") with a default replication factor of 3; this
> overrides your global config value of 2. Then, at around 12:30, the
> leader of the transaction log partitions kicked both replicas, 190 and
> 192, out of the ISR:
>
> [2017-09-05 12:30:31,256] INFO [GroupCoordinator 193]: Preparing to
> rebalance group KafkaCache_TEST15 with old generation 14
> (__consumer_offsets-27) (kafka.coordinator.group.GroupCoordinator)
>
> [2017-09-05 12:30:41,510] INFO Partition [__transaction_state,9] on
> broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,513] INFO Partition [__transaction_state,6] on
> broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,514] INFO Partition [__transaction_state,3] on
> broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,515] INFO Partition [__transaction_state,18] on
> broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,515] INFO Partition [__transaction_state,15] on
> broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,516] INFO Partition [__transaction_state,12] on
> broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,516] INFO Partition [__consumer_offsets,12] on
> broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,517] INFO Partition [__consumer_offsets,15] on
> broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>
> [2017-09-05 12:30:41,517] INFO Partition [__transaction_state,24] on
> broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>
> In the meantime, the replica fetchers on both 190 and 192 appear to have
> timed out on their fetch requests (note the large timestamp gaps in the
> logs):
>
> [2017-09-05 12:26:21,130] INFO Rolled new log segment for 'AdServe-4' in
> 1 ms. (kafka.log.Log)
> [2017-09-05 12:30:59,046] WARN [ReplicaFetcherThread-2-193]: Error in
> fetch to broker 193, request (type=FetchRequest, replicaId=190,
> maxWait=500, minBytes=1, maxBytes=10485760,
> fetchData={__consumer_offsets-21=(offset=0, logStartOffset=0,
> maxBytes=1048576)
>
> ...
>
> [2017-09-05 12:28:37,514] INFO Deleting index
> /data1/kafka/AdServe-5/00000000000405000294.timeindex.deleted
> (kafka.log.TimeIndex)
> [2017-09-05 12:30:59,042] WARN [ReplicaFetcherThread-2-193]: Error in
> fetch to broker 193, request (type=FetchRequest, replicaId=192,
> maxWait=500, minBytes=1, maxBytes=10485760,
> fetchData={__consumer_offsets-21=(offset=0, logStartOffset=0,
> maxBytes=1048576)
>
> This caused the NotEnoughReplicasException, since appends to the
> transaction log require acks=all with min.isr equal to the number of
> replicas; the resulting broker-side error is shown below.
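>
> If you want the transaction log to match the replication factor of the
> rest of your cluster, the broker settings to look at are
> transaction.state.log.replication.factor and transaction.state.log.min.isr.
> To double-check what the topic was actually created with, here is a
> sketch using the 0.11 AdminClient (the bootstrap address and class name
> are placeholders):
>
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.TopicDescription;
> import org.apache.kafka.common.TopicPartitionInfo;
>
> public class TxnTopicCheck {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Placeholder bootstrap server; use any live broker.
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092");
>         try (AdminClient admin = AdminClient.create(props)) {
>             TopicDescription td = admin.describeTopics(
>                 Collections.singleton("__transaction_state")).all().get()
>                 .get("__transaction_state");
>             // Compare replicas vs. ISR per partition; a shrunken ISR is
>             // what blocks the acks=all appends.
>             for (TopicPartitionInfo p : td.partitions()) {
>                 System.out.println(p.partition() + ": replicas="
>                     + p.replicas() + " isr=" + p.isr());
>             }
>         }
>     }
> }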
> The error on broker 193:
>
> [2017-09-05 12:32:11,612] ERROR [Replica Manager on Broker 193]: Error
> processing append operation on partition __transaction_state-18
> (kafka.server.ReplicaManager)
>
> org.apache.kafka.common.errors.NotEnoughReplicasException: Number of
> insync replicas for partition __transaction_state-18 is [1], below
> required minimum [3]
>
> Upon seeing this error, the transaction coordinator should retry the
> append, but if the retry never succeeds it will stay blocked. I did not
> see the Streams API client-side logs, so I cannot tell for sure why this
> caused the Streams app to fail as well. A quick question: did you enable
> processing.guarantee=exactly_once on your Streams app?
>
> Guozhang
>
> On Fri, Sep 8, 2017 at 1:34 AM, Sameer Kumar <sam.kum.w...@gmail.com>
> wrote:
>
> > Hi All,
> >
> > Any thoughts on the mail below?
> >
> > -Sameer.
> >
> > On Wed, Sep 6, 2017 at 12:28 PM, Sameer Kumar <sam.kum.w...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > I want to report a scenario wherein running two different instances
> > > of my stream application caused my brokers to crash, and eventually
> > > my stream application as well. This only happens when my brokers run
> > > Kafka 0.11; everything works fine if the brokers are on Kafka 0.10.2
> > > and the stream application is on Kafka 0.11.
> > >
> > > I am attaching the logs in a zipped format.
> > >
> > > Cluster configuration: 3 nodes (190, 192, 193), Kafka 0.11,
> > > topic replication factor 2.
> > >
> > > App configuration: Kafka 0.11 Streams.
> > >
> > > The error I saw on the 193 server was
> > > org.apache.kafka.common.errors.NotEnoughReplicasException: Number of
> > > insync replicas for partition __transaction_state-18 is [1], below
> > > required minimum [3]. Both the 192 and 190 servers reported errors
> > > about failing to read information from 193.
> > >
> > > Please look around 12:30-12:32 for the relevant logs. Let me know if
> > > you need any other information.
> > >
> > > Regards,
> > > -Sameer.
>
> --
> -- Guozhang