Yeah, it just popped up in my list. Thanks, I'll take a look. Vincent Dautremont, if you're still reading this, did you try upgrading to 0.11.0.1? Did it fix the issue?
On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <ben.davi...@7digital.com> wrote:

Hi Dmitriy,

Did you check out the thread "Incorrect consumer offsets after broker restart 0.11.0.0" from Phil Luckhurst? It sounds similar.

Thanks,

Ben

On Wed, Oct 11, 2017 at 4:44 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hey, I want to resurrect this thread.

We decided to do an idle test, where no data is produced to the topic at all. When we kill #101 or #102, nothing happens. But when we kill #200, consumers start to re-consume old events from a random position.

Anybody have ideas what to check? I really expected that Kafka would fail symmetrically with respect to any broker.

On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hi tao,

we had unclean leader election enabled at the beginning. But then we disabled it and also reduced the 'max.poll.records' value. It helped a little bit.

After today's testing there is a strong correlation between the lag spike and which broker we crash. For the lowest-id broker (100):

1. the lag is always at least 1-2 orders of magnitude higher
2. we start getting:

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

3. sometimes re-consumption from a random position

When we crash the other brokers (101, 102), it's just a lag spike on the order of ~10K that settles down quite quickly, with no consumer exceptions.

Totally lost as to what to try next.
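For reference, a minimal sketch of the kind of consumer tuning that exception message points at: smaller poll batches and a longer poll interval, so that one batch of processing always finishes before max.poll.interval.ms expires. The broker list, group id and all numbers below are placeholders, not values from this thread; the property names are from the 0.11 Java client.

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class TunedConsumerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list and group id (not taken from this thread).
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092,broker-101:9092,broker-102:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            // Smaller batches plus a longer poll interval: one poll()'s worth of
            // processing should always finish before max.poll.interval.ms expires.
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // illustrative
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // illustrative, 10 min
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");    // illustrative
            KafkaConsumer<String, String> consumer =
                    new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
            // ... subscribe, poll, process, commit as usual ...
            consumer.close();
        }
    }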
On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xiaotao...@gmail.com> wrote:

Do you have unclean leader election turned on? If killing 100 is the only way to reproduce the problem, it is possible with unclean leader election turned on that leadership was transferred to an out-of-ISR follower which may not have the latest high watermark.

On Sat, Oct 7, 2017 at 3:51 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

About to verify that hypothesis on Monday, but it looks like that in the latest tests. Need to double check.

On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <schiz...@gmail.com> wrote:

So no matter in what sequence you shut down brokers, it is only one that causes the major problem? That would indeed be a bit weird. Have you checked the offsets of your consumer right after they jump back: does it start from the beginning of the topic, or does it go back to some random position? Have you checked that all offsets are actually being committed by the consumers?

On Fri, Oct 6, 2017 at 8:59 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Yeah, probably we can dig around.

One more observation: the most lag/re-consumption trouble happens when we kill the broker with the lowest id (e.g. 100 from [100, 101, 102]). When crashing other brokers there is nothing special happening; the lag grows a little bit, but nothing crazy (e.g. thousands, not millions).

Does that sound suspicious?

On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <schiz...@gmail.com> wrote:

Ted: when choosing earliest/latest you are saying: if it happens that there is no "valid" offset committed for a consumer (for whatever reason: bug/misconfiguration/no luck), it is ok to start from the beginning or end of the topic. So if you are not ok with that, you should choose none.

Dmitriy: Ok. Then it is spring-kafka that maintains this offset-per-partition state for you. It might also have the problem of leaving stale offsets lying around. After quickly looking through https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java it looks possible, since the offsets map is not cleared upon partition revocation, but that is just a hypothesis. I have no experience with spring-kafka. However, since you say your consumers were always active, I find this theory worth investigating.

On Fri, Oct 6, 2017 at 18:20 GMT+02:00, Vincent Dautremont <vincent.dautrem...@olamobile.com.invalid> wrote:

Is there a way to read messages on a topic partition from a specific node that we choose (and not from the topic partition leader)? I would like to check myself that each of the __consumer_offsets partition replicas has the same consumer group offset written in it.
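As far as I know, the 0.11 Java consumer always fetches from the partition leader, so per-replica inspection is not directly exposed. What is easy to check right after an offset jump is what the coordinator has stored for the group versus the log end offset, either with kafka-consumer-groups.sh --describe or programmatically. A sketch with placeholder broker, group, topic and partition:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CommittedOffsetCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            try (KafkaConsumer<String, String> consumer =
                         new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
                TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition
                // committed() asks the group coordinator (the broker that leads the
                // group's __consumer_offsets partition) what offset is stored.
                OffsetAndMetadata committed = consumer.committed(tp);
                // Log end offset of the same partition, for comparison.
                long logEnd = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
                System.out.println("committed=" + (committed == null ? "none" : committed.offset())
                        + ", logEnd=" + logEnd);
            }
        }
    }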
On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Stas:

we rely on spring-kafka; it commits offsets "manually" for us after the event handler has completed. So it's kind of automatic once there is a constant stream of events (no idle time, which is true for us). Though it's not what the pure kafka-client calls "automatic" (flushing commits at fixed intervals).

On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <schiz...@gmail.com> wrote:

You don't have autocommit enabled, which means you commit offsets yourself - correct? If you store them per partition somewhere and fail to clean that up upon rebalance, then the next time the consumer gets this partition assigned during a rebalance it can commit an old stale offset - can this be the case?

On Fri, Oct 6, 2017 at 5:59 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Reprocessing the same events again is fine for us (idempotent), while losing data is more critical.

What are the reasons for such behaviour? Consumers are never idle and always committing, so probably something wrong with the broker setup then?

On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Stas:
bq. using anything but none is not really an option

If you have time, can you explain a bit more?

Thanks

On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schiz...@gmail.com> wrote:

If you set auto.offset.reset to none, next time it happens you will be in a much better position to find out what happened. Also, in general, with the current semantics of the offset reset policy, IMO using anything but none is not really an option unless it is ok for the consumer to lose some data (latest) or reprocess it a second time (earliest).
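A sketch of what that looks like with the plain Java client: with auto.offset.reset=none the consumer throws instead of silently jumping, so where to restart becomes an explicit decision in application code. Broker, group and topic names below are placeholders.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class NoResetConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            // "none": fail loudly when the group has no valid committed offset,
            // instead of silently resetting to earliest or latest.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
            KafkaConsumer<String, String> consumer =
                    new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            try {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                // ... process and commit ...
            } catch (NoOffsetForPartitionException e) {
                // Committed offset missing or invalid: alert, then decide explicitly
                // where to restart (e.g. seek to a known-good offset).
                System.err.println("No valid committed offset: " + e.getMessage());
            } finally {
                consumer.close();
            }
        }
    }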
On Fri, Oct 6, 2017 at 5:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Should Kafka log a warning if log.retention.hours is lower than the number of hours specified by offsets.retention.minutes?

On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <manikumar.re...@gmail.com> wrote:

Normally, log.retention.hours (168 hrs) should be higher than offsets.retention.minutes (336 hrs)?
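For reference, both settings are broker-side (server.properties). The point of the comparison is that committed group offsets should be retained at least as long as consumers might need them; with the values quoted in this thread, the offsets outlive the topic data:

    # server.properties, values as quoted in this thread
    log.retention.hours=168            # topic data kept for 7 days
    offsets.retention.minutes=20160    # committed group offsets kept for 14 days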
On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hi Ted,

Broker: v0.11.0.0

Consumer:
kafka-clients v0.11.0.0
auto.offset.reset = earliest

On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yuzhih...@gmail.com> wrote:

What's the value of auto.offset.reset?

Which release are you using?

Cheers

On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hi all,

we have several times faced a situation where a consumer group started to re-consume old events from the beginning. Here is the scenario:

1. 3-broker Kafka cluster on top of a 3-node ZooKeeper ensemble
2. RF=3 for all topics
3. log.retention.hours=168 and offsets.retention.minutes=20160
4. running sustained load (pushing events)
5. doing disaster testing by randomly shutting down 1 of the 3 broker nodes (then provisioning a new broker back)

Several times after bouncing a broker we faced a situation where the consumer group started to re-consume old events.

Consumer group:

1. enable.auto.commit = false
2. tried graceful group shutdown, kill -9 and terminating AWS nodes
3. never experienced re-consumption for those cases

What can cause that re-consumption of old events? Is it related to bouncing one of the brokers? What should we search for in the logs? Any broker settings to try?

Thanks in advance.
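Since the whole setup hinges on "commit only after the handler has completed", here is roughly what that pattern looks like with the plain 0.11 Java client: a sketch of the general technique, not the actual spring-kafka internals, with placeholder broker, group and topic names. The rebalance listener only logs assignments and revocations, which helps correlate offset jumps with the broker bounces.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            KafkaConsumer<String, String> consumer =
                    new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
            consumer.subscribe(Collections.singletonList("my-topic"), // placeholder topic
                    new ConsumerRebalanceListener() {
                        @Override
                        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                            // Any per-partition state kept outside the consumer should be
                            // dropped here so a stale offset cannot be committed later.
                            System.out.println("revoked: " + partitions);
                        }
                        @Override
                        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                            System.out.println("assigned: " + partitions);
                        }
                    });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // ... handle the event (idempotent in this setup) ...
                }
                // Commit only after the whole batch has been processed,
                // the same ordering the spring-kafka container provides.
                consumer.commitSync();
            }
        }
    }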