Hi guys, I just wanted to inform you that we solved our issue: it was indeed related to the volume switching process and some permission mess. Thanks, everyone, for the effort in finding the root cause.

Now, as a question for a possible improvement: should Kafka ever allow largestTime to be 0 in any scenario? I mean, from what I understood, the .timeindex files were introduced to give more fine-grained control over segment retention: https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
Could we use the segment's .log file last-modified time as a fallback?
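To make the suggestion concrete, here is a minimal shell sketch of the fallback being proposed; the segment path and retention value are made-up examples, and this is not existing broker behaviour:

    # Hypothetical fallback: if a segment's largestTime is 0 (empty or corrupted
    # .timeindex), use the .log file's last-modified time for the retention check.
    seg=/var/kafkadata/data01/data/topic-p-13/00000000000000823168   # example segment
    largest_ms=0                                          # what the broker sees today
    if [ "$largest_ms" -eq 0 ]; then
      largest_ms=$(( $(stat -c %Y "$seg.log") * 1000 ))   # fall back to mtime, in ms
    fi
    retention_ms=604800000                                # e.g. retention.ms = 7 days
    now_ms=$(( $(date +%s) * 1000 ))
    if [ $(( now_ms - largest_ms )) -gt "$retention_ms" ]; then
      echo "segment $seg would be eligible for deletion"
    else
      echo "segment $seg would be retained"
    fi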
Best regards

On Tue, May 5, 2020 at 10:13, JP MB <jose.brandao1...@gmail.com> wrote:

Hi,
What I can see from the configurations:

> log.dir = /tmp/kafka-logs (default)
> log.dirs = /var/kafkadata/data01/data

From the documentation, log.dir is only used if log.dirs is not set, so /var/kafkadata/data01/data is the folder used for logs.

Regards

On Tue, May 5, 2020 at 08:52, Nicolas Carlot <nicolas.car...@chronopost.fr.invalid> wrote:

Hi guys, still following your discussion even if it's out of my reach. I've just been noticing that you use /tmp/ for your logs; dunno if it's a good idea :o https://issues.apache.org/jira/browse/KAFKA-3925

On Mon, May 4, 2020 at 19:40, JP MB <jose.brandao1...@gmail.com> wrote:

Here are the startup logs from a deployment where we lost 15 messages in topic-p:
https://gist.github.com/josebrandao13/81271140e59e28eda7aaa777d2d3b02c

.timeindex files state before the deployment:
* Partitions with messages: timestamp mismatch
* Partitions without messages: permission denied

.timeindex files state after the deployment:
* All partitions without messages: permission denied, new files were created.

I don't see anything in particular in the logs, but you can see that the messages deleted with largestTime=0 are from this morning.

On Mon, May 4, 2020 at 11:37, JP MB <jose.brandao1...@gmail.com> wrote:

Hi guys,

I'm going to get back to this today; I have mixed feelings about the volumes being the cause. This volume switching has been around for quite some time, in a lot of clusters, and we only started noticing this problem when we updated some of them. Also, this only happens in a few of those .timeindex files and not in all of them. The .log files and .index files, which are also on the volumes, don't have the problem either.

Additionally, I'm a bit confused about what the initial state of those .timeindex files should be. Sometimes I see "found log offset: -1", other times the timestamp mismatch error "Index timestamp: 0, log timestamp: 1588583643582", and sometimes something like "Indexed offset: 0, found log offset: 28".

We have seen previously that whenever the timestamp mismatch error is present, we eventually lose messages after a deployment. Since this looks like the trigger for the problem, I would like to understand how it can happen. So my question is: how can each of those different initialization states of the .timeindex files happen? We can reproduce all of them when running Kafka locally.

Meanwhile, I'm trying to reproduce the situation in our dev environments and get some startup logs, and I will play with the log flush settings.

Regards
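(Side note for anyone digging through the same thing: the states quoted above - "timestamp mismatch", "Indexed offset: ..., found log offset: ..." - look like the output of Kafka's index checks in the DumpLogSegments tool; a minimal sketch, with an example segment path:)

    # Inspect a segment's time index; add --index-sanity-check to only verify it
    # instead of printing every entry.
    bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
      --files /var/kafkadata/data01/data/topic-p-13/00000000000000823168.timeindex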
On Sat, May 2, 2020 at 14:45, Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz> wrote:

Good luck JP, do try it with the volume switching commented out and see how it goes.

On Fri, May 1, 2020 at 6:50 PM, JP MB <jose.brandao1...@gmail.com> wrote:

Thank you very much for the help anyway.

Best regards

On Fri, May 1, 2020 at 00:54, Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz> wrote:

So the logs show a healthy shutdown, so we can eliminate that as an issue. I would look next at the volume management during a rollout, based on the other error messages you had earlier about permission denied etc. It's possible there are some journalled but not yet flushed changes in those time indexes, but at this point we're getting into filesystem internals, which aren't my forte. But if you can temporarily disable the volume switching and do a test rollout, seeing whether you get the same problems or not would help eliminate it or confirm it.

Sorry I can't help further on that.

On Fri, May 1, 2020 at 5:34 AM, JP MB <jose.brandao1...@gmail.com> wrote:

I took a bit because I needed logs of the server shutting down when this occurs. Here they are; I can see some errors:
https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726

Regarding systemd, the closest I could find to TimeoutStopSec was DefaultTimeoutStopUSec=1min 30s, which looks to be 90 seconds. I could not find any KillSignal or RestartKillSignal. You can see the output of systemctl show --all here:
https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4

Once again, thanks for the help.

On Thu, Apr 30, 2020 at 15:04, Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz> wrote:

I'd also suggest eyeballing your systemd conf to verify that someone hasn't set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal haven't been configured to SIGKILL (confusingly named, imo, as the default for KillSignal is SIGTERM).

Also, the Kafka broker logs at shutdown look very different if it shut down correctly vs if it didn't. Could you perhaps put them in a Gist and email the link?

Just trying to make sure basic assumptions are holding :)
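(If it helps, a small sketch of making those stop settings explicit for the unit - the values are illustrative, assuming a systemd-managed service named "kafka":)

    # Drop-in override so the broker always gets SIGTERM and enough time to stop.
    sudo mkdir -p /etc/systemd/system/kafka.service.d
    printf '[Service]\nKillSignal=SIGTERM\nTimeoutStopSec=180\n' | \
      sudo tee /etc/systemd/system/kafka.service.d/override.conf
    sudo systemctl daemon-reload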
On Fri, 1 May 2020, 1:21 am, JP MB <jose.brandao1...@gmail.com> wrote:

Hi,
It's quite a complex script generated with Ansible, where we use a/b deployment, and honestly I don't have full knowledge of it. I can share the general guidelines of what is done (see the permission-check sketch after this list):

- Any old volumes (from previous releases) are removed (named with suffix '-old').
- Detach the volumes attached to the old host.
- Stop the service on the old host - uses systemctl stop kafka.
- Attempt to create a CNAME volume: this is a volume with the same name that will be attached to the new box. Except for the very first run, this task is used to get the information about the existing volume (no suffix).
- A new volume is created as a copy of the CNAME volume (named with suffix '-new').
- The new volume is attached to the host/VM (named with suffix '-new').
- The new volume is formatted (except for the very first run, it's already formatted) (named with suffix '-new').
- The new volume is mounted (named with suffix '-new').
- Start the service on the new host - uses systemctl start kafka.
- If everything went well stopping/starting services:
  - The volume on the old host is renamed with suffix '-old'.
  - The new volume is renamed, stripping the suffix '-new'.
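(In hindsight - the thread's resolution was "volume switching plus permissions" - a minimal guard that could sit between the "mounted" and "start the service" steps above; the mount point and service user are assumptions:)

    # Verify the freshly mounted data volume is owned/writable by the broker user
    # before starting Kafka (assumes the broker runs as the "kafka" user).
    mountpoint -q /var/kafkadata/data01 || { echo "data volume not mounted"; exit 1; }
    sudo chown -R kafka:kafka /var/kafkadata/data01/data
    sudo -u kafka test -w /var/kafkadata/data01/data \
      || { echo "kafka user cannot write to the log dir"; exit 1; }
    sudo systemctl start kafka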
I made a new experiment today with some interesting findings. We had 518 messages in a given topic; after a deployment we lost 9 due to this problem, in partitions 13, 15, 16 and 17. These are all the errors I could find in the time index files before the deployment (left is the partition number):

11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no message loss here
12 -> -1 error, no indexes on the log - base segment was the last offset, so OK
13 -> timestamp mismatch error on 823168 - offsets from 323168 to 823172, four messages lost
14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no message loss here
15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one message missing
16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one message missing
17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two messages missing

After the deployment, I took a look and the state was this:

11 -> timestamp mismatch error on 685803 - same state
12 -> -1 error, no indexes on the log - same state
13 -> Exception in thread "main" java.io.IOException: Permission denied
14 -> timestamp mismatch error on 619257 - same state
15 -> Exception in thread "main" java.io.IOException: Permission denied
16 -> Exception in thread "main" java.io.IOException: Permission denied
17 -> Exception in thread "main" java.io.IOException: Permission denied

Some conclusions at this point:

- We only lost messages where the corrupted .timeindex entry was on the initial offset, that is, the base offset of the segment.
- Immediately after the deployment, we were unable to open all the partitions where we lost messages: Permission denied.

This was yesterday at the end of the day; today I checked the number of messages and it was reduced from 509 to 493.
Also, the state of the .timeindex files has changed:

11 -> Exception in thread "main" java.io.IOException: Permission denied - changed state
12 -> -1 error, no indexes on the log
13 -> Exception in thread "main" java.io.IOException: Permission denied
14 -> Exception in thread "main" java.io.IOException: Permission denied - changed state
15 -> Exception in thread "main" java.io.IOException: Permission denied
16 -> Exception in thread "main" java.io.IOException: Permission denied
17 -> Exception in thread "main" java.io.IOException: Permission denied

So the partition 11 and 14 timeindex files were the ones with the timestamp mismatch error that didn't lose messages immediately after the deployment. After the deployment was done and while the cluster was already running, both changed to permission denied and all the messages inside those partitions (11 & 14) were gone. So this didn't happen only immediately after the rolling deployment, but also while the cluster was running.

I manually restarted a broker with systemctl stop (it took 2-3 seconds) & systemctl start, and all those "permission denied" errors were turned into "-1 error, no indexes on the log", as if the files had been reset. The other brokers still have permission denied.

Does this suggest anything to you? I don't really have an idea of what could be corrupting those index files.

Next things I will check:

- When exactly those messages were deleted in partitions 11 and 14.
- What happens if I have timeindex files with a "timestamp mismatch error" and manually start and stop a broker.

Once again, thanks for the efforts on answering.

On Thu, Apr 30, 2020 at 09:39, Goran Sliskovic <gslis...@yahoo.com.invalid> wrote:

Hi,
It does look like index corruption... Can you post the script that stops Kafka?

On Wednesday, April 29, 2020, 06:38:18 PM GMT+2, JP MB <jose.brandao1...@gmail.com> wrote:

> Can you try using the console consumer to display messages/keys and timestamps?
> --property print.key=true --property print.timestamp=true
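(For completeness, a full invocation along those lines - the broker address and topic name are placeholders:)

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
      --topic topic-p --from-beginning \
      --property print.key=true --property print.timestamp=true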
There are a lot of messages, so I'm picking one example without and one with a time index entry. All of them have a null key:

Offset 57 CreateTime:1588074808027 Key:null - no time index
Offset 144 CreateTime:1588157145655 Key:null - has time index

> Hmm, how are you doing your rolling deploys?

It's a rolling deployment: we take one node down and spin up a new one, one at a time.

> I'm wondering if the time indexes are being corrupted by unclean shutdowns. I've been reading code and the only path I could find that led to a largest timestamp of 0 was, as you've discovered, where there was no time index. WRT the corruption - the broker being SIGKILLed (systemctl by default sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to shut down cleanly) has caused index corruption for us in the past - although in our case it was recovered from automatically by the broker. It just took 2 hours.

This would be a perfect justification for it, but we use systemctl stop and it takes around 4 seconds to shut down, so I believe it ends gracefully before SIGKILL?

> Also, are you moving between versions with these deploys?

No, we have several clusters where this is happening. The information I showed you is from a cluster with version 2.3, but with 0.10.2 for the inter-broker protocol and log format. We have also experienced this in fully updated 2.4 and 2.4.1 clusters. But to sum up, the experiments always deploy (again) the version that is already there.

Thanks all for the efforts so far.

...

--
Nicolas Carlot
Lead dev | nicolas.car...@chronopost.fr