Here are the startup logs from a deployment where we lost 15 messages in topic-p: https://gist.github.com/josebrandao13/81271140e59e28eda7aaa777d2d3b02c
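For context, this is roughly how I've been inspecting the .timeindex files, using Kafka's DumpLogSegments tool (the log dir and segment name here are just examples, not our real paths):

  # Dumping a .timeindex checks each entry against the log segment and
  # reports mismatches such as "Index timestamp: 0, log timestamp: ..."
  bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
    --files /var/lib/kafka/data/topic-p-13/00000000000000823168.timeindex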
.timeindex file state before the deployment:
* Partitions with messages: timestamp mismatch
* Partitions without messages: permission denied

.timeindex file state after the deployment:
* All partitions without messages: permission denied; new files were created.

I don't see anything in particular in the logs, but you can see that the deleted messages have largestTime=0 and are from this morning.

On Mon, May 4, 2020 at 11:37 AM, JP MB <jose.brandao1...@gmail.com> wrote:

> Hi guys,
>
> I'm going to get back to this today. I have mixed feelings about the volumes being the cause. This volume switching has been around for quite some time, in a lot of clusters, and we only started noticing this problem when we updated some of them. Also, this only happens in *a few* of those .timeindex files, not in all of them. The .log and .index files, which are also on the volumes, don't have the problem.
>
> Additionally, I'm a bit confused about what the initial state of those .timeindex files should be. Sometimes I see "found log offset: -1", other times the timestamp mismatch error "Index timestamp: 0, log timestamp: 1588583643582", and sometimes something like "Indexed offset: 0, found log offset: 28".
>
> We have seen previously that whenever the timestamp mismatch error is present, we eventually lose messages after a deployment. Since this looks like the trigger for the problem, I would like to understand how it can happen. So my question is: how can each of those different initialization states of the .timeindex files come about? We can reproduce all of them when running Kafka locally.
>
> Meanwhile, I'm trying to reproduce the situation in our dev environments to get some startup logs, and I will play with the log flush settings.
>
> Regards
>
> On Sat, May 2, 2020 at 2:45 PM, Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz> wrote:
>
>> Good luck JP, do try it with the volume switching commented out, and see how it goes.
>>
>> On Fri, May 1, 2020 at 6:50 PM JP MB <jose.brandao1...@gmail.com> wrote:
>>
>>> Thank you very much for the help anyway.
>>>
>>> Best regards
>>>
>>> On Fri, May 1, 2020, 00:54 Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz> wrote:
>>>
>>>> So the logs show a healthy shutdown, so we can eliminate that as an issue. I would look next at the volume management during a rollout, based on the other error messages you had earlier about permission denied etc. It's possible there are some journalled but not yet flushed changes in those time indexes, but at this point we're getting into filesystem internals, which aren't my forte. But if you can temporarily disable the volume switching and do a test rollout, seeing whether or not you get the same problems would help confirm or eliminate it.
>>>>
>>>> Sorry I can't help further on that.
>>>>
>>>> On Fri, May 1, 2020 at 5:34 AM JP MB <jose.brandao1...@gmail.com> wrote:
>>>>
>>>>> This took a while because I needed logs of the server shutting down when this occurs. Here they are; I can see some errors:
>>>>> https://gist.github.com/josebrandao13/e8b82469d3e9ad91fbf38cf139b5a726
>>>>>
>>>>> Regarding systemd, the closest thing I could find to TimeoutStopSec was DefaultTimeoutStopUSec=1min 30s, which looks to be 90 seconds. I could not find any KillSignal or RestartKillSignal.
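>>>>> For reference, this is roughly the command I used to look (the unit name "kafka" is whatever your service is actually called):
>>>>>
>>>>>   # Show the effective stop timeout and kill signals for the unit
>>>>>   systemctl show kafka -p TimeoutStopUSec -p KillSignal -p SendSIGKILL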
>>>>> You can see the output of systemctl show --all here:
>>>>> https://gist.github.com/josebrandao13/f2dd646fab19b19f127981fce92d78c4
>>>>>
>>>>> Once again, thanks for the help.
>>>>>
>>>>> On Thu, Apr 30, 2020 at 3:04 PM, Liam Clarke-Hutchinson <liam.cla...@adscale.co.nz> wrote:
>>>>>
>>>>>> I'd also suggest eyeballing your systemd conf to verify that someone hasn't set a very low TimeoutStopSec, or that KillSignal/RestartKillSignal haven't been configured to SIGKILL (confusingly named, imo, as the default for KillSignal is SIGTERM).
>>>>>>
>>>>>> Also, the Kafka broker logs at shutdown look very different if it shut down correctly vs if it didn't. Could you perhaps put them in a Gist and email the link?
>>>>>>
>>>>>> Just trying to make sure basic assumptions are holding :)
>>>>>>
>>>>>> On Fri, 1 May 2020, 1:21 am JP MB, <jose.brandao1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> It's quite a complex script generated with ansible, where we use a/b deployment, and honestly I don't have full knowledge of it. I can share the general guidelines of what is done (a rough sketch follows the list):
>>>>>>>
>>>>>>>> - Any old volumes from previous releases are removed (named with suffix '-old')
>>>>>>>> - Detach the volumes attached to the old host
>>>>>>>> - Stop the service on the old host - uses systemctl stop kafka
>>>>>>>> - Attempt to create a CNAME volume: this is a volume with the same name that will be attached to the new box. Except for the very first run, this task is used to get the information about the existing volume. (no suffix)
>>>>>>>> - A new volume is created as a copy of the CNAME volume (named with suffix '-new')
>>>>>>>> - The new volume is attached to the host/vm (named with suffix '-new')
>>>>>>>> - The new volume is formatted (except for the very first run, it's already formatted) (named with suffix '-new')
>>>>>>>> - The new volume is mounted (named with suffix '-new')
>>>>>>>> - Start the service on the new host - uses systemctl start kafka
>>>>>>>> - If everything went well stopping/starting services:
>>>>>>>>   - The volume on the old host is renamed with suffix '-old'.
>>>>>>>>   - The new volume is renamed, stripping the suffix '-new'.
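>>>>>>> In shell terms, the swap looks roughly like this - purely a sketch; the volume_* helpers, hostnames, device and mount point are stand-ins for what the ansible actually does against our cloud provider:
>>>>>>>
>>>>>>>   volume_delete kafka-data-old                       # drop leftovers from the previous release
>>>>>>>   volume_detach kafka-data old-host
>>>>>>>   ssh old-host systemctl stop kafka
>>>>>>>   volume_create --copy-of kafka-data kafka-data-new  # copy of the CNAME volume
>>>>>>>   volume_attach kafka-data-new new-host
>>>>>>>   ssh new-host mount /dev/xvdf /var/lib/kafka        # already formatted except on the first run
>>>>>>>   ssh new-host systemctl start kafka
>>>>>>>   volume_rename kafka-data kafka-data-old            # only if stop/start went well
>>>>>>>   volume_rename kafka-data-new kafka-data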
>>>>>>> I made a new experiment today with some interesting findings. I had 518 messages in a given topic; after a deployment I lost 9 due to this problem, in partitions 13, 15, 16 and 17. All the errors I could find in the time index files before the deployment (left is the partition number):
>>>>>>>
>>>>>>>> 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no message loss here
>>>>>>>> 12 -> -1 error, no indexes on the log - base segment was the last offset, so ok
>>>>>>>> 13 -> timestamp mismatch error on 823168 - offsets from 823168 to 823172, four messages lost
>>>>>>>> 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no message loss here
>>>>>>>> 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one message missing
>>>>>>>> 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one message missing
>>>>>>>> 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two messages missing
>>>>>>>
>>>>>>> After the deployment, I took a look and the state was this:
>>>>>>>
>>>>>>>> 11 -> timestamp mismatch error on 685803 - same state
>>>>>>>> 12 -> -1 error, no indexes on the log - same state
>>>>>>>> 13 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>> 14 -> timestamp mismatch error on 619257 - same state
>>>>>>>> 15 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>> 16 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>> 17 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>
>>>>>>> Some conclusions at this point:
>>>>>>>
>>>>>>> - We only lost messages where the initial offset had a corrupted .timeindex file, that is, the base offset for the segment.
>>>>>>> - Immediately after the deployment, we were unable to open all the partitions where we lost messages: Permission denied.
>>>>>>>
>>>>>>> This was yesterday at the end of the day. Today I checked the number of messages and it was reduced from 509 to 493. Also, the state of the .timeindex files had changed:
>>>>>>>
>>>>>>>> 11 -> Exception in thread "main" java.io.IOException: Permission denied - changed state
>>>>>>>> 12 -> -1 error, no indexes on the log
>>>>>>>> 13 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>> 14 -> Exception in thread "main" java.io.IOException: Permission denied - changed state
>>>>>>>> 15 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>> 16 -> Exception in thread "main" java.io.IOException: Permission denied
>>>>>>>> 17 -> Exception in thread "main" java.io.IOException: Permission denied
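>>>>>>> For reference, I'm counting messages per partition roughly like this (broker address and topic name are examples; the count is just latest offset minus earliest offset, so it's only approximate if compaction or transactional markers are involved):
>>>>>>>
>>>>>>>   # Latest offset per partition, printed as topic:partition:offset
>>>>>>>   bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
>>>>>>>     --broker-list localhost:9092 --topic topic-p --time -1
>>>>>>>   # Earliest offset per partition
>>>>>>>   bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
>>>>>>>     --broker-list localhost:9092 --topic topic-p --time -2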
>>>>>>> So the partition 11 and 14 .timeindex files were the ones with the timestamp mismatch error that didn't lose messages immediately after the deployment. After the deployment was done, with the cluster already running, both changed to permission denied and *all the messages inside those partitions (11 & 14) were gone*. So this didn't happen only immediately after the rolling deployment, but also while the cluster was running.
>>>>>>>
>>>>>>> I manually restarted a broker with systemctl stop (took 2-3 seconds) & systemctl start, and all those "permission denied" errors were transformed into "-1 error, no indexes on the log", looking like the files were reset. The other brokers still have permission denied.
>>>>>>>
>>>>>>> Does this ring any bells? I don't really have an idea of what could be corrupting those index files.
>>>>>>>
>>>>>>> Next things I will check:
>>>>>>>
>>>>>>> - When exactly those messages were deleted in partitions 11 and 14.
>>>>>>> - What happens if I have .timeindex files with a "timestamp mismatch error" and manually stop and start a broker.
>>>>>>>
>>>>>>> Once again, thanks for the efforts on answering.
>>>>>>>
>>>>>>> On Thu, Apr 30, 2020 at 9:39 AM, Goran Sliskovic <gslis...@yahoo.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> It does look like index corruption... Can you post the script that stops Kafka?
>>>>>>>>
>>>>>>>> On Wednesday, April 29, 2020, 06:38:18 PM GMT+2, JP MB <jose.brandao1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>> Can you try using the console consumer to display messages/keys and timestamps?
>>>>>>>>>> --property print.key=true --property print.timestamp=true
>>>>>>>>>
>>>>>>>>> There are a lot of messages, so I'm picking one example without and one with a time index entry. All of them have a null key:
>>>>>>>>>
>>>>>>>>> Offset 57 CreateTime:1588074808027 Key:null - no time index
>>>>>>>>> Offset 144 CreateTime:1588157145655 Key:null - has time index
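>>>>>>>>> For reference, the full command was along these lines (broker address and topic name are examples):
>>>>>>>>>
>>>>>>>>>   bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
>>>>>>>>>     --topic topic-p --from-beginning \
>>>>>>>>>     --property print.timestamp=true --property print.key=true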
>>>>>>>>>> Hmm, how are you doing your rolling deploys?
>>>>>>>>>
>>>>>>>>> It's a rolling deployment: we take one node down and spin up a new one, one at a time.
>>>>>>>>>
>>>>>>>>>> I'm wondering if the time indexes are being corrupted by unclean shutdowns. I've been reading code, and the only path I could find that led to a largest timestamp of 0 was, as you've discovered, where there was no time index. WRT the corruption - the broker being SIGKILLed (systemctl by default sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to shut down cleanly) has caused index corruption for us in the past - although in our case it was recovered from automatically by the broker. It just took 2 hours.
>>>>>>>>>
>>>>>>>>> This would be a perfect justification for it, but we use systemctl stop and it takes around 4 seconds to shut down, so I believe it ends gracefully before SIGKILL?
>>>>>>>>>
>>>>>>>>>> Also, are you moving between versions with these deploys?
>>>>>>>>>
>>>>>>>>> No, we have several clusters where this is happening. The information I showed you is from a cluster with version 2.3, but with 0.10.2 for the inter-broker protocol and log format. We have also experienced this in fully updated 2.4 and 2.4.1 clusters. But to sum up, the experiments always deploy (again) the version that is already there.
>>>>>>>>>
>>>>>>>>> Thanks all for the efforts so far.
>>>>>>>>>
>>>>>>>>> ...