Hi,
It's quite a complex script generated with Ansible, where we use A/B
deployment, and honestly I don't have full knowledge of it. I can share the
general guidelines of what is done (a rough shell sketch follows the list):

> - Any old volumes from previous releases are removed (named with suffix
> '-old')
> - Detach the volumes attached to the old host
> - Stop the service on the old host - uses systemctl stop kafka
> - Attempt to create a CNAME volume: this is a volume with the same name
> that will be attached to the new box. Except for the very first run, this
> task only gets the information about the existing volume. (no suffix)
> - A new volume is created as a copy of the CNAME volume (named with suffix
> '-new')
> - The new volume is attached to the host/VM (named with suffix '-new')
> - The new volume is formatted (except for the very first run, it is already
> formatted) (named with suffix '-new')
> - The new volume is mounted (named with suffix '-new')
> - Start the service on the new host - uses systemctl start kafka
> - If everything went well stopping/starting the services:
>    - The volume on the old host is renamed with suffix '-old'.
>    - The new volume is renamed, stripping the suffix '-new'.
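
To make the steps concrete, here is a minimal shell sketch of the flow as I
understand it. It is not the real Ansible playbook: the volume_* / run_on
helpers, host names, device and mount point are placeholders for whatever the
provider CLI and inventory actually use.

#!/usr/bin/env bash
# Sketch only: volume_* and run_on are hypothetical stand-ins that just echo
# what the real playbook would ask the provider CLI / the hosts to do.
set -euo pipefail

volume_delete() { echo "delete volume $1"; }
volume_detach() { echo "detach volume $1 from host $2"; }
volume_info()   { echo "describe volume $1"; }
volume_clone()  { echo "clone volume $1 into $2"; }
volume_attach() { echo "attach volume $1 to host $2"; }
volume_rename() { echo "rename volume $1 to $2"; }
run_on()        { echo "on $1: $2"; }

VOL=kafka-data        # placeholder name of the base ("CNAME") volume
OLD_HOST=kafka-a      # placeholder: box being replaced
NEW_HOST=kafka-b      # placeholder: box being brought up
DEV=/dev/vdb          # placeholder device the new volume appears as

volume_delete "${VOL}-old"                    # drop leftovers from the previous release
volume_detach "${VOL}" "${OLD_HOST}"          # detach the current volume from the old host
run_on "${OLD_HOST}" "systemctl stop kafka"   # stop the service on the old host
volume_info   "${VOL}"                        # "create" the CNAME volume; after the very first
                                              # run this only reads the existing volume's info
volume_clone  "${VOL}" "${VOL}-new"           # new volume as a copy of the CNAME volume
volume_attach "${VOL}-new" "${NEW_HOST}"      # attach it to the new VM
run_on "${NEW_HOST}" "blkid ${DEV} || mkfs.xfs ${DEV}"   # format only when empty (very first run)
run_on "${NEW_HOST}" "mount ${DEV} /var/lib/kafka"       # mount it
run_on "${NEW_HOST}" "systemctl start kafka"  # start the service on the new host
# Only if stopping/starting went well:
volume_rename "${VOL}" "${VOL}-old"           # old volume gets the '-old' suffix
volume_rename "${VOL}-new" "${VOL}"           # new volume drops the '-new' suffix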


I made a new experiment today with some interesting findings. I had 518
messages in a given topic and, after a deployment, lost 9 due to this problem
in partitions 13, 15, 16 and 17. These are all the errors I could find in the
time index files before the deployment (left is the partition number; how the
indexes are read is sketched after the two listings):

> 11 -> timestamp mismatch on 685803 - offsets from 685801 to 685805, no
> message loss here
> 12 -> -1 error no indexes on the log - base segment was the last offset so
> ok
> 13 -> timestamp mismatch error on 823168 - offsets from 823168 to 823172,
> four messages lost
> 14 -> timestamp mismatch on 619257 - offsets from 619253 to 619258, no
> message loss here
> 15 -> timestamp mismatch on 658783 - offsets from 658783 to 658784, one
> message missing
> 16 -> timestamp mismatch on 623508 - offsets from 623508 to 623509, one
> message missing
> 17 -> timestamp mismatch on 515479 - offsets from 515479 to 515481, two
> messages missing


After the deployment, I took a look and the state was this:

> 11 -> timestamp mismatch error on 685803 -   same state
> 12 -> -1 error no indexes on the log - same state
> 13 -> Exception in thread "main" java.io.IOException: Permission denied
> 14 -> timestamp mismatch error on 619257 - same state
> 15 -> Exception in thread "main" java.io.IOException: Permission denied
> 16 -> Exception in thread "main" java.io.IOException: Permission denied
> 17 -> Exception in thread "main" java.io.IOException: Permission denied
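
For reference, this is a minimal sketch of the kind of check behind the two
listings above, assuming kafka.tools.DumpLogSegments; the install path, log
directory and topic name are placeholders for my setup.

LOG_DIR=/var/lib/kafka/data     # assumed log.dirs location
TOPIC=my-topic                  # placeholder topic name

for p in 11 12 13 14 15 16 17; do
  echo "== partition ${p} =="
  # Dump every time index of the partition and look at what it reports.
  for idx in "${LOG_DIR}/${TOPIC}-${p}"/*.timeindex; do
    /opt/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files "${idx}"
  done
done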


Some conclusions at this point:

   - We only lost messages in the partitions where the offset with the
   corrupted .timeindex entry was the initial offset, that is, the base
   offset of the segment.
   - Immediately after the deployment, we were unable to open all the
   partitions where we lost messages: Permission denied (a quick ownership
   check for this is sketched below).
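
A quick sketch of the ownership check mentioned in the second conclusion,
comparing the index files' owner/mode with the user the broker actually runs
as. The log directory, topic name and the assumption that the broker runs as
a dedicated service user are all mine.

LOG_DIR=/var/lib/kafka/data      # placeholder log.dirs location
TOPIC=my-topic                   # placeholder topic name

# Partitions currently failing with Permission denied.
for p in 13 15 16 17; do
  stat -c '%n uid=%u gid=%g mode=%a' "${LOG_DIR}/${TOPIC}-${p}"/*.timeindex
done
ps -o user= -C java | sort -u    # user(s) the broker JVM is actually running as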

This was yesterday at the end of the day. Today I checked the number of
messages and it had been reduced from 509 to 493. The state of the
.timeindex files had also changed:

> 11 -> Exception in thread "main" java.io.IOException: Permission denied -
> changed state
> 12 -> -1 error no indexes on the log
> 13 -> Exception in thread "main" java.io.IOException: Permission denied
> 14 -> Exception in thread "main" java.io.IOException: Permission denied -
> changed state
> 15 -> Exception in thread "main" java.io.IOException: Permission denied
> 16 -> Exception in thread "main" java.io.IOException: Permission denied
> 17 -> Exception in thread "main" java.io.IOException: Permission denied


So the partition 11 and 14 timeindex files were the ones with the timestamp
mismatch error that didn't lose messages immediately after the deployment.
After the deployment was done, and with the cluster already running, both
changed to Permission denied and *all the messages inside those partitions
(11 & 14) were gone*. So this didn't happen only immediately after the
rolling deployment, but also while the cluster was running.

I manually restarted a broker with systemctl stop (took 2-3 seconds) &
systemctl start, and all those "permission denied" errors were transformed
into "-1 error no indexes on the log", as if the files had been reset. The
other brokers still have permission denied.
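
For reference, a sketch of that manual restart together with a quick way to
confirm the stop was clean, i.e. that systemd did not escalate to SIGKILL
(the unit name "kafka" is an assumption about our setup):

time sudo systemctl stop kafka        # took ~2-3 s here, well under the stop timeout
systemctl show kafka -p TimeoutStopUSec -p KillSignal    # what systemd would do on a slow stop
sudo journalctl -u kafka --since "10 minutes ago" | grep -iE 'sigkill|killing|timed out' || true
sudo systemctl start kafka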

Does this ring any bells for you? I don't really have an idea of what could
be corrupting those index files.

Next things I will check:

   - When exactly those messages were deleted in partitions 11 and 14 (a
   small watcher for this is sketched after this list).
   - What happens if I have timeindex files with a "timestamp mismatch
   error" and manually start and stop a broker.
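
A minimal sketch of the watcher mentioned in the first item, assuming
kafka.tools.GetOffsetShell; the broker address and topic name are
placeholders:

# Log the earliest offset of partitions 11 and 14 once a minute, so a jump
# shows when messages were removed. Broker and topic are placeholders.
while true; do
  date -u +%FT%TZ
  /opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list localhost:9092 --topic my-topic \
    --partitions 11,14 --time -2      # -2 = earliest available offset
  sleep 60
done | tee -a earliest-offsets.log

A jump in the earliest offset should show, to the minute, when the messages
disappeared.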

Once again, thanks for the effort in answering.

On Thu, Apr 30, 2020 at 09:39, Goran Sliskovic
<gslis...@yahoo.com.invalid> wrote:

>  Hi,
> It does look like index corruption... Can you post the script that stops kafka?
>
>
>     On Wednesday, April 29, 2020, 06:38:18 PM GMT+2, JP MB <
> jose.brandao1...@gmail.com> wrote:
>
>  >
> > Can you try using the console consumer to display messages/keys and
> > timestamps ?
> > --property print.key=true --property print.timestamp=true
>
>
> There are a lot of messages, so I'm picking one example without and one
> with a timeindex entry. All of them have a null key:
>
> Offset 57 CreateTime:1588074808027 Key:null  - no time index
> Offset 144 CreateTime:1588157145655 Key:null - has time index
>
>
> Hmm, how are you doing your rolling deploys?
>
> It's a rolling deployment: we take one node down and spin up a new one,
> one at a time.
>
> I'm wondering if the time indexes are being corrupted by unclean
> > shutdowns. I've been reading code and the only path I could find that led
> > to a largest
> > timestamp of 0 was, as you've discovered, where there was no time index.
> >  WRT to the corruption - the broker being SIGKILLed (systemctl by default
> > sends SIGKILL 90 seconds after SIGTERM, and our broker needed 120s to
> shut
> > down cleanly) has caused index corruption for us in the past - although
> in
> > our case it was recovered from automatically by the broker. Just took 2
> > hours.
>
>
> This would be a perfect explanation, but we use systemctl stop and it takes
> around 4 seconds to shut down, so I believe it ends gracefully before
> SIGKILL?
>
> Also, are you moving between versions with these deploys?
>
> No, we have several clusters where this is happening. The information I
> showed you is from a cluster running version 2.3, but with 0.10.2 for the
> inter-broker protocol and log format. We have also experienced this in
> fully updated 2.4 and 2.4.1 clusters. But to sum up, the experiments are
> always done by deploying (again) the version that is already there.
>
> Thanks all for the efforts so far.
>
>
> ...
