Hello again. I had the same problem again today: while stopping the broker it crashed, and after upgrading to 0.11.0.2 and restarting, the broker is again taking a really long time to recover. It's been almost 3 hours now and it's not done.
I restarted the previous broker which crashed, but since it was a clean shutdown it didn't recompute the indexes, so it's not really the same. However, I think I can rule out a hardware problem, since another broker now has the same problem. After I've upgraded everything I might trigger a crash voluntarily to see if it reproduces the problem.

On Mon, Jan 15, 2018, at 10:04 AM, Ismael Juma wrote:
> Hi James,
>
> There was a bug in 0.11.0.0 that could cause all segments to be scanned
> during a restart. I believe that was fixed in subsequent 0.11.0.x releases.
>
> Ismael
>
> On Fri, Jan 12, 2018 at 6:49 AM, James Cheng <wushuja...@gmail.com> wrote:
>
> > We saw this as well, when updating from 0.10.1.1 to 0.11.0.1.
> >
> > Have you restarted your brokers since then? Did it take 8h to start up
> > again, or did it take its normal 45 minutes?
> >
> > I don't think it's related to the crash/recovery. Rather, I think it's due
> > to the upgrade from 0.10.1.1 to 0.11.0.1.
> >
> > I think 0.11.0.0 introduced some new files in the log directories (maybe
> > the .snapshot files?). The first time 0.11.0.0 (or newer) started up, it
> > scanned the entire .log files to create... something. It scanned *all* the
> > segments, not just the most recent ones. I think that's why it took so
> > long. I think normally log recovery only looks at the most recent segments.
> >
> > We noticed this only on the FIRST boot when running 0.11+. From then on,
> > startups were our normal length of time.
> >
> > In your https://pastebin.com/tZqze4Ya, I see a line like:
> > [2018-01-05 13:43:34,776] INFO Completed load of log webapi-event-1 with 2
> > log segments, log start offset 1068104 and log end offset 1236587 in 9547
> > ms (kafka.log.Log)
> >
> > That line says that that partition took 9547ms (9.5 seconds) to
> > load/recover. We had large partitions that took 30 minutes to recover, on
> > that first boot.
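James's point about the per-partition load times can be turned into a quick diagnostic: collecting the "Completed load of log" lines shows exactly where the startup time went. A minimal Python sketch, assuming the line format from the pastebin excerpt quoted above (the log file path is yours to supply):

```python
import re

# Matches lines like the one quoted from the pastebin:
# "... Completed load of log webapi-event-1 with 2 log segments,
#  log start offset 1068104 and log end offset 1236587 in 9547 ms ..."
LOAD_RE = re.compile(r"Completed load of log (\S+) with .* in (\d+) ms")

def slowest_partitions(lines, top=10):
    """Return (partition, load_ms) pairs, slowest first."""
    times = [(m.group(1), int(m.group(2)))
             for m in (LOAD_RE.search(ln) for ln in lines) if m]
    return sorted(times, key=lambda t: t[1], reverse=True)[:top]
```

Feeding the broker's server.log through `slowest_partitions(open("server.log"))` quickly shows whether a few huge partitions dominate the recovery, or whether every partition is being rescanned.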
> > When I used strace to see what I/O the broker was doing,
> > it was reading ALL the segments for the partitions.
> >
> > -James
> >
> > > On Jan 11, 2018, at 10:56 AM, Vincent Rischmann <vinc...@rischmann.fr> wrote:
> > >
> > > If anyone else has any idea, I'd love to hear it.
> > >
> > > Meanwhile, I'll resume upgrading my brokers and hope it doesn't crash
> > > and/or take so much time for recovery.
> > >
> > > On Sat, Jan 6, 2018, at 7:25 PM, Vincent Rischmann wrote:
> > >> Hi,
> > >>
> > >> just to clarify: this is the cause of the crash
> > >> https://pastebin.com/GuF60kvF in the broker logs, which is why I
> > >> referenced https://issues.apache.org/jira/browse/KAFKA-4523
> > >>
> > >> I had this crash some time ago, and yesterday I was in the process of
> > >> upgrading my brokers to 0.11.0.2 in part to address this bug;
> > >> unfortunately, while stopping this particular broker it crashed.
> > >>
> > >> What I don't understand is why the recovery time after upgrading was so
> > >> high. A couple of months ago when a broker crashed due to this bug and
> > >> recovered, it didn't take nearly as long. In fact, I never had a recovery
> > >> that long on any broker, even when they suffered a kernel panic or power
> > >> failure.
> > >>
> > >> We have quite a bit of data; however, as I said, with 0.10.1.1 it "only"
> > >> took around 45 minutes. The broker did do a lot of I/O while recovering
> > >> (to the point where even just running ls was painfully slow) but afair
> > >> it was the same every time a broker recovered. The volume of data hasn't
> > >> changed much since the last crash with 0.10.1.1; in fact I removed a lot
> > >> of data recently.
> > >>
> > >> Our brokers are running with 3 SATA disks in RAID 0 (using mdadm); while
> > >> recovering yesterday, atop reported around 200MB/s of read throughput.
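For what it's worth, the numbers in the thread are consistent with the "all segments get scanned" theory: reading the full ~6TB at the ~200MB/s that atop reported would by itself take roughly the 8+ hours observed. A quick back-of-envelope check:

```python
# Back-of-envelope check of the "all segments get scanned" theory,
# using the two figures from the thread: ~6 TB of data on the broker
# and ~200 MB/s of sustained read throughput reported by atop.
data_bytes = 6 * 10**12     # ~6 TB of log segments
read_rate = 200 * 10**6     # ~200 MB/s reads

hours = data_bytes / read_rate / 3600
print(f"full scan would take ~{hours:.1f} h")  # ~8.3 h
```

That is close enough to the observed recovery time that the disks being saturated by a full scan (rather than some per-partition pathology) looks like the simplest explanation.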
> > >>
> > >> Here are some graphs from our monitoring:
> > >>
> > >> - CPU usage: https://vrischmann.me/files/fr3/cpu.png
> > >> - Disk I/O: https://vrischmann.me/files/fr3/disk_io.png and
> > >>   https://vrischmann.me/files/fr3/disk_usage.png
> > >>
> > >> On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote:
> > >>> Ismael:
> > >>> We're on the same page.
> > >>>
> > >>> 0.11.0.2 was released on 17 Nov 2017.
> > >>>
> > >>> By 'recently' in my previous email I meant the change was newer.
> > >>>
> > >>> Vincent:
> > >>> Did the machine your broker ran on experience a power issue?
> > >>>
> > >>> Cheers
> > >>>
> > >>> On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> > >>>
> > >>>> Hi Ted,
> > >>>>
> > >>>> The change you mention is not part of 0.11.0.2.
> > >>>>
> > >>>> Ismael
> > >>>>
> > >>>> On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>>>
> > >>>>> bq. WARN Found a corrupted index file due to requirement failed:
> > >>>>> Corrupt index found, index file
> > >>>>> (/data/kafka/data-processed-15/00000000000054942918.index)
> > >>>>>
> > >>>>> Can you search backward for 00000000000054942918.index in the log to
> > >>>>> see if we can find the cause of the corruption?
> > >>>>>
> > >>>>> This part of the code was recently changed by:
> > >>>>>
> > >>>>> KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log
> > >>>>> recovery
> > >>>>>
> > >>>>> Cheers
> > >>>>>
> > >>>>> On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vinc...@rischmann.fr>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Here's an excerpt just after the broker started:
> > >>>>>> https://pastebin.com/tZqze4Ya
> > >>>>>>
> > >>>>>> After more than 8 hours of recovery the broker finally started. I
> > >>>>>> haven't read through all 8 hours of logs, but the parts I looked at
> > >>>>>> are like the pastebin.
> > >>>>>>
> > >>>>>> I'm not seeing much in the log cleaner logs either; they look normal.
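Ted's suggestion above — searching backward through the broker log for every mention of the suspect index file — can be scripted when the logs span many hours. A small helper; the function is hypothetical, with the file name quoted in the thread as its default:

```python
def find_index_mentions(lines, name="00000000000054942918.index"):
    """Return (line_number, line) pairs mentioning the suspect index file,
    so the events leading up to the corruption WARN can be traced in order."""
    return [(i, ln) for i, ln in enumerate(lines, start=1) if name in ln]
```

Running it over server.log (and any rolled-over log files) surfaces every delete, rename, or rebuild that touched the segment before the corruption warning appeared.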
> > >>>>>> We have a couple of compacted topics, but it seems only the consumer
> > >>>>>> offsets topic is ever compacted (the other topics don't have much
> > >>>>>> traffic).
> > >>>>>>
> > >>>>>> On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> > >>>>>>> What do the broker logs say it's doing during all that time?
> > >>>>>>>
> > >>>>>>> There are some consumer offset / log cleaner bugs which caused us
> > >>>>>>> similarly long delays. That was easily visible by watching the log
> > >>>>>>> cleaner activity in the logs, and in our monitoring of partition
> > >>>>>>> sizes (watching them go down), along with I/O activity on the host
> > >>>>>>> for those files.
> > >>>>>>>
> > >>>>>>> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vinc...@rischmann.fr>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hello,
> > >>>>>>>>
> > >>>>>>>> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this
> > >>>>>>>> bug: https://issues.apache.org/jira/browse/KAFKA-4523
> > >>>>>>>>
> > >>>>>>>> Unfortunately, while stopping one broker, it crashed exactly
> > >>>>>>>> because of this bug. No big deal usually, except after restarting
> > >>>>>>>> Kafka on 0.11.0.2 the recovery is taking a really long time.
> > >>>>>>>>
> > >>>>>>>> I have around 6TB of data on that broker, and before, when it
> > >>>>>>>> crashed, it usually took around 30 to 45 minutes to recover, but
> > >>>>>>>> now I'm at almost 5h since Kafka started and it's still not
> > >>>>>>>> recovered.
> > >>>>>>>>
> > >>>>>>>> I'm wondering what could have changed to have such a dramatic
> > >>>>>>>> effect on recovery time? Is there maybe something I can tweak to
> > >>>>>>>> try to reduce the time?
> > >>>>>>>>
> > >>>>>>>> Thanks.
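On the original closing question — whether anything can be tweaked to reduce recovery time — one broker setting that directly affects this phase is `num.recovery.threads.per.data.dir`, which defaults to 1, so recovery runs single-threaded per data directory. Raising it can help an array that sustains parallel reads; the value below is only an illustration, not a recommendation for this cluster:

```properties
# server.properties
# Threads per data directory used for log recovery at startup
# (and flushing at shutdown). Default is 1.
num.recovery.threads.per.data.dir=4
```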