Hello again. I had the same problem again today: while stopping the broker it crashed, and after upgrading to 0.11.0.2 and restarting, the broker is again taking a really long time to recover. It's been almost 3 hours now and it's not done.
I restarted the previous broker which crashed, but since it was a clean shutdown it didn't recompute the indexes, so it's not really the same. However, I think I can rule out a hardware problem, since another broker now has the same problem. After I've upgraded everything I might trigger a crash voluntarily to see if it reproduces the problem.

On Mon, Jan 15, 2018, at 10:04 AM, Ismael Juma wrote:
> Hi James,
>
> There was a bug in 0.11.0.0 that could cause all segments to be scanned
> during a restart. I believe that was fixed in subsequent 0.11.0.x releases.
>
> Ismael
>
> On Fri, Jan 12, 2018 at 6:49 AM, James Cheng <wushuja...@gmail.com> wrote:
>
> > We saw this as well, when updating from 0.10.1.1 to 0.11.0.1.
> >
> > Have you restarted your brokers since then? Did it take 8h to start up
> > again, or did it take its normal 45 minutes?
> >
> > I don't think it's related to the crash/recovery. Rather, I think it's due
> > to the upgrade from 0.10.1.1 to 0.11.0.1.
> >
> > I think 0.11.0.0 introduced some new files in the log directories (maybe
> > the .snapshot files?). The first time 0.11.0.0 (or newer) started up, it
> > scanned the entire .log files to create... something. It scanned *all* the
> > segments, not just the most recent ones. I think that's why it took so
> > long. I think normally log recovery only looks at the most recent segments.
> >
> > We noticed this only on the FIRST boot when running 0.11+. From then on,
> > startups were our normal length of time.
> >
> > In your https://pastebin.com/tZqze4Ya, I see a line like:
> > [2018-01-05 13:43:34,776] INFO Completed load of log webapi-event-1 with 2
> > log segments, log start offset 1068104 and log end offset 1236587 in 9547
> > ms (kafka.log.Log)
> >
> > That line says that that partition took 9547ms (9.5 seconds) to
> > load/recover. We had large partitions that took 30 minutes to recover, on
> > that first boot.
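James's point about the per-partition load times can be turned into a quick diagnostic: collecting the "Completed load of log" lines shows exactly where the startup time went. A minimal Python sketch, assuming the line format from the pastebin excerpt quoted above (the log file path is yours to supply):

```python
import re

# Matches lines like the one quoted from the pastebin:
# "... Completed load of log webapi-event-1 with 2 log segments,
#  log start offset 1068104 and log end offset 1236587 in 9547 ms ..."
LOAD_RE = re.compile(r"Completed load of log (\S+) with .* in (\d+) ms")

def slowest_partitions(lines, top=10):
    """Return (partition, load_ms) pairs, slowest first."""
    times = [(m.group(1), int(m.group(2)))
             for m in (LOAD_RE.search(ln) for ln in lines) if m]
    return sorted(times, key=lambda t: t[1], reverse=True)[:top]
```

Feeding the broker's server.log through `slowest_partitions(open("server.log"))` quickly shows whether a few huge partitions dominate the recovery, or whether every partition is being rescanned.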
> > When I used strace to see what I/O the broker was doing,
> > it was reading ALL the segments for the partitions.
> >
> > -James
> >
> > > On Jan 11, 2018, at 10:56 AM, Vincent Rischmann <vinc...@rischmann.fr> wrote:
> > >
> > > If anyone else has any idea, I'd love to hear it.
> > >
> > > Meanwhile, I'll resume upgrading my brokers and hope it doesn't crash
> > > and/or take so much time for recovery.
> > >
> > > On Sat, Jan 6, 2018, at 7:25 PM, Vincent Rischmann wrote:
> > >> Hi,
> > >>
> > >> just to clarify: this is the cause of the crash
> > >> https://pastebin.com/GuF60kvF in the broker logs, which is why I
> > >> referenced https://issues.apache.org/jira/browse/KAFKA-4523
> > >>
> > >> I had this crash some time ago, and yesterday I was in the process of
> > >> upgrading my brokers to 0.11.0.2 in part to address this bug;
> > >> unfortunately, while stopping this particular broker it crashed.
> > >>
> > >> What I don't understand is why the recovery time after upgrading was so
> > >> high. A couple of months ago when a broker crashed due to this bug and
> > >> recovered, it didn't take nearly as long. In fact, I never had a recovery
> > >> that long on any broker, even when they suffered a kernel panic or power
> > >> failure.
> > >>
> > >> We have quite a bit of data; however, as I said, with 0.10.1.1 it "only"
> > >> took around 45 minutes. The broker did do a lot of I/O while recovering
> > >> (to the point where even just running ls was painfully slow) but afair
> > >> it was the same every time a broker recovered. The volume of data hasn't
> > >> changed much since the last crash with 0.10.1.1; in fact I removed a lot
> > >> of data recently.
> > >>
> > >> Our brokers are running with 3 SATA disks in RAID 0 (using mdadm); while
> > >> recovering yesterday, atop reported around 200MB/s of read throughput.
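For what it's worth, the numbers in the thread are consistent with the "all segments get scanned" theory: reading the full ~6TB at the ~200MB/s that atop reported would by itself take roughly the 8+ hours observed. A quick back-of-envelope check:

```python
# Back-of-envelope check of the "all segments get scanned" theory,
# using the two figures from the thread: ~6 TB of data on the broker
# and ~200 MB/s of sustained read throughput reported by atop.
data_bytes = 6 * 10**12     # ~6 TB of log segments
read_rate = 200 * 10**6     # ~200 MB/s reads

hours = data_bytes / read_rate / 3600
print(f"full scan would take ~{hours:.1f} h")  # ~8.3 h
```

That is close enough to the observed recovery time that the disks being saturated by a full scan (rather than some per-partition pathology) looks like the simplest explanation.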
> > >>
> > >> Here are some graphs from our monitoring:
> > >>
> > >> - CPU usage: https://vrischmann.me/files/fr3/cpu.png
> > >> - Disk I/O: https://vrischmann.me/files/fr3/disk_io.png and
> > >>   https://vrischmann.me/files/fr3/disk_usage.png
> > >>
> > >> On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote:
> > >>> Ismael:
> > >>> We're on the same page.
> > >>>
> > >>> 0.11.0.2 was released on 17 Nov 2017.
> > >>>
> > >>> By 'recently' in my previous email I meant the change was newer.
> > >>>
> > >>> Vincent:
> > >>> Did the machine your broker ran on experience a power issue?
> > >>>
> > >>> Cheers
> > >>>
> > >>> On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> > >>>
> > >>>> Hi Ted,
> > >>>>
> > >>>> The change you mention is not part of 0.11.0.2.
> > >>>>
> > >>>> Ismael
> > >>>>
> > >>>> On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>>>
> > >>>>> bq. WARN Found a corrupted index file due to requirement failed:
> > >>>>> Corrupt index found, index file
> > >>>>> (/data/kafka/data-processed-15/00000000000054942918.index)
> > >>>>>
> > >>>>> Can you search backward for 00000000000054942918.index in the log to
> > >>>>> see if we can find the cause of the corruption?
> > >>>>>
> > >>>>> This part of the code was recently changed by:
> > >>>>>
> > >>>>> KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log
> > >>>>> recovery
> > >>>>>
> > >>>>> Cheers
> > >>>>>
> > >>>>> On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vinc...@rischmann.fr>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Here's an excerpt just after the broker started:
> > >>>>>> https://pastebin.com/tZqze4Ya
> > >>>>>>
> > >>>>>> After more than 8 hours of recovery the broker finally started. I
> > >>>>>> haven't read through all 8 hours of logs, but the parts I looked at
> > >>>>>> are like the pastebin.
> > >>>>>>
> > >>>>>> I'm not seeing much in the log cleaner logs either; they look normal.
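Ted's suggestion above — searching backward through the broker log for every mention of the suspect index file — can be scripted when the logs span many hours. A small helper; the function is hypothetical, with the file name quoted in the thread as its default:

```python
def find_index_mentions(lines, name="00000000000054942918.index"):
    """Return (line_number, line) pairs mentioning the suspect index file,
    so the events leading up to the corruption WARN can be traced in order."""
    return [(i, ln) for i, ln in enumerate(lines, start=1) if name in ln]
```

Running it over server.log (and any rolled-over log files) surfaces every delete, rename, or rebuild that touched the segment before the corruption warning appeared.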
> > >>>>>> We have a couple of compacted topics, but it seems only the consumer
> > >>>>>> offsets topic is ever compacted (the other topics don't have much
> > >>>>>> traffic).
> > >>>>>>
> > >>>>>> On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> > >>>>>>> What do the broker logs say it's doing during all that time?
> > >>>>>>>
> > >>>>>>> There are some consumer offset / log cleaner bugs which caused us
> > >>>>>>> similarly long delays. That was easily visible by watching the log
> > >>>>>>> cleaner activity in the logs, and in our monitoring of partition
> > >>>>>>> sizes (watching them go down), along with I/O activity on the host
> > >>>>>>> for those files.
> > >>>>>>>
> > >>>>>>> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vinc...@rischmann.fr>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hello,
> > >>>>>>>>
> > >>>>>>>> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this
> > >>>>>>>> bug: https://issues.apache.org/jira/browse/KAFKA-4523
> > >>>>>>>>
> > >>>>>>>> Unfortunately, while stopping one broker, it crashed exactly
> > >>>>>>>> because of this bug. No big deal usually, except after restarting
> > >>>>>>>> Kafka on 0.11.0.2 the recovery is taking a really long time.
> > >>>>>>>>
> > >>>>>>>> I have around 6TB of data on that broker, and before, when it
> > >>>>>>>> crashed, it usually took around 30 to 45 minutes to recover, but
> > >>>>>>>> now I'm at almost 5h since Kafka started and it's still not
> > >>>>>>>> recovered.
> > >>>>>>>>
> > >>>>>>>> I'm wondering what could have changed to have such a dramatic
> > >>>>>>>> effect on recovery time? Is there maybe something I can tweak to
> > >>>>>>>> try to reduce the time?
> > >>>>>>>>
> > >>>>>>>> Thanks.
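On the original closing question — whether anything can be tweaked to reduce recovery time — one broker setting that directly affects this phase is `num.recovery.threads.per.data.dir`, which defaults to 1, so recovery runs single-threaded per data directory. Raising it can help an array that sustains parallel reads; the value below is only an illustration, not a recommendation for this cluster:

```properties
# server.properties
# Threads per data directory used for log recovery at startup
# (and flushing at shutdown). Default is 1.
num.recovery.threads.per.data.dir=4
```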