Hi James, There was a bug in 0.11.0.0 that could cause all segments to be scanned during a restart. I believe that was fixed in subsequent 0.11.0.x releases.
Ismael On Fri, Jan 12, 2018 at 6:49 AM, James Cheng <wushuja...@gmail.com> wrote: > We saw this as well, when updating from 0.10.1.1 to 0.11.0.1. > > Have you restarted your brokers since then? Did it take 8h to start up > again, or did it take its normal 45 minutes? > > I don't think it's related to the crash/recovery. Rather, I think it's due > to the upgrade from 0.10.1.1 to 0.11.0.1 > > I think 0.11.0.0 introduced some new files in the log directories (maybe > the .snapshot files?). The first time 0.11.0.0 (or newer) started up, it > scanned the entire .log files to create... something. It scanned *all* the > segments, not just the most recent ones. I think that's why it took so > long. I think normally log recovery only looks at the most recent segments. > > We noticed this only on the FIRST boot when running 0.11+. From then on, > startups were our normal length of time. > > In your https://pastebin.com/tZqze4Ya, I see a line like: > [2018-01-05 13:43:34,776] INFO Completed load of log webapi-event-1 with 2 > log segments, log start offset 1068104 and log end offset 1236587 in 9547 > ms (kafka.log.Log) > > That line says that that partition took 9547ms (9.5 seconds) to > load/recover. We had large partitions that took 30 minutes to recover, on > that first boot. When I used strace to see what I/O the broker was doing, > it was reading ALL the segments for the partitions. > > -James > > > > > On Jan 11, 2018, at 10:56 AM, Vincent Rischmann <vinc...@rischmann.fr> > wrote: > > > > If anyone else has any idea, I'd love to hear it. > > > > Meanwhile, I'll resume upgrading my brokers and hope it doesn't crash > and/or take so much time for recovery. > > > > On Sat, Jan 6, 2018, at 7:25 PM, Vincent Rischmann wrote: > >> Hi, > >> > >> just to clarify: this is the cause of the crash > >> https://pastebin.com/GuF60kvF in the broker logs, which is why I > >> referenced https://issues.apache.org/jira/browse/KAFKA-4523 > >> > >> I had this crash some time ago and yesterday was in the process of > >> upgrading my brokers to 0.11.0.2 in part to address this bug, > >> unfortunately while stopping this particular broker it crashed. > >> > >> What I don't understand is why the recovery time after upgrading was so > >> high. A couple of month ago when a broker crashed due to this bug and > >> recovered it didn't take nearly as long. In fact, I never had a recovery > >> that long on any broker, even when they suffered a kernel panic or power > >> failure. > >> > >> We have quite a bit of data, however as I said with 0.10.1.1 it "only" > >> took around 45 minutes. The broker did do a lot of I/O while recovering > >> (to the point where even just running ls was painfully slow) but afair > >> it was the same every time a broker recovered. Volume of data hasn't > >> changed much since the last crash with 0.10.1.1, in fact I removed a lot > >> of data recently. > >> > >> Our brokers are running with 3 SATA disks in raid 0 (using mdadm), while > >> recovering yesterday atop reported around 200MB/s of read throughput. > >> > >> Here are some graphs from our monitoring: > >> > >> - CPU usage https://vrischmann.me/files/fr3/cpu.png > >> - Disk IO https://vrischmann.me/files/fr3/disk_io.png and > >> https://vrischmann.me/files/fr3/disk_usage.png > >> > >> On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote: > >>> Ismael: > >>> We're on the same page. > >>> > >>> 0.11.0.2 was released on 17 Nov 2017. > >>> > >>> By 'recently' in my previous email I meant the change was newer. > >>> > >>> Vincent: > >>> Did the machine your broker ran on experience power issue ? > >>> > >>> Cheers > >>> > >>> On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ism...@juma.me.uk> wrote: > >>> > >>>> Hi Ted, > >>>> > >>>> The change you mention is not part of 0.11.0.2. > >>>> > >>>> Ismael > >>>> > >>>> On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >>>> > >>>>> bq. WARN Found a corrupted index file due to requirement failed: > Corrupt > >>>>> index found, index file > >>>>> (/data/kafka/data-processed-15/00000000000054942918.index) > >>>>> > >>>>> Can you search backward for 00000000000054942918.index in the log to > see > >>>> if > >>>>> we can find the cause for corruption ? > >>>>> > >>>>> This part of code was recently changed by : > >>>>> > >>>>> KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log > >>>>> recovery > >>>>> > >>>>> Cheers > >>>>> > >>>>> On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann < > vinc...@rischmann.fr> > >>>>> wrote: > >>>>> > >>>>>> Here's an excerpt just after the broker started: > >>>>>> https://pastebin.com/tZqze4Ya > >>>>>> > >>>>>> After more than 8 hours of recovery the broker finally started. I > >>>> haven't > >>>>>> read through all 8 hours of log but the parts I looked at are like > the > >>>>>> pastebin. > >>>>>> > >>>>>> I'm not seeing much in the log cleaner logs either, they look > normal. > >>>> We > >>>>>> have a couple of compacted topics but seems only the consumer > offsets > >>>> is > >>>>>> ever compacted (the other topics don't have much traffic). > >>>>>> > >>>>>> On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote: > >>>>>>> What do the broker logs say its doing during all that time? > >>>>>>> > >>>>>>> There are some consumer offset / log cleaner bugs which caused us > >>>>>> similarly > >>>>>>> log delays. that was easily visible by watching the log cleaner > >>>>> activity > >>>>>> in > >>>>>>> the logs, and in our monitoring of partition sizes watching them go > >>>>> down, > >>>>>>> along with IO activity on the host for those files. > >>>>>>> > >>>>>>> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann < > >>>>> vinc...@rischmann.fr> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hello, > >>>>>>>> > >>>>>>>> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this > >>>> bug > >>>>>>>> https://issues.apache.org/jira/browse/KAFKA-4523 > >>>>>>>> <https://issues.apache.org/jira/browse/KAFKA-4523> > >>>>>>>> Unfortunately while stopping one broker, it crashed exactly > because > >>>>> of > >>>>>>>> this bug. No big deal usually, except after restarting Kafka in > >>>>>> 0.11.0.2 > >>>>>>>> the recovery is taking a really long time. > >>>>>>>> I have around 6TB of data on that broker, and before when it > >>>> crashed > >>>>> it > >>>>>>>> usually took around 30 to 45 minutes to recover, but now I'm at > >>>>> almost > >>>>>>>> 5h since Kafka started and it's still not recovered. > >>>>>>>> I'm wondering what could have changed to have such a dramatic > >>>> effect > >>>>> on > >>>>>>>> recovery time ? Is there maybe something I can tweak to try to > >>>> reduce > >>>>>>>> the time ? > >>>>>>>> Thanks. > >>>>>>>> > >>>>>> > >>>>> > >>>> > >