If anyone else has any idea, I'd love to hear it. Meanwhile, I'll resume upgrading my brokers and hope it doesn't crash and/or take so much time for recovery.
On Sat, Jan 6, 2018, at 7:25 PM, Vincent Rischmann wrote: > Hi, > > just to clarify: this is the cause of the crash > https://pastebin.com/GuF60kvF in the broker logs, which is why I > referenced https://issues.apache.org/jira/browse/KAFKA-4523 > > I had this crash some time ago and yesterday was in the process of > upgrading my brokers to 0.11.0.2 in part to address this bug, > unfortunately while stopping this particular broker it crashed. > > What I don't understand is why the recovery time after upgrading was so > high. A couple of month ago when a broker crashed due to this bug and > recovered it didn't take nearly as long. In fact, I never had a recovery > that long on any broker, even when they suffered a kernel panic or power > failure. > > We have quite a bit of data, however as I said with 0.10.1.1 it "only" > took around 45 minutes. The broker did do a lot of I/O while recovering > (to the point where even just running ls was painfully slow) but afair > it was the same every time a broker recovered. Volume of data hasn't > changed much since the last crash with 0.10.1.1, in fact I removed a lot > of data recently. > > Our brokers are running with 3 SATA disks in raid 0 (using mdadm), while > recovering yesterday atop reported around 200MB/s of read throughput. > > Here are some graphs from our monitoring: > > - CPU usage https://vrischmann.me/files/fr3/cpu.png > - Disk IO https://vrischmann.me/files/fr3/disk_io.png and > https://vrischmann.me/files/fr3/disk_usage.png > > On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote: > > Ismael: > > We're on the same page. > > > > 0.11.0.2 was released on 17 Nov 2017. > > > > By 'recently' in my previous email I meant the change was newer. > > > > Vincent: > > Did the machine your broker ran on experience power issue ? > > > > Cheers > > > > On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ism...@juma.me.uk> wrote: > > > > > Hi Ted, > > > > > > The change you mention is not part of 0.11.0.2. > > > > > > Ismael > > > > > > On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > > > > > > bq. WARN Found a corrupted index file due to requirement failed: Corrupt > > > > index found, index file > > > > (/data/kafka/data-processed-15/00000000000054942918.index) > > > > > > > > Can you search backward for 00000000000054942918.index in the log to see > > > if > > > > we can find the cause for corruption ? > > > > > > > > This part of code was recently changed by : > > > > > > > > KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log > > > > recovery > > > > > > > > Cheers > > > > > > > > On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vinc...@rischmann.fr> > > > > wrote: > > > > > > > > > Here's an excerpt just after the broker started: > > > > > https://pastebin.com/tZqze4Ya > > > > > > > > > > After more than 8 hours of recovery the broker finally started. I > > > haven't > > > > > read through all 8 hours of log but the parts I looked at are like the > > > > > pastebin. > > > > > > > > > > I'm not seeing much in the log cleaner logs either, they look normal. > > > We > > > > > have a couple of compacted topics but seems only the consumer offsets > > > is > > > > > ever compacted (the other topics don't have much traffic). > > > > > > > > > > On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote: > > > > > > What do the broker logs say its doing during all that time? > > > > > > > > > > > > There are some consumer offset / log cleaner bugs which caused us > > > > > similarly > > > > > > log delays. that was easily visible by watching the log cleaner > > > > activity > > > > > in > > > > > > the logs, and in our monitoring of partition sizes watching them go > > > > down, > > > > > > along with IO activity on the host for those files. > > > > > > > > > > > > On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann < > > > > vinc...@rischmann.fr> > > > > > > wrote: > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this > > > bug > > > > > > > https://issues.apache.org/jira/browse/KAFKA-4523 > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-4523> > > > > > > > Unfortunately while stopping one broker, it crashed exactly > > > > > > > because > > > > of > > > > > > > this bug. No big deal usually, except after restarting Kafka in > > > > > 0.11.0.2 > > > > > > > the recovery is taking a really long time. > > > > > > > I have around 6TB of data on that broker, and before when it > > > crashed > > > > it > > > > > > > usually took around 30 to 45 minutes to recover, but now I'm at > > > > almost > > > > > > > 5h since Kafka started and it's still not recovered. > > > > > > > I'm wondering what could have changed to have such a dramatic > > > effect > > > > on > > > > > > > recovery time ? Is there maybe something I can tweak to try to > > > reduce > > > > > > > the time ? > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > >