In 0.8, if you turn on replication, it may not matter too much if a broker takes a long time to start up, since data can still be served from the replicas. It may be possible to improve recovery time by maintaining a flush checkpoint file on disk. We can then use that info to reduce the amount of data that needs to be recovered.
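
To make the checkpoint idea concrete, here is a rough sketch in Python. It is not Kafka's implementation. The file format (JSON), the function names, and the meaning of the stored offset ("everything below this offset is known to be flushed") are all assumptions made for illustration. The point is only that, given a per-log flushed offset, recovery can skip every segment that lies entirely below it.

```python
import json
import os
import tempfile


def write_checkpoint(path, flushed_offsets):
    """Atomically persist {log_name: flushed_offset} (hypothetical format)."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the
    # old checkpoint so a crash never leaves a half-written file behind.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as tmp:
        json.dump(flushed_offsets, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Return the saved offsets, or {} if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        return {}


def segments_to_recover(segment_base_offsets, flushed_offset):
    """Return base offsets of the segments that may hold unflushed data.

    Assumes all messages below flushed_offset are safely on disk, so
    recovery starts at the segment containing flushed_offset.
    """
    bases = sorted(segment_base_offsets)
    start = 0
    for i, base in enumerate(bases):
        if base <= flushed_offset:
            start = i
    return bases[start:]
```

For example, with segments starting at offsets 0, 1000 and 2000 and a checkpointed offset of 1500, `segments_to_recover([0, 1000, 2000], 1500)` returns `[1000, 2000]`, so the first segment is skipped. With no checkpoint (offset 0), every segment is recovered, which matches the current behaviour after a hard shutdown.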
Thanks,

Jun

On Mon, May 6, 2013 at 3:07 PM, Jason Rosenberg <j...@squareup.com> wrote:

> Recently, we had an issue where our kafka brokers were shut down hard (and
> so did not write out the clean shutdown file). Thus on restart, it went
> through all logs and ran a recovery on them.
>
> Unfortunately, this took a long time (on the order of 30 minutes). We have
> a lot of topics (e.g. ~1000 or so). Is there any way this can be done more
> quickly, say in parallel?
>
> Also, can it be done as a background process, so the server can start up and
> start receiving messages, logs for incoming topics are prioritized in the
> recovery process, and perhaps messages can still be buffered in memory
> while the log recovery is happening?
>
> It seems onerous to block all activity for 30 minutes while a slow, serial,
> recovery job happens....
>
> Thoughts?
>
> Jason