filed: https://issues.apache.org/jira/browse/KAFKA-1758
On Thu, Nov 6, 2014 at 11:50 PM, Jason Rosenberg <j...@squareup.com> wrote:

> I'm still not sure what caused the reboot of the system (but yes, it
> appears to have crashed hard). The file system is xfs, on CentOS Linux.
> I'm not yet sure, but I think the system might also have become wedged
> before the crash.
>
> It appears the corrupt recovery files actually contained all zero bytes,
> after looking at them with odb.
>
> I'll file a Jira.
>
> On Thu, Nov 6, 2014 at 7:09 PM, Jun Rao <jun...@gmail.com> wrote:
>
>> I am also wondering how the corruption happened. The way that we update
>> the OffsetCheckpoint file is to first write to a tmp file and flush the
>> data. We then rename the tmp file to the final file. This is done to
>> prevent corruption caused by a crash in the middle of the writes. In your
>> case, did the host crash? What kind of storage system are you using? Is
>> there any non-volatile cache on the storage system?
>>
>> Thanks,
>>
>> Jun
>>
>> On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg <j...@squareup.com> wrote:
>>
>> > Hi,
>> >
>> > We recently had a kafka node go down suddenly. When it came back up, it
>> > apparently had a corrupt recovery file, and refused to start up:
>> >
>> > 2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error starting up KafkaServer
>> > java.lang.NumberFormatException: For input string:
>> > "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
>> > ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
>> >     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> >     at java.lang.Integer.parseInt(Integer.java:481)
>> >     at java.lang.Integer.parseInt(Integer.java:527)
>> >     at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>> >     at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>> >     at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
>> >     at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
>> >     at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
>> >     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> >     at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>> >     at kafka.log.LogManager.loadLogs(LogManager.scala:105)
>> >     at kafka.log.LogManager.<init>(LogManager.scala:57)
>> >     at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
>> >     at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
>> >
>> > And since the app is under a monitor (so it was repeatedly restarting
>> > and failing with this error for several minutes before we got to it)…
>> >
>> > We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and
>> > it then restarted cleanly (but of course re-synced all its data from
>> > replicas, so we had no data loss).
>> >
>> > Anyway, I’m wondering if that’s the expected behavior? Or should it not
>> > declare it corrupt and then proceed automatically to an unclean restart?
>> >
>> > Should this NumberFormatException be handled a bit more gracefully?
>> >
>> > We saved the corrupt file if it’s worth inspecting (although I doubt it
>> > will be useful!)….
>> >
>> > Jason
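
For readers following the thread, the write-to-tmp, flush, rename update that Jun describes, together with the more graceful handling Jason asks about, can be sketched roughly as below. This is a minimal illustration in Python, not Kafka's actual Scala code. The file layout (a version line, an entry-count line, then one "topic partition offset" line per entry) and the names write_checkpoint, read_checkpoint, load_or_empty and CorruptCheckpointError are assumptions made for the example. Falling back to an empty checkpoint is one possible policy, not something the thread settles on.

```python
import os


class CorruptCheckpointError(Exception):
    """Raised when a checkpoint file cannot be parsed (hypothetical name)."""


def write_checkpoint(path, offsets):
    """Replace `path` with a checkpoint of {(topic, partition): offset}.

    The data goes to a temporary file first, is flushed and fsynced, and
    only then renamed over the final name, so a crash mid-write should
    leave either the old file or the new one, not a partial file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write("0\n")                    # assumed format version
        f.write(f"{len(offsets)}\n")      # assumed entry count line
        for (topic, partition), offset in sorted(offsets.items()):
            f.write(f"{topic} {partition} {offset}\n")
        f.flush()                         # Python buffer -> OS
        os.fsync(f.fileno())              # OS cache -> storage device
    os.replace(tmp_path, path)            # atomic rename over the old file


def read_checkpoint(path):
    """Parse a checkpoint, raising CorruptCheckpointError on bad content."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
        version = int(lines[0])           # a zero-filled file fails here
        if version != 0:
            raise ValueError(f"unknown version {version}")
        count = int(lines[1])
        entries = lines[2:]
        if len(entries) != count:
            raise ValueError("entry count does not match header")
        offsets = {}
        for line in entries:
            topic, partition, offset = line.split()
            offsets[(topic, int(partition))] = int(offset)
        return offsets
    except (ValueError, IndexError) as exc:
        raise CorruptCheckpointError(f"cannot parse {path}: {exc}") from exc


def load_or_empty(path):
    """One possible policy: treat a corrupt checkpoint as 'no checkpoint'."""
    try:
        return read_checkpoint(path)
    except CorruptCheckpointError:
        return {}                         # caller would recover from scratch
```

With this shape, a zero-filled file like the one in the stack trace surfaces as a single, named error that the caller can choose to treat as "no recovery point" instead of aborting startup. A stricter version might also fsync the containing directory after the rename so that the rename itself is durable.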