I'm still not sure what caused the system to reboot (but yes, it appears to
have crashed hard). The file system is XFS, on CentOS Linux. I'm not yet sure,
but I think the system may also have become wedged before the crash.

It appears the corrupt recovery files actually contained all zero bytes, after
looking at them with od.

I'll file a Jira.
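
For reference, the write-to-a-tmp-file-then-rename update that Jun describes
below looks roughly like the sketch here. This is only a minimal illustration
of the pattern, not the actual kafka.server.OffsetCheckpoint code; the
writeCheckpoint name, the one-entry-per-line layout, and the explicit sync are
my assumptions.

    import java.io.{File, FileOutputStream, IOException, OutputStreamWriter}
    import java.nio.charset.StandardCharsets

    object CheckpointWriteSketch {
      // Hypothetical helper showing the tmp-write, flush, rename pattern.
      // Each entry is (topic, partition) -> recovery offset.
      def writeCheckpoint(target: File, offsets: Map[(String, Int), Long]): Unit = {
        val tmp = new File(target.getAbsolutePath + ".tmp")
        val fos = new FileOutputStream(tmp)
        val writer = new OutputStreamWriter(fos, StandardCharsets.UTF_8)
        try {
          offsets.foreach { case ((topic, partition), offset) =>
            writer.write(s"$topic $partition $offset\n")
          }
          writer.flush()
          fos.getFD.sync() // push the tmp file's data to disk before the rename
        } finally {
          writer.close()
        }
        // Swap the new file in with a rename, so after a crash a reader should
        // see either the old complete file or the new complete file.
        if (!tmp.renameTo(target))
          throw new IOException(s"Failed to rename $tmp to $target")
      }
    }

The rename is what makes the update look atomic; whether the data actually
reaches stable storage before the rename is recorded depends on the flush/sync
behavior of the file system and any caching in the storage stack.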

On Thu, Nov 6, 2014 at 7:09 PM, Jun Rao <jun...@gmail.com> wrote:

> I am also wondering how the corruption happened. The way that we update the
> OffsetCheckpoint file is to first write to a tmp file and flush the data.
> We then rename the tmp file to the final file. This is done to prevent
> corruption caused by a crash in the middle of the writes. In your case, did
> the host crash? What kind of storage system are you using? Is there any
> non-volatile cache on the storage system?
>
> Thanks,
>
> Jun
>
> On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg <j...@squareup.com> wrote:
>
> > Hi,
> >
> > We recently had a Kafka node go down suddenly. When it came back up, it
> > apparently had a corrupt recovery file and refused to start up:
> >
> > 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error starting up KafkaServer
> > java.lang.NumberFormatException: For input string:
> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
> >         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> >         at java.lang.Integer.parseInt(Integer.java:481)
> >         at java.lang.Integer.parseInt(Integer.java:527)
> >         at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
> >         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
> >         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
> >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
> >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
> >         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> >         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> >         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
> >         at kafka.log.LogManager.<init>(LogManager.scala:57)
> >         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
> >         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
> >
> > Since the app runs under a monitor, it was repeatedly restarting and
> > failing with this error for several minutes before we got to it.
> >
> > We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it
> > then restarted cleanly (but of course it re-synced all its data from
> > replicas, so we had no data loss).
> >
> > Anyway, I’m wondering if that’s the expected behavior? Or should it instead
> > declare the file corrupt and proceed automatically with an unclean restart?
> >
> > Should this NumberFormatException be handled a bit more gracefully?
> >
> > We saved the corrupt file in case it’s worth inspecting (although I doubt it
> > will be useful!)…
> >
> > Jason
> >
>
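
Regarding the question above about handling this more gracefully: here is a
minimal sketch of what a more defensive checkpoint read could look like. It is
not the actual kafka.server.OffsetCheckpoint.read code; the one-entry-per-line
"topic partition offset" layout, the readCheckpoint name, and the
CorruptCheckpointException type are my own assumptions.

    import java.io.File
    import scala.io.Source

    // Hypothetical exception type: the caller can catch it and fall back to a
    // full (unclean) recovery instead of refusing to start.
    class CorruptCheckpointException(msg: String) extends RuntimeException(msg)

    object CheckpointReadSketch {
      // Defensive read: any malformed line (including the all-NUL content we
      // saw) is reported as a corrupt checkpoint rather than surfacing as a
      // raw NumberFormatException during broker startup.
      def readCheckpoint(file: File): Map[(String, Int), Long] = {
        val source = Source.fromFile(file, "UTF-8")
        try {
          source.getLines().filter(_.nonEmpty).map { line =>
            line.split(" ") match {
              case Array(topic, partition, offset) =>
                try {
                  ((topic, partition.toInt), offset.toLong)
                } catch {
                  case _: NumberFormatException =>
                    throw new CorruptCheckpointException(
                      s"Malformed line in ${file.getPath}: '$line'")
                }
              case _ =>
                throw new CorruptCheckpointException(
                  s"Unexpected line in ${file.getPath}: '$line'")
            }
          }.toMap
        } finally {
          source.close()
        }
      }
    }

The log-loading code could then catch CorruptCheckpointException, log a warning,
and treat the checkpoint as absent (recovering the affected logs), rather than
aborting startup and leaving an operator to move the file aside by hand.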
