Re: corrupt recovery checkpoint file issue....

Jason Rosenberg Thu, 06 Nov 2014 20:57:38 -0800

filed: https://issues.apache.org/jira/browse/KAFKA-1758


On Thu, Nov 6, 2014 at 11:50 PM, Jason Rosenberg <j...@squareup.com> wrote:

> I'm still not sure what caused the reboot of the system (but yes, it
> appears to have crashed hard).  The file system is XFS, on CentOS Linux.
> I'm not yet sure, but I think the system might also have become wedged
> before the crash.
>
> After looking at the corrupt recovery files with od, it appears they
> actually contained all zero bytes.
>
> I'll file a Jira.
>
> On Thu, Nov 6, 2014 at 7:09 PM, Jun Rao <jun...@gmail.com> wrote:
>
>> I am also wondering how the corruption happened. The way that we update the
>> OffsetCheckpoint file is to first write to a tmp file and flush the data.
>> We then rename the tmp file to the final file. This is done to prevent
>> corruption caused by a crash in the middle of the writes. In your case, did
>> the host crash? What kind of storage system are you using? Is there any
>> non-volatile cache on the storage system?
>>
>> Thanks,
>>
>> Jun
>>
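For readers following the thread, the write-to-tmp, flush, then rename sequence described above can be sketched roughly as follows. This is a minimal Python illustration, not Kafka's actual Scala code: the function name `write_checkpoint_atomically` is hypothetical, and the file layout (a version line, an entry-count line, then one `topic partition offset` line per entry) is assumed for the example. The sketch also calls `os.fsync`, since flushing alone only hands the data to the operating system; whether the broker does the same is not stated in this thread.

```python
import os


def write_checkpoint_atomically(path, offsets):
    """Write {(topic, partition): offset} to `path` via a temp file and rename.

    Hypothetical helper; the on-disk layout is an assumption for illustration.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write("0\n")                 # assumed: format version line
        f.write(f"{len(offsets)}\n")   # assumed: number of entries
        for (topic, partition), offset in sorted(offsets.items()):
            f.write(f"{topic} {partition} {offset}\n")
        f.flush()                      # push Python's buffer to the OS
        os.fsync(f.fileno())           # ask the OS to push the data to storage
    # Rename the temp file over the final name in one step.
    os.replace(tmp_path, path)
    # Sync the containing directory so the rename itself is persisted (POSIX).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this ordering a reader should see either the old file or the complete new one, provided the storage stack honors the sync requests.
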
>> On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg <j...@squareup.com> wrote:
>>
>> > Hi,
>> >
>> > We recently had a Kafka node go down suddenly. When it came back up, it
>> > apparently had a corrupt recovery file, and refused to start up:
>> >
>> > 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error starting up KafkaServer
>> > java.lang.NumberFormatException: For input string:
>> >
>> >
>> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
>> >
>> >
>> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
>> >         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> >         at java.lang.Integer.parseInt(Integer.java:481)
>> >         at java.lang.Integer.parseInt(Integer.java:527)
>> >         at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>> >         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>> >         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
>> >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
>> >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
>> >         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> >         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>> >         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
>> >         at kafka.log.LogManager.<init>(LogManager.scala:57)
>> >         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
>> >         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
>> >
>> > The app runs under a monitor, so it was repeatedly restarting and
>> > failing with this error for several minutes before we got to it.
>> >
>> > We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and
>> > it then restarted cleanly (but of course re-synced all its data from
>> > replicas, so we had no data loss).
>> >
>> > Anyway, I’m wondering whether that’s the expected behavior. Shouldn’t it
>> > declare the file corrupt and then proceed automatically to an unclean
>> > restart?
>> >
>> > Should this NumberFormatException be handled a bit more gracefully?
>> >
>> > We saved the corrupt file in case it’s worth inspecting (although I
>> > doubt it will be useful!).
>> >
>> > Jason
>> > 
>> >
>>
>
>
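As a rough illustration of the "handle it more gracefully" option raised in the original message, the sketch below parses a checkpoint file defensively and falls back to an empty set of recovery points when the file is unreadable, instead of letting a number-parsing error abort startup. It is a Python sketch under stated assumptions, not Kafka's behavior: the names `CorruptCheckpointError`, `read_checkpoint` and `load_recovery_points` are hypothetical, and the file layout (version line, entry-count line, then `topic partition offset` lines) is assumed for the example. Whether a broker should proceed automatically at all is the policy question the thread leaves open.

```python
import logging


class CorruptCheckpointError(Exception):
    """Raised when a checkpoint file cannot be parsed (hypothetical name)."""


def read_checkpoint(path):
    """Parse the assumed layout: version, entry count, 'topic partition offset' lines."""
    with open(path, "rb") as f:
        raw = f.read()
    # A file made up only of NUL bytes is the failure reported in this thread.
    if raw and not raw.strip(b"\x00"):
        raise CorruptCheckpointError(f"{path}: file contains only zero bytes")
    try:
        lines = raw.decode("utf-8").splitlines()
        version = int(lines[0])
        expected = int(lines[1])
        offsets = {}
        for line in lines[2:]:
            topic, partition, offset = line.split(" ")
            offsets[(topic, int(partition))] = int(offset)
    except (ValueError, IndexError) as e:
        raise CorruptCheckpointError(f"{path}: malformed checkpoint ({e})") from e
    if version != 0 or len(offsets) != expected:
        raise CorruptCheckpointError(f"{path}: unexpected version or entry count")
    return offsets


def load_recovery_points(path):
    """Return parsed offsets, or {} (forcing full recovery) if missing or corrupt."""
    try:
        return read_checkpoint(path)
    except FileNotFoundError:
        return {}
    except CorruptCheckpointError as e:
        logging.warning("Ignoring corrupt checkpoint, doing full recovery: %s", e)
        return {}
```

Returning an empty mapping here corresponds to what moving the file out of the way achieved by hand: startup continues without any saved recovery points.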
