OK, so yeah, it doesn't sound like that is what happened on this partition.
FWIW, all 8 other partitions were fine.

What I ended up doing was deleting partition N and all partitions > N
(which was about 1 hour of data; given we had just had an 8-hour outage,
this seemed justifiable).

I can take a copy of the logs and try to search through them to find out
what happened. For future reference, is there a way to force the Kafka
server to perform recovery on all the segments?

If there had been a main() I could have run on the file to perform recovery
manually, that would have been superb. Not sure how easy that would be to
wire up?
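
As an aside, the contiguity rule Jay describes in the quoted message below
(each segment's name equals the previous segment's name plus that file's
length) could be checked with a few lines of script before restarting the
broker. This is only a rough sketch, not a Kafka tool: it assumes 0.7-style
segment files whose name is the zero-padded starting byte offset followed by
a .kafka suffix, and find_gaps is a hypothetical helper name.

```python
import os

SUFFIX = ".kafka"  # assumption: 0.7-style segment files end in .kafka


def find_gaps(log_dir):
    """Return (segment, expected_offset, actual_offset) for every segment
    whose name is not the previous segment's offset plus its size in bytes."""
    names = [n for n in os.listdir(log_dir) if n.endswith(SUFFIX)]
    # Assumption: the file name minus the suffix is the starting byte offset.
    segments = sorted((int(n[:-len(SUFFIX)]), n) for n in names)
    gaps = []
    for (prev_start, prev_name), (start, name) in zip(segments, segments[1:]):
        prev_size = os.path.getsize(os.path.join(log_dir, prev_name))
        expected = prev_start + prev_size
        if start != expected:
            gaps.append((name, expected, start))
    return gaps
```

An empty list would mean the names in that partition directory are
contiguous; any entries point at the files that would need renaming, or at
the prefix that could be deleted instead.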




On Sat, Apr 13, 2013 at 1:32 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> I think the error you are seeing is due to the gap in the log. In 0.7
> we validate that the log is contiguous and by deleting a segment you
> are missing a chunk which would lead to problems in fetches for
> offsets in that range. You can always safely delete a prefix of the
> log (i.e. that segment and everything before it). You could also
> rename the files to be contiguous if you had a lot of patience (i.e.
> the nth file needs to have a name corresponding to the name of the
> (n-1)st file + length((n-1)st file), if that makes sense).
>
> I guess the forensics question is how we ended up rolling the log with
> an invalid message; the broker should kill itself and then fix the log
> on recovery when that occurs.
>
> -Jay
>
> On Sat, Apr 13, 2013 at 11:27 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > You should be able to just bounce the broker. Our default policy is
> > that if we run out of space we shut down the broker automatically as
> > in that case there is no guarantee on what has been written to disk.
> > On startup if a clean shutdown hasn't been performed the broker should
> > run a recovery procedure on the log that includes checksumming all
> > messages. Invalid messages will be removed. Sounds like this isn't
> > what happened?
> >
> > -Jay
> >
> > On Sat, Apr 13, 2013 at 10:30 AM, Matthew Rathbone
> > <matt...@foursquare.com> wrote:
> >> Hey guys,
> >>
> >> Due to a disk filling up, one of the segments has an invalid message
> >> in it.
> >> I have verified this using DumpLogSegments.
> >>
> >> How do I deal with this now? The invalid message is causing our Hadoop
> >> Consumer to fail.
> >>
> >> Is there a way to remove the invalid message from the segment? Removing
> >> the whole segment causes the broker to fail on startup with an error:
> >>
> >> java.lang.IllegalStateException: The following segments don't validate:
> >> <bad-file>, <bad-file+1>
> >>
> >> I'm happy losing that file if needs be, but I need to get this broker
> >> back up asap.
> >>
> >> --
> >> Matthew Rathbone
> >> Foursquare | Software Engineer | Server Engineering Team
> >> matt...@foursquare.com | @rathboma <http://twitter.com/rathboma> |
> >> 4sq<http://foursquare.com/rathboma>
>



-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matt...@foursquare.com | @rathboma <http://twitter.com/rathboma> |
4sq<http://foursquare.com/rathboma>
