Todd Lipcon created HDFS-5058:
---------------------------------

             Summary: QJM should validate startLogSegment() more strictly
                 Key: HDFS-5058
                 URL: https://issues.apache.org/jira/browse/HDFS-5058
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: qjm
    Affects Versions: 3.0.0, 2.1.0-beta
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


A small handful of times, we've seen a case where one of the NNs in an HA 
cluster ends up with an fsimage checkpoint that falls in the middle of an edit 
segment. We're not sure yet how this happens, but the following issue can 
arise as a result:
- Node has fsimage_500. Cluster has edits_1-1000, edits_1001_inprogress
- Node restarts, loads fsimage_500
- Node wants to become active. It calls selectInputStreams(500). Currently, 
this API logs a WARN that 500 falls in the middle of the 1-1000 segment, but 
continues and returns no results.
- Node calls startLogSegment(501).

Currently, the QJM will accept this (incorrectly). The node then crashes when 
it first tries to journal a real transaction, but it leaves the 
edits_501_inprogress file behind, potentially causing more issues later.
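
As a rough illustration of what stricter validation could look like, the sketch below models a journal's segment bookkeeping and rejects a startLogSegment() call whose txid is already covered by a finalized segment. This is not the actual QJM code or the eventual fix: the class SegmentValidationSketch, its fields, and the rule "a new segment must start after the highest finalized txid" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model of a journal's segment bookkeeping.
 * None of these names come from the real QJM classes.
 */
class SegmentValidationSketch {

    /** A finalized segment covering [firstTxId, lastTxId], e.g. edits_1-1000. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;

        Segment(long firstTxId, long lastTxId) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    private final List<Segment> finalized = new ArrayList<>();
    private long inProgressStart = -1; // -1 means no in-progress segment

    void addFinalized(long firstTxId, long lastTxId) {
        finalized.add(new Segment(firstTxId, lastTxId));
    }

    long inProgressStart() {
        return inProgressStart;
    }

    /** Highest txid covered by any finalized segment, or 0 if there is none. */
    long highestFinalizedTxId() {
        long highest = 0;
        for (Segment s : finalized) {
            highest = Math.max(highest, s.lastTxId);
        }
        return highest;
    }

    /**
     * Assumed rule: a new segment may only start after every txid that is
     * already finalized. Starting at 501 when edits_1-1000 exists is refused
     * instead of silently creating edits_501_inprogress.
     */
    void startLogSegment(long txId) {
        long highest = highestFinalizedTxId();
        if (txId <= highest) {
            throw new IllegalStateException(
                "Cannot start segment at txid " + txId
                + ": txids up to " + highest + " are already finalized");
        }
        inProgressStart = txId;
    }
}
```

With edits_1-1000 recorded as finalized, startLogSegment(501) throws in this model, while startLogSegment(1001) is accepted. A real check would also have to account for existing in-progress segments and for journal nodes that lag behind the quorum, which this sketch ignores.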

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
