Re: Datanode fencing mechanism

Todd Lipcon Mon, 28 Oct 2013 20:01:02 -0700

Hi Liu Le,

You're correct. The persistent promise was part of the design but was
never implemented, which is an oversight. It's quite a rare circumstance,
but we should probably implement the persistent promise as you suggested.
Want to have a try at making a patch for trunk?


-Todd


On Mon, Oct 28, 2013 at 1:57 AM, lei liu <liulei...@gmail.com> wrote:

> The https://issues.apache.org/jira/browse/HDFS-1972 JIRA describes the
> following case:
> Scenario 3: DN restarts during split brain period
>
> (this scenario illustrates why I think we need to persistently record the
> promise about who is active)
>
>    - block has 2 replicas, user asks to reduce to 1
>    - NN1 adds the block to DN1's invalidation queue, but it's backed up
>    behind a bunch of other commands, so doesn't get issued yet.
>    - Failover occurs, but NN1 still thinks it's active.
>    - DN1 promises to NN2 not to accept commands from NN1. It sends an empty
>    deletion report to NN2. Then, it crashes.
>    - NN2 has received a deletion report from everyone, and asks DN2 to
>    delete the block. It hasn't realized that DN1 has crashed yet.
>    - DN2 deletes the block.
>
>
>    - DN1 starts back up. When it comes back up, it talks to NN1 first
>    (maybe it takes a while to connect to NN2 for some reason)
>       - ** Now, if we had saved the "promise" as part of persistent state,
>       we could ignore NN1 and avoid this issue. Otherwise:
>       - NN1 still thinks it's active, and sends a command to DN1 to delete
>       the block. DN1 does so.
>       - We lost the block.
>
>
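
The persistent promise in Scenario 3 above could be sketched roughly as
follows. This is a minimal illustration, not the actual DataNode code: the
class PersistentPromise, its methods, the on-disk file, and the idea of a
monotonically increasing "epoch" that identifies each NameNode's claim to
be active are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical sketch of a promise that survives a DataNode restart.
 * Assumes every NameNode command carries an epoch that increases at failover.
 */
public class PersistentPromise {
    private final Path file;
    private long promisedEpoch;

    public PersistentPromise(Path file) throws IOException {
        this.file = file;
        // Reload the promise on startup; -1 means no promise has been made yet.
        if (Files.exists(file)) {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            this.promisedEpoch = Long.parseLong(text.trim());
        } else {
            this.promisedEpoch = -1L;
        }
    }

    /** Record a NameNode's claim to be active; returns false if the claim is stale. */
    public synchronized boolean promise(long epoch) throws IOException {
        if (epoch < promisedEpoch) {
            return false;
        }
        if (epoch > promisedEpoch) {
            // Write to a temp file, then rename, so a crash never leaves a
            // half-written promise. A real implementation would also fsync.
            Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
            Files.write(tmp, Long.toString(epoch).getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
            promisedEpoch = epoch;
        }
        return true;
    }

    /** Commands from a NameNode with an older epoch than the promise are ignored. */
    public synchronized boolean shouldAccept(long senderEpoch) {
        return senderEpoch >= promisedEpoch;
    }
}
```

With something like this, DN1 would persist its promise to NN2 before
crashing, reload it on restart, and reject the delete command from NN1
because NN1's epoch is older than the promised one.
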
> I am using CDH4.3.1 and am reading the DataNode code. I can't find where
> the DataNode saves the "promise" as part of its persistent state. I want
> to know whether scenario 3 is handled in CDH4.3.1. If it is handled,
> where is the code?
>
>
> Thanks,
>
> LiuLe
>



-- 
Todd Lipcon
Software Engineer, Cloudera
