On Sat, Aug 4, 2018 at 1:49 AM Ivan Kelly <iv...@apache.org> wrote:

> Hi folks,
>
> Recently I've been working to make the ledger metadata on the client
> immutable, with the goal of making client metadata management more
> understandable. The basic idea is that the metadata the client uses
> should reflect what is in zookeeper. So if a client wants to modify
> the metadata, it makes a copy, modifies, writes to zookeeper and then
> starts using it. This gets rid of all the conflictsWith and merge
> operations.
>
> There is only one case where this doesn't work. When we recover a
> ledger, we read the LAC from all bookies, then read forward entry by
> entry, rewriting the entry, until we reach the end. If a bookie fails
> during the rewrite, we replace it in the ensemble, but we don't write
> that back to zookeeper until the end.
>
> I was banging my head off this yesterday, trying to find a nice way to
> fit this in (there's loads of nasty ways), when I came to the
> conclusion that failure recovery during recovery isn't actually
> useful.
>


> Recovery operates on a few seconds of data (from the last LAC written
> to the end of the ledger, call this LLAC).


The data written during this window can be very large if the ledger's
traffic is high. That has been observed in production at Twitter. So
when we talk about "a few seconds of data", we can't assume the amount
of data is small. That means recovery can take longer than we expect.
If we don't handle failures during recovery, how can we ensure that we
have enough copies of the data during recovery?

I am not sure "make ledger metadata immutable" == "getting rid of
merging ledger metadata". I don't think these are the same thing.
Making ledger metadata immutable will make the code much clearer and
simpler. However, getting rid of merging ledger metadata is a different
thing. Making ledger metadata immutable will actually help make merging
ledger metadata on conflicts clearer.
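
For reference, the "copy, modify, write to zookeeper, then start using
it" flow from the quoted mail could look roughly like the sketch below.
It is Python for brevity and is not the BookKeeper client code:
`LedgerMeta`, `VersionedStore` and `replace_bookie` are hypothetical
names, and the in-memory store only stands in for a versioned write to
zookeeper.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LedgerMeta:
    """Immutable metadata snapshot (hypothetical, heavily simplified)."""
    state: str
    ensemble: Tuple[str, ...]


class VersionedStore:
    """In-memory stand-in for a store with versioned writes (assumption)."""

    def __init__(self, initial: LedgerMeta) -> None:
        self._value = initial
        self._version = 0

    def read(self) -> Tuple[LedgerMeta, int]:
        return self._value, self._version

    def compare_and_set(self, new_value: LedgerMeta, expected_version: int) -> bool:
        # The write succeeds only if nobody else wrote since we read.
        if expected_version != self._version:
            return False
        self._value = new_value
        self._version += 1
        return True


def replace_bookie(
    store: VersionedStore,
    current: LedgerMeta,
    version: int,
    failed: str,
    substitute: str,
) -> Optional[Tuple[LedgerMeta, int]]:
    """Copy the snapshot, modify the copy, write it, and only then use it."""
    updated = replace(
        current,
        ensemble=tuple(substitute if b == failed else b for b in current.ensemble),
    )
    if store.compare_and_set(updated, version):
        return updated, version + 1
    # Version conflict: the caller re-reads and decides how to resolve it.
    return None
```

The conflict branch is where the merge question shows up: the snapshot
itself never changes, so whatever resolution we choose works on two
well-defined values (ours and the one read back).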

In the ledger recovery case, it is actually okay to merge ledger
metadata. Let's assume the LAC is L at the time of recovery, and M is
the copy of the ledger metadata before recovery. The client that
attempts to recover the ledger will first set the ledger to IN_RECOVERY
before recovering it. So conflicts can only come from other clients
(there can be many) that attempt to recover the ledger, and from the
AutoRecovery daemon. The resolution of this conflict is simple:

When we fail to write the ledger metadata (version conflict), we read
the ledger metadata back. If the state has changed to CLOSED, it was
updated by another client that also recovered the ledger, so we discard
our ensembles. If the state has not changed, it was modified by
AutoRecovery. AutoRecovery doesn't add ensembles, so we can simply take
the ensembles before L from zookeeper and our ensembles after L, and
merge them together.
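
A rough sketch of that resolution rule, in Python for brevity (this is
not the BookKeeper client code). The `Metadata` type and the
`resolve_conflict` name are made up for illustration, and I am assuming
ensembles are keyed by the first entry id of the segment they serve, so
"before L" means keys <= L and "after L" means keys > L. Any state other
than CLOSED read back from zookeeper is treated as the AutoRecovery case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Metadata:
    """Simplified ledger metadata snapshot (hypothetical type)."""
    state: str                        # e.g. "IN_RECOVERY" or "CLOSED"
    version: int                      # version of the copy in zookeeper
    ensembles: Dict[int, List[str]]   # first entry id -> bookies (assumption)


def resolve_conflict(ours: Metadata, in_zk: Metadata, lac: int) -> Optional[Metadata]:
    """Decide what to write after a version conflict during recovery.

    Returns None when our ensembles should be discarded, otherwise the
    merged metadata to retry the write with.
    """
    if in_zk.state == "CLOSED":
        # Another recovering client already closed the ledger.
        return None

    # Otherwise assume AutoRecovery touched existing ensembles only:
    # keep zookeeper's view up to the LAC and ours after it.
    merged = {first: b for first, b in in_zk.ensembles.items() if first <= lac}
    merged.update({first: b for first, b in ours.ensembles.items() if first > lac})

    # Retry against the version we just read back.
    return Metadata(
        state=ours.state,
        version=in_zk.version,
        ensembles=dict(sorted(merged.items())),
    )
```

Because the merge only ever picks whole ensembles from one side or the
other, split at L, there is no per-field reconciliation to get wrong.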


> Take a ledger with 3:2:2
> configuration. If the writer crashes, and one bookie crashes, when we
> recover we currently replace that crashed bookie, so that if another
> bookie crashes the data is still available. But, and this is why I
> don't think it's useful, if another bookie crashes, the recovered data
> may be available, but everything before the LLAC in the ledger will
> not be available.

> IMO, this kind of thing should be handled by rereplication, not
> ensemble change (as an aside, we should have a hint system to trigger
> rereplication ASAP on this ledger).


> Anyhow, I'd like to hear other opinions on this before proceeding.
> Recovery with ensemble changes can work. Rather than modifying the
> ledger, create a shadow ensemble list, and give entries from that to
> the writers, but with the current entanglement in the client, this is
> a bit nasty.
>
> Cheers,
> Ivan
>
