On Sun, Aug 5, 2018 at 11:46 PM Sijie Guo <guosi...@gmail.com> wrote:
>
> On Sat, Aug 4, 2018 at 1:49 AM Ivan Kelly <iv...@apache.org> wrote:
>
>> Hi folks,
>>
>> Recently I've been working to make the ledger metadata on the client
>> immutable, with the goal of making client metadata management more
>> understandable. The basic idea is that the metadata the client uses
>> should reflect what is in zookeeper. So if a client wants to modify
>> the metadata, it makes a copy, modifies it, writes to zookeeper and
>> then starts using it. This gets rid of all the conflictsWith and merge
>> operations.
>>
>> There is only one case where this doesn't work. When we recover a
>> ledger, we read the LAC from all bookies, then read forward entry by
>> entry, rewriting the entry, until we reach the end. If a bookie fails
>> during the rewrite, we replace it in the ensemble, but we don't write
>> that back to zookeeper until the end.
>>
>> I was banging my head off this yesterday, trying to find a nice way to
>> fit this in (there's loads of nasty ways), when I came to the
>> conclusion that failure recovery during recovery isn't actually
>> useful.
>>
>
>> Recovery operates on a few seconds of data (from the last LAC written
>> to the end of the ledger, call this LLAC).
>
> The data in this window can be very large if the ledger's traffic is
> high. That has been observed in production at Twitter. So when we talk
> about "a few seconds of data", we can't assume the amount of data is
> small. That means recovery can take longer than we expect, so if we
> don't handle failures during recovery, how can we ensure we have
> enough copies of the data during recovery?
>
> I am not sure "make ledger metadata immutable" == "getting rid of
> merging ledger metadata", because I don't think these are the same
> thing. Making ledger metadata immutable will make the code much clearer
> and simpler. However, getting rid of merging ledger metadata is a
> different thing. Making ledger metadata immutable will help make
> merging ledger metadata on conflicts clearer.
>
> In the ledger recovery case, it is actually okay to merge ledger
> metadata. Let's assume the LAC is L at the time of recovery, and the
> ledger metadata M is the copy before recovery. The client that attempts
> to recover the ledger will first set the ledger to IN_RECOVERY before
> recovering it. So conflicts can only come from the other clients (there
> can be many) that attempt to recover the ledger and from the
> AutoRecovery daemon. The resolution of this conflict is simpler:
>
> When we fail to write ledger metadata (version conflict), read back the
> ledger metadata. If the state has changed to CLOSED, it was updated by
> another client that also recovered the ledger, so we discard our
> ensemble; if the state has been changed, that means it is modified by
> AutoRecovery, AutoRecovery doesn't add ensembles,

sorry for typo => "if the state has not been changed"

> so we can simply take the ensembles before L from zookeeper and our
> ensembles after L and merge them together.
>
>> Take a ledger with 3:2:2 configuration. If the writer crashes, and one
>> bookie crashes, when we recover we currently replace that crashed
>> bookie, so that if another bookie crashes the data is still available.
>> But, and this is why I don't think it's useful, if another bookie
>> crashes, the recovered data may be available, but everything before
>> the LLAC in the ledger will not be available.
>>
>> IMO, this kind of thing should be handled by rereplication, not
>> ensemble change (as an aside, we should have a hint system to trigger
>> rereplication ASAP on this ledger).
>>
>> Anyhow, I'd like to hear other opinions on this before proceeding.
>> Recovery with ensemble changes can work. Rather than modifying the
>> ledger, create a shadow ensemble list, and give entries from that to
>> the writers, but with the current entanglement in the client, this is
>> a bit nasty.
>>
>> Cheers,
>> Ivan
>>
>
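
To make the conflict resolution quoted above concrete, here is a minimal sketch in Java. It is not the BookKeeper client code: RecoveryMergeSketch, Metadata, State and resolve are hypothetical names invented for illustration, ensembles are modeled as a map from the first entry id they cover to a list of bookie names, and the handling of the boundary at L (zookeeper wins for ensembles starting at or before L, ours win for ensembles starting after L) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch; none of these types are the BookKeeper client API.
final class RecoveryMergeSketch {
    enum State { OPEN, IN_RECOVERY, CLOSED }

    /** Immutable metadata snapshot; ensembles are keyed by the first entry id they cover. */
    static final class Metadata {
        final State state;
        final NavigableMap<Long, List<String>> ensembles;

        Metadata(State state, NavigableMap<Long, List<String>> ensembles) {
            this.state = state;
            // Defensive deep copy so later changes by the caller cannot leak in.
            TreeMap<Long, List<String>> copy = new TreeMap<>();
            ensembles.forEach((k, v) ->
                copy.put(k, Collections.unmodifiableList(new ArrayList<>(v))));
            this.ensembles = Collections.unmodifiableNavigableMap(copy);
        }
    }

    /**
     * Called after a versioned metadata write failed and the metadata was read back.
     * Returns empty if another client already closed the ledger (our ensembles are
     * discarded); otherwise returns a new snapshot to retry the write with.
     */
    static Optional<Metadata> resolve(Metadata fromZk, Metadata ours, long lac) {
        if (fromZk.state == State.CLOSED) {
            return Optional.empty();
        }
        if (fromZk.state != State.IN_RECOVERY) {
            throw new IllegalStateException("unexpected state: " + fromZk.state);
        }
        // State unchanged: assume only AutoRecovery touched it, and it adds no ensembles.
        // Assumed boundary rule: zookeeper's ensembles starting at or before L,
        // plus our ensembles starting after L.
        TreeMap<Long, List<String>> merged = new TreeMap<>(fromZk.ensembles.headMap(lac, true));
        merged.putAll(ours.ensembles.tailMap(lac, false));
        return Optional.of(new Metadata(State.IN_RECOVERY, merged));
    }

    public static void main(String[] args) {
        TreeMap<Long, List<String>> zk = new TreeMap<>();
        zk.put(0L, Arrays.asList("b1", "b4", "b3"));      // AutoRecovery replaced b2 with b4
        TreeMap<Long, List<String>> mine = new TreeMap<>();
        mine.put(0L, Arrays.asList("b1", "b2", "b3"));    // stale copy M from before recovery
        mine.put(101L, Arrays.asList("b1", "b5", "b3"));  // ensemble change made during recovery
        Optional<Metadata> merged = resolve(
            new Metadata(State.IN_RECOVERY, zk), new Metadata(State.IN_RECOVERY, mine), 100L);
        System.out.println(merged.map(m -> m.ensembles).orElse(null));
    }
}
```

Because each Metadata is a fresh immutable snapshot, the merge never mutates the copy the client is currently using; the client would only switch to the merged snapshot once the retried write succeeds.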