Hi folks,

Recently I've been working to make the ledger metadata on the client immutable, with the goal of making client metadata management more understandable. The basic idea is that the metadata the client uses should reflect what is in ZooKeeper. So if a client wants to modify the metadata, it makes a copy, modifies the copy, writes it to ZooKeeper, and only then starts using it. This gets rid of all the conflictsWith and merge operations.
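
To make the copy, modify, write, then use flow concrete, here is a minimal sketch. It is not the BookKeeper client code: `LedgerMetadata` is reduced to two fields, and `FakeMetadataStore`, `VersionMismatch` and `replace_bookie` are hypothetical names, with an in-memory dict standing in for ZooKeeper's versioned writes.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class LedgerMetadata:
    """Immutable snapshot of ledger metadata (hypothetical, heavily simplified)."""
    ensemble: Tuple[str, ...]  # bookie addresses
    version: int               # store version this snapshot was read at


class VersionMismatch(Exception):
    """Raised when a write is based on a stale snapshot."""


class FakeMetadataStore:
    """In-memory stand-in for ZooKeeper (assumption: versioned compare-and-set)."""

    def __init__(self) -> None:
        self._data: Dict[int, LedgerMetadata] = {}

    def create(self, ledger_id: int, ensemble: Iterable[str]) -> LedgerMetadata:
        md = LedgerMetadata(tuple(ensemble), version=0)
        self._data[ledger_id] = md
        return md

    def read(self, ledger_id: int) -> LedgerMetadata:
        return self._data[ledger_id]

    def write(self, ledger_id: int, new_md: LedgerMetadata) -> LedgerMetadata:
        current = self._data[ledger_id]
        # Reject the write if the caller's copy was derived from an old version.
        if current.version != new_md.version:
            raise VersionMismatch(
                f"expected version {current.version}, got {new_md.version}"
            )
        stored = replace(new_md, version=current.version + 1)
        self._data[ledger_id] = stored
        return stored


def replace_bookie(store: FakeMetadataStore, ledger_id: int,
                   current: LedgerMetadata, failed: str,
                   replacement: str) -> LedgerMetadata:
    """Copy the metadata, modify the copy, write it, and return what was stored."""
    ensemble = tuple(replacement if b == failed else b for b in current.ensemble)
    candidate = replace(current, ensemble=ensemble)  # `current` is left untouched
    # The caller only starts using the new metadata once the write has succeeded.
    return store.write(ledger_id, candidate)
```

Since a stale snapshot is simply rejected by the store, the client never holds metadata that diverges from what was written, so there is nothing to merge on the client side.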

There is only one case where this doesn't work. When we recover a ledger, we read the LAC from all bookies, then read forward entry by entry, rewriting each entry, until we reach the end. If a bookie fails during the rewrite, we replace it in the ensemble, but we don't write that back to ZooKeeper until the end.

I was banging my head off this yesterday, trying to find a nice way to fit this in (there are loads of nasty ways), when I came to the conclusion that failure recovery during recovery isn't actually useful. Recovery operates on a few seconds of data (from the last LAC written to the end of the ledger, call this LLAC). Take a ledger with a 3:2:2 configuration. If the writer crashes and one bookie crashes, then when we recover we currently replace the crashed bookie, so that if another bookie crashes the data is still available. But, and this is why I don't think it's useful, if another bookie crashes, the recovered data may be available, but everything before the LLAC in the ledger will not be. IMO, this kind of thing should be handled by rereplication, not ensemble change (as an aside, we should have a hint system to trigger rereplication ASAP on this ledger).

Anyhow, I'd like to hear other opinions on this before proceeding. Recovery with ensemble changes can work: rather than modifying the ledger, create a shadow ensemble list and give entries from that to the writers. But with the current entanglement in the client, this is a bit nasty.

Cheers,
Ivan