Hi Yang,

> Besides the auditor, I think the external operator (whether a human
> operator or an automation program) also cares about the "draining" state of
> a bookie.

This isn't a question of the internal model, but of how it is exposed.
API-wise, it would not be a problem to expose draining as a bookie
HTTP endpoint, but ultimately that should call out to an endpoint in
the auditor. "draining" isn't so much a state as an operation that
should only take place in the read-only state, and it should be
performed by the auditor. The auditor sees and records that there is a
draining operation active on the bookie, and it should oversee that
data is copied off of the bookie.

> If the data is expected to be moved off the bookie by auto-recovery, the
> bookie has to be set as "draining" to kick off the data migration, and
> there should also be APIs to mark a bookie as "draining" and to check if
> the bookie is in the "draining" state. Although "draining" is a special
> case of "readonly", would it be more clear to make it another possible
> value of `BookieMode` and provide similar APIs as for the readonly state?

No, I think bookie mode should be limited to read_write and read_only.
read-only is of interest to both the client and the bookie. The client
needs to know which bookies are read-only so that it does not select
them for writes. The bookie needs to know that it is read-only so that
it doesn't accept new writes. The bookie doesn't care if there's a
draining operation happening; it will just see read traffic.
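
To make that concrete, a rough sketch of the point (illustrative only,
not the actual BookieMode definition in the codebase):

  public enum BookieMode {
      READ_WRITE,
      READ_ONLY
      // no DRAINING value: draining is an auditor-side operation,
      // not a mode the bookie itself needs to track
  }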

> Or do you have any suggestions on the management of the draining state and
> relative APIs?

I would add an /api/v1/bookie/drain endpoint.
POST to this endpoint calls out to the auditor to create a drain
operation for that bookie and returns an ID.
GET /api/v1/bookie/drain/<ID> returns the status of the drain
operation (calling out to the auditor).
DELETE /api/v1/bookie/drain/<ID> cancels the drain operation.
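
In code, the bookie-side endpoints could delegate to something like
the following. DrainService, DrainStatus and the method names are all
hypothetical, not existing BookKeeper interfaces:

  public interface DrainService {
      // POST /api/v1/bookie/drain: ask the auditor to start a drain
      // operation for this bookie; returns the operation ID.
      String startDrain(String bookieId);

      // GET /api/v1/bookie/drain/<ID>: ask the auditor for the status
      // of the drain operation.
      DrainStatus getDrainStatus(String drainId);

      // DELETE /api/v1/bookie/drain/<ID>: ask the auditor to cancel
      // the drain operation.
      void cancelDrain(String drainId);
  }

  enum DrainStatus { PENDING, IN_PROGRESS, COMPLETE, CANCELLED }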

One open question is how to decide that a drain is done. The existing
autorecovery code does some really horrible things using ZooKeeper as
a queue, and then modifies the ledger metadata at the end to remove
the offending bookie. Using ZooKeeper as a queue is bad, and
replication workers end up racing for locks. But it can tell when a
bookie is empty, because the bookie no longer appears in the metadata
of any ledger. Thinking about it more, even if you did use the
current underreplication stuff, it would need to be modified to allow
copying from a live bookie, because I think it doesn't currently allow
data from live bookies to be rereplicated.
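
For what it's worth, the "drain is done" check that model implies is
just a scan of the ledger metadata, something like this (the Map shape
is illustrative, not the real metadata API):

  import java.util.List;
  import java.util.Map;

  class DrainCheckSketch {
      // A bookie is fully drained once it no longer appears in any
      // ensemble of any ledger's metadata.
      static boolean isDrained(String bookieId,
                               Iterable<Map<Long, List<String>>> allLedgerEnsembles) {
          for (Map<Long, List<String>> ensembles : allLedgerEnsembles) {
              for (List<String> ensemble : ensembles.values()) {
                  if (ensemble.contains(bookieId)) {
                      return false; // still referenced by this ledger
                  }
              }
          }
          return true;
      }
  }
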
I think a better solution is to use the same data integrity mechanism
outlined earlier in the thread.

Let's say you have a ledger L1, and you want to decommission bookie B4
(there are 3 other bookies). The metadata looks like:

  {ensembles: [{0: B1, B2, B4}]}

When you want to decommission B4, the auditor updates the metadata to:

  {ensembles: [{0: B1, B2, B4}], replacements: [{ensemble: 0, from: B4, to: B3}]}

Data integrity kicks in on B3 (it can even be prompted to start by the
auditor). It sees that it should have the data from ensemble 0, so it
does the copy, preferring B4 for the reads. Once the copy is complete,
B3 updates the metadata to:

  {ensembles: [{0: B1, B2, B3}]}

So decommissioning costs 2 ZK writes per ledger. To cancel the
decommission, the replacements field is deleted from the metadata.
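
As a sketch of those two writes (the LedgerMetadata and Replacement
shapes below are made up for illustration and do not match BookKeeper's
actual metadata classes):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  record Replacement(long ensemble, String from, String to) {}

  record LedgerMetadata(Map<Long, List<String>> ensembles,
                        List<Replacement> replacements) {}

  class DecommissionSketch {
      // ZK write 1: the auditor records the intended replacement,
      // e.g. {ensemble: 0, from: B4, to: B3}, leaving the ensembles untouched.
      static LedgerMetadata markForDecommission(LedgerMetadata md, Replacement r) {
          List<Replacement> replacements = new ArrayList<>(md.replacements());
          replacements.add(r);
          return new LedgerMetadata(md.ensembles(), replacements);
      }

      // ZK write 2: once the replacement bookie has copied the data, it
      // swaps itself into the ensemble and drops the replacement entry.
      static LedgerMetadata completeReplacement(LedgerMetadata md, Replacement r) {
          List<String> ensemble = new ArrayList<>();
          for (String b : md.ensembles().get(r.ensemble())) {
              ensemble.add(b.equals(r.from()) ? r.to() : b);
          }
          Map<Long, List<String>> ensembles = new HashMap<>(md.ensembles());
          ensembles.put(r.ensemble(), ensemble);
          return new LedgerMetadata(ensembles, List.of());
      }
  }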

-Ivan
