> One thing that running the draining on the local bookie doesn't
> cover is that, if the bookie is down and unrecoverable, it will never
> be drained, so the data on the bookie would remain underreplicated.

> Perhaps this is a different case, and needs to be handled
> differently, but it could also be handled by a mechanism similar to
> the data integrity one.

I think about the draining case differently than autorecovery because in
the draining case, we're over-replicating the data before we remove the
bookie. In the case where a bookie node is no longer available, it is a
straightforward use case for the autorecovery process. As you mentioned,
the draining process will require special logic to allow for
over-replication. Do you think it will be different enough from the
autorecovery process to put it on the bookie being drained, or should it
still reside within the autorecovery process?

> Another aspect of this is cost. Without tiered storage, the variable
> that decides the number of nodes you need is the
> (throughput)*(retention). So, assuming that throughput doesn't change
> much, or is cyclical, there'll be very few autoscaling decisions taken
> (nodes need to stick around for retention).
> If you use tiered storage, the number of nodes needed is purely based
> on throughput. You'll have fewer bookies, but autoscaling will need to
> respond more frequently to variations in throughput.

Yes, that definitely makes sense. Tiered storage simplifies the equation in
many ways.
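
To make that concrete with back-of-the-envelope numbers (purely
illustrative): at a steady 100 MB/s of writes with 7-day retention, the
cluster has to hold roughly 100 MB/s * 604,800 s ~= 60 TB before
replication, so storage capacity drives the bookie count, and that count
barely moves unless retention or steady-state throughput changes. With
tiered storage, the bookies only have to absorb the 100 MB/s itself
(plus a short buffer before offload), so the count tracks throughput and
autoscaling has to chase throughput variation instead.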

> No, I think bookie mode should be limited to read_write and read_only.

I agree with this, too. Further, BP-4 mentions states for "draining",
"draining failed", and "drained". It would make sense to expose whether
draining is in progress or complete via an API. This should be a simple
calculation of whether any ledgers are still stored on the bookie being
drained. However, I think "draining failed" is likely hard to quantify and
is something that an operator would need to decide. Are there error cases
that we know will lead to "draining failed"? Since we're draining without
ever going into an underreplicated state, I can't think of a reason why
the draining would truly fail--it just might take a while to completely
replicate the data.
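
Roughly, the completeness check I have in mind would look something like
this (a sketch only; the types are from the client API as I remember
them, and how we enumerate all ledger metadata is hand-waved):

    import java.util.List;
    import org.apache.bookkeeper.client.api.LedgerMetadata;
    import org.apache.bookkeeper.net.BookieId;

    // A drain is "complete" once no ledger's metadata still references
    // the bookie being drained.
    boolean drainComplete(BookieId drained,
                          Iterable<LedgerMetadata> allLedgers) {
        for (LedgerMetadata md : allLedgers) {
            for (List<BookieId> ensemble : md.getAllEnsembles().values()) {
                if (ensemble.contains(drained)) {
                    return false; // data still lives on the drained bookie
                }
            }
        }
        return true;
    }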

- Michael

On Wed, Sep 8, 2021 at 8:52 AM Ivan Kelly <iv...@apache.org> wrote:

> Hi Yang,
>
> > Besides the auditor, I think the external operator (whether a human
> > operator or an automation program) also cares about the "draining" state
> of
> > a bookie.
>
> This isn't a question of the internal model, but of how it is exposed.
> API-wise, it would not be a problem to expose draining as a bookie
> http endpoint, but ultimately that should call out to an endpoint in
> the auditor. "draining" isn't so much a state as an operation: it
> should only take place in the read-only state, and be performed by
> the auditor. The auditor sees and records that there is a draining
> operation active on the bookie, and it should oversee that data is
> copied off of the bookie.
>
> > If the data is expected to be moved off the bookie by auto-recovery, the
> > bookie has to be set as "draining" to kick off the data migration, and
> > there should also be APIs to mark a bookie as "draining" and to check if
> > the bookie is in the "draining" state. Although "draining" is a special
> > case of "readonly", would it be more clear to make it another possible
> > value of `BookieMode` and provide similar APIs as for the readonly state?
>
> No, I think bookie mode should be limited to read_write and read_only.
> read-only is of interest to both the client and the bookie. The client
> needs to know which bookies are read-only so that it does not select
> them for writes. The bookie needs to know that it is read-only so that
> it doesn't accept new writes. The bookie doesn't care if there's a
> draining operation happening. It will just see read traffic.
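>
> To put that in code terms (just a sketch of the shape, not actual
> BookKeeper source):
>
>     // The only modes the client/bookie protocol needs to know about;
>     // draining is an operation layered on top of read_only, not a
>     // third mode.
>     enum BookieMode {
>         READ_WRITE, // eligible for new ensembles and entry writes
>         READ_ONLY   // serves reads only
>     }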
>
> > Or do you have any suggestions on the management of the draining state
> and
> > relative APIs?
>
> I would add a /api/v1/bookie/drain endpoint.
> POST to this endpoint calls out to the auditor to create a drain
> operation for that bookie and returns an ID.
> GET /api/v1/bookie/drain/<ID> returns the status of the drain
> operation (calling out to the auditor).
> DELETE /api/v1/bookie/drain/<ID> cancels the drain operation.
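>
> A minimal sketch of driving that from a client (host, port, and the
> operation ID coming back as the plain response body are assumptions
> for illustration):
>
>     import java.net.URI;
>     import java.net.http.HttpClient;
>     import java.net.http.HttpRequest;
>     import java.net.http.HttpResponse;
>
>     HttpClient http = HttpClient.newHttpClient();
>
>     // Start a drain; assume the response body is the operation ID.
>     HttpRequest start = HttpRequest.newBuilder()
>             .uri(URI.create("http://bookie-1:8080/api/v1/bookie/drain"))
>             .POST(HttpRequest.BodyPublishers.noBody())
>             .build();
>     String id = http.send(start, HttpResponse.BodyHandlers.ofString()).body();
>
>     // Poll the drain's status; the bookie consults the auditor.
>     HttpRequest status = HttpRequest.newBuilder()
>             .uri(URI.create("http://bookie-1:8080/api/v1/bookie/drain/" + id))
>             .GET()
>             .build();
>     System.out.println(http.send(status, HttpResponse.BodyHandlers.ofString()).body());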
>
> One open question is how to decide when a drain is done. The existing
> autorecovery code does some really horrible things using zookeeper as
> a queue, and then modifies the ledger metadata at the end to remove
> the offending bookie. Using zookeeper as a queue is bad, and
> replication workers end up racing for locks, but it can decide when
> a bookie is empty because the bookie no longer exists in the metadata
> for any ledger. Thinking about it more, even if you did use the
> current underreplication stuff, it would need to be modified to allow
> copying from a live bookie, because I think it doesn't currently allow
> data from live bookies to be rereplicated.
> I think a better solution is to use the same mechanism as data
> integrity outlined earlier in the thread.
> Let's say you have a ledger L1, and you want to decommission bookie B4
> (there are 3 other bookies).
> Metadata looks like {ensembles: [{0: B1, B2, B4}]}
> When you want to decommission B4, the auditor updates the metadata to
> {ensembles: [{0: B1, B2, B4}], replacements: [{ensemble: 0, from: B4,
> to: B3}]}
> Data integrity kicks in on B3 (it can even be prompted to start by the
> auditor). It sees that it should have the data from ensemble 0, so
> does the copy, preferring B4 for the reads.
> Once the copy is complete, B3 updates the metadata to {ensembles: [{0:
> B1, B2, B3}]}.
> So decommissioning costs 2 ZK writes per ledger.
> To cancel the decommission, the replacements field is deleted from the
> metadata.
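>
> As a toy model of those two writes (illustrative Java, not the real
> LedgerMetadata classes; the CAS/versioning on the ZK writes is
> omitted):
>
>     import java.util.*;
>
>     // Toy metadata: {ensembles: [{0: B1, B2, B4}]}.
>     record Replacement(long ensembleKey, String from, String to) {}
>     NavigableMap<Long, List<String>> ensembles = new TreeMap<>();
>     ensembles.put(0L, new ArrayList<>(List.of("B1", "B2", "B4")));
>
>     // ZK write 1 (auditor): record the planned replacement.
>     List<Replacement> replacements =
>             List.of(new Replacement(0L, "B4", "B3"));
>
>     // ... data integrity on B3 copies ensemble 0, preferring B4 ...
>
>     // ZK write 2 (B3): swap B4 for B3, drop the replacements field.
>     for (Replacement r : replacements) {
>         List<String> e = ensembles.get(r.ensembleKey());
>         e.set(e.indexOf(r.from()), r.to());
>     }
>     replacements = List.of();
>     System.out.println(ensembles); // prints {0=[B1, B2, B3]}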
>
> -Ivan
>
