Re: Improve the process of removing bookies from a cluster

Michael Marshall Tue, 07 Sep 2021 20:38:36 -0700

Hi Ivan and Yang,

+1. I am very happy to see this initiative. It will be a fantastic
improvement, and I am happy to help contribute, if help is needed.

I agree that the first step is adding an endpoint to mark a bookie as read
only in a persistent way, and that the "draining" state only really needs
to be known by the process responsible for evacuating the bookie's data.
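
To make the request side concrete, here is a minimal sketch of how a client
might call such an endpoint. It assumes the path and payload Ivan proposes in
the quoted message below (/bookie/state/readonly with '{"readonly": true}');
that endpoint is only a proposal in this thread, and the host, port, and the
helper name build_readonly_request are hypothetical. The sketch only builds
the request and does not send it.

```python
import json
import urllib.request


def build_readonly_request(base_url: str, readonly: bool = True) -> urllib.request.Request:
    """Build (but do not send) a PUT request for the proposed read-only endpoint.

    The path and payload shape follow the proposal in this thread; they are
    not an existing BookKeeper API.
    """
    body = json.dumps({"readonly": readonly}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/bookie/state/readonly",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical address of a bookie's HTTP server.
req = build_readonly_request("http://localhost:8000")

# Sending it would look like this (left commented out on purpose):
# with urllib.request.urlopen(req, timeout=5) as resp:
#     assert resp.status == 200  # only then should the state count as persisted
```

The caller would treat anything other than an OK response as "state not
persisted" and retry or alert before starting to drain.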

I am not very familiar with bookkeeper and auditor history (so please let
me know if this understanding doesn't work), but it seems to me that the
process responsible for draining the bookie could be local to the bookie
itself to limit network hops. Running the auditor as its own process, like
a deployment in Kubernetes, can be helpful. However, I think this case is
different. We are focused on a controlled and graceful shutdown where the
bookie hosting the data to be replicated is available. Further, the bookie
to be removed will be read only, so it won't be handling write traffic and
it'll be de-prioritized for reads by default anyway, which means it would
likely be more idle than normal. Is there a good reason to decouple the
process of draining a bookie from the target bookie itself?

> For decommission, our thinking is that once we
> have tiered storage at the bookie level, the
> decommission story becomes a lot easier.

Do you see the bookkeeper tiered storage being used in every case? If not,
it could be valuable to make the decommissioning story independent of
tiered storage to let all users leverage this feature. Or, are you thinking
that the options would be to rely on either auto-recovery or tiered storage
to drain the bookie's ledgers? As I mentioned above, it seems like a local
"draining" process could be valuable even when local auto-recovery is
turned off for the bookie.

- Michael

On Tue, Sep 7, 2021 at 8:05 AM Ivan Kelly <iv...@apache.org> wrote:

> Hi Yang,
>
> > Autoscaling is exactly one motivation for me to bring this topic up. I
> > understand that the auto-recovery is not perfect at the moment, but it's an
> > important component that maintains the core invariants of a bookkeeper
> > cluster, so I think we may keep improving it until we find a better
> > replacement.
>
> Internally we have replaced auto recovery with another mechanism that
> checks that the bookie has all the data it says it has. We have plans to
> push this upstream in the next month or two. A side effect of the change
> is that it allows you to run without journal safely. However, it doesn't
> cover the decommission use case. For decommission, our thinking is that
> once we have tiered storage at the bookie level, the decommission story
> becomes a lot easier. Basically, you switch to read-only and wait for
> tiered storage to clear it out, even bumping the bookie's ledgers in
> priority for offloading to tiered storage. We're still early in this
> process (utilization metrics have to come first).
>
> > I'm thinking maybe we can put the "draining" state as a special member in
> > the properties of `BookieServiceInfo
> > <https://github.com/apache/bookkeeper/blob/97818f5123999396e66f5246420d3c7e3d25f53d/bookkeeper-server/src/main/java/org/apache/bookkeeper/discover/BookieServiceInfo.java#L43>`,
> > and let the auditor check the properties of readonly bookies to see if a
> > bookie needs to be drained and seen as unavailable.
>
> "draining" state is not something that anyone but the auditor needs to
> care about. It's not a state attribute of the bookie, but instead an
> external service's opinion of what should be happening with the bookie.
> As such, it doesn't belong in the bookie service info itself.
>
> It should go in some metadata that the auditor is maintaining about
> the bookie, so that when the auditor restarts, it can see that it
> should prioritize moving data from that bookie.
>
> > The bookie state API
> > <https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/server/http/service/BookieStateService.java>
> > might also be enhanced to support updating and persisting the state of a
> > bookie dynamically. And a new API might need to be added to check if all
> > ledgers have been moved off a "draining" bookie. Do you think these
> > changes make sense?
>
> This API is a bit of a hodge podge of different kinds of state.
> "shutting down", "running" and "availableForHighPriority" are
> transient states. We need read-only to be persistent if it is
> requested by an external entity. To persist the readonly state, there
> are multiple options. We can persist it to zookeeper somehow, or we
> can persist it as a file on the bookie's disk. But the API linked isn't
> really suited in any case.
> What we need is a /bookie/state/readonly endpoint, where we PUT a
> payload like '{"readonly": true}'. When this gets an OK response, that
> state should be persisted, so any reboot will keep the bookie in the
> correct state.
>
> -Ivan
>
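
For the "persist it as a file on the bookie's disk" option mentioned above,
a rough sketch of what the write and the restart-time read could look like
is below. This is only an illustration under my own assumptions: the file
name, the JSON layout (reusing the '{"readonly": true}' payload shape), and
the function names are hypothetical, and it uses a write-to-temp-then-rename
pattern so a crash mid-write cannot leave a half-written state file.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical file name; nothing in BookKeeper defines this.
STATE_FILE_NAME = "readonly.state.json"


def persist_readonly(state_dir: Path, readonly: bool) -> None:
    """Atomically record the requested read-only state on local disk."""
    state_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the rename stays on one
    # filesystem and replaces the old state in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, prefix=".readonly-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"readonly": readonly}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush file contents before the rename
        os.replace(tmp_path, state_dir / STATE_FILE_NAME)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Note: a production version would likely also fsync the directory.


def load_readonly(state_dir: Path) -> bool:
    """Read the persisted state at startup; default to writable if absent."""
    try:
        with open(state_dir / STATE_FILE_NAME) as f:
            return bool(json.load(f).get("readonly", False))
    except FileNotFoundError:
        return False
```

The handler for the PUT would call persist_readonly before returning OK, and
the bookie would call load_readonly on boot, which matches the requirement
that any reboot keeps the bookie in the requested state.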
