Hi Yang,

> Autoscaling is exactly one motivation for me to bring this topic up. I
> understand that the auto-recovery is not perfect at the moment, but it's an
> important component that maintains the core invariants of a bookkeeper
> cluster, so I think we may keep improving it until we find a better
> replacement.

Internally we have replaced auto recovery with another mechanism that
checks that the bookie has all the data it says it has. We have plans
to push this upstream in the next month or two. A side effect of the
change is that it allows you to run without the journal safely.

However, it doesn't cover the decommission use case. For decommission,
our thinking is that once we have tiered storage at the bookie level,
the decommission story becomes a lot easier. Basically, you switch the
bookie to read-only and wait for tiered storage to clear it out, even
bumping that bookie's ledgers in priority for offloading to tiered
storage. We're still early in this process (utilization metrics have
to come first).

> I'm thinking maybe we can put the "draining" state as a special member in
> the properties of `BookieServiceInfo
> <https://github.com/apache/bookkeeper/blob/97818f5123999396e66f5246420d3c7e3d25f53d/bookkeeper-server/src/main/java/org/apache/bookkeeper/discover/BookieServiceInfo.java#L43>`,
> and let the auditor check the properties of readonly bookies to see if a
> bookie need to be drained and seen as unavailable.

"draining" state is not something that anyone but the auditor needs to
care about. It's not a state attribute of the bookie, but instead an
external service's opinion of what should be happening with the
bookie. As such, it doesn't belong in the bookie service info itself.

It should go in some metadata that the auditor maintains about the
bookie, so that when the auditor restarts, it can see that it should
prioritize moving data from that bookie.
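
To make that a bit more concrete, here is a rough sketch of what I mean
by auditor-owned metadata. It is Python only for brevity, and
everything in it is hypothetical: the DrainingRegistry class, its
method names, and the local JSON file, which stands in for whatever
metadata store the auditor would really use (a znode, for instance).
None of this exists in BookKeeper today.

```python
import json
import os
import tempfile
from pathlib import Path


class DrainingRegistry:
    """Auditor-owned record of which bookies should be drained.

    Hypothetical: a local JSON file stands in for the auditor's real
    metadata store (for example a znode in ZooKeeper).
    """

    def __init__(self, path):
        self.path = Path(path)
        self.draining = set()
        if self.path.exists():
            # On auditor restart, recover the drain requests made earlier.
            self.draining = set(json.loads(self.path.read_text()))

    def _save(self):
        # Write to a temp file and rename it into place, so a crash
        # never leaves a half-written record behind.
        fd, tmp = tempfile.mkstemp(dir=str(self.path.parent))
        with os.fdopen(fd, "w") as f:
            json.dump(sorted(self.draining), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def mark_draining(self, bookie_id):
        self.draining.add(bookie_id)
        self._save()

    def drained(self, bookie_id):
        # Called once no ledger references the bookie any more.
        self.draining.discard(bookie_id)
        self._save()

    def replication_order(self, bookies):
        # Draining bookies first; otherwise keep the caller's order
        # (sorted() is stable, and False sorts before True).
        return sorted(bookies, key=lambda b: b not in self.draining)
```

The point is only that the drain request lives with the auditor and
survives an auditor restart; the bookie itself never needs to know
about it.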

> The bookie state API
> <https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/server/http/service/BookieStateService.java>
> might
> also be enhanced to support updating and persisting the state of a bookie
> dynamically. And a new API might need to be added to check if all ledgers
> have been moved off a "draining" bookie. Do you think these changes make
> sense?

This API is a bit of a hodgepodge of different kinds of state.
"shutting down", "running" and "availableForHighPriority" are
transient states. We need read-only to be persistent if it is
requested by an external entity. To persist the read-only state, there
are multiple options. We can persist it to ZooKeeper somehow, or we
can persist it as a file on the bookie's disk. But the API linked isn't
really suited in either case.

What we need is a /bookie/state/readonly endpoint, to which we PUT a
payload like '{"readonly": true}'. When this gets an OK response, that
state should already be persisted, so any reboot will keep the bookie
in the correct state.
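
As a rough sketch of the file-on-disk option (Python only for brevity;
the file name, the function names and the status codes chosen are all
my assumptions, not an existing BookKeeper API), the handler would
persist first and only then answer OK:

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical file name; the real location would be up to the bookie.
STATE_FILE_NAME = "readonly.state"


def persist_readonly(state_dir, readonly):
    """Durably record the requested read-only flag before acknowledging."""
    target = Path(state_dir) / STATE_FILE_NAME
    fd, tmp = tempfile.mkstemp(dir=str(state_dir))
    with os.fdopen(fd, "w") as f:
        json.dump({"readonly": readonly}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, target)  # atomic rename: old or new file, never a mix


def load_readonly(state_dir):
    """Read the flag at startup; default to writable if it was never set."""
    target = Path(state_dir) / STATE_FILE_NAME
    if not target.exists():
        return False
    return bool(json.loads(target.read_text())["readonly"])


def handle_put_readonly(state_dir, body):
    """Handle PUT /bookie/state/readonly; returns an HTTP status code."""
    try:
        readonly = json.loads(body)["readonly"]
    except (ValueError, KeyError, TypeError):
        return 400  # malformed JSON or missing "readonly" field
    if not isinstance(readonly, bool):
        return 400
    try:
        persist_readonly(state_dir, readonly)
    except OSError:
        return 500  # not persisted, so do not claim success
    return 200
```

On startup the bookie would call something like load_readonly() before
accepting writes. A fuller version would probably also fsync the
containing directory after the rename.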

-Ivan
