Hi Yang,

> Autoscaling is exactly one motivation for me to bring this topic up. I
> understand that the auto-recovery is not perfect at the moment, but it's an
> important component that maintains the core invariants of a bookkeeper
> cluster, so I think we may keep improving it until we find a better
> replacement.

Internally we have replaced auto recovery with another mechanism that checks
that the bookie has all the data it says it has. We have plans to push this
upstream in the next month or two. A side effect of the change is that it
allows you to run without a journal safely.

However, it doesn't cover the decommission use case. For decommission, our
thinking is that once we have tiered storage at the bookie level, the
decommission story becomes a lot easier. Basically, you switch the bookie to
read-only and wait for tiered storage to clear it out, possibly even bumping
that bookie's ledgers in priority for offloading to tiered storage. We're
still early in this process (utilization metrics have to come first).

> I'm thinking maybe we can put the "draining" state as a special member in
> the properties of `BookieServiceInfo
> <https://github.com/apache/bookkeeper/blob/97818f5123999396e66f5246420d3c7e3d25f53d/bookkeeper-server/src/main/java/org/apache/bookkeeper/discover/BookieServiceInfo.java#L43>`,
> and let the auditor check the properties of readonly bookies to see if a
> bookie need to be drained and seen as unavailable.

The "draining" state is not something that anyone but the auditor needs to
care about. It's not a state attribute of the bookie, but instead an external
service's opinion of what should be happening with the bookie. As such, it
doesn't belong in the bookie service info itself. It should go in some
metadata that the auditor maintains about the bookie, so that when the
auditor restarts, it can see that it should prioritize moving data off that
bookie.

> The bookie state API
> <https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/server/http/service/BookieStateService.java>
> might
> also be enhanced to support updating and persisting the state of a bookie
> dynamically. And a new API might need to be added to check if all ledgers
> have been moved off a "draining" bookie. Do you think these changes make
> sense?

This API is a bit of a hodgepodge of different kinds of state. "shutting
down", "running" and "availableForHighPriority" are transient states. We need
read-only to be persistent if it is requested by an external entity.

To persist the readonly state, there are multiple options. We can persist it
to zookeeper somehow, or we can persist it as a file on the bookie's disk. But
the API linked isn't really suited to either. What we need is a
/bookie/state/readonly endpoint, to which we PUT a payload like
'{"readonly": true}'. When this gets an OK response, that state should have
been persisted, so any reboot will keep the bookie in the correct state.

-Ivan
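
P.S. To make the persist-before-OK behaviour of the proposed
/bookie/state/readonly endpoint concrete, here is a minimal sketch in Python.
It is not BookKeeper code: it assumes the "file on the bookie's disk" option,
and the file name and the function names (load_readonly, persist_readonly,
handle_put_readonly) are all made up for illustration. The point is only that
the handler returns 200 after the flag is durably written, and that startup
reads the same file back.

```python
import json
import os
import tempfile

STATE_FILE_NAME = "bookie-state.json"  # hypothetical file name, not a BookKeeper convention


def load_readonly(state_dir):
    """Return the persisted readonly flag; False if nothing was ever persisted."""
    path = os.path.join(state_dir, STATE_FILE_NAME)
    try:
        with open(path, "r", encoding="utf-8") as f:
            return bool(json.load(f).get("readonly", False))
    except FileNotFoundError:
        return False


def persist_readonly(state_dir, readonly):
    """Write the flag atomically: temp file, fsync, then rename over the old file."""
    path = os.path.join(state_dir, STATE_FILE_NAME)
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, prefix=".state-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"readonly": readonly}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def handle_put_readonly(state_dir, body):
    """Handle the body of PUT /bookie/state/readonly; return (status, response_body)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict) or not isinstance(payload.get("readonly"), bool):
        return 400, json.dumps({"error": "expected {\"readonly\": true|false}"})
    # Persist first; only acknowledge once the state would survive a reboot.
    persist_readonly(state_dir, payload["readonly"])
    return 200, json.dumps({"readonly": payload["readonly"]})
```

On startup the bookie would call something like load_readonly(state_dir) and
come up read-only if it returns True. Persisting to zookeeper instead would
change persist_readonly and load_readonly but not the handler's contract.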