Hello everyone,

I have been using BookKeeper as part of Pulsar clusters for a while and
noticed that the process of decommissioning
<https://bookkeeper.apache.org/docs/latest/admin/decomission/> a bookie (or
the recover command
<https://bookkeeper.apache.org/docs/latest/reference/cli/>) is not very
operator-friendly and has some limitations:
1. the bookie to be decommissioned has to be stopped first, and the ledgers
on it become unavailable immediately
- there is a higher risk of data loss during the process, as some ledgers
will be under-replicated for a while
- the load on the remaining nodes may increase immediately because they have
more read requests to serve, including those needed to recover
under-replicated ledgers
- the process doesn't work for ledgers with an ensemble size of 1
2. decommissioning has to be performed one bookie at a time to avoid data
loss
- some ledgers might be re-replicated multiple times when removing multiple
bookies from the cluster
3. the re-replication is performed on the node that executes the
decommission command
- it would be more efficient and safer to leverage the auto-recovery system
and benefit from its improvements, e.g., auto-scaling, replication
throttling, etc.
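To illustrate point 2: a toy simulation (plain Python; the bookie and ledger names are made up for illustration) counting how many full-ledger copies sequential decommission triggers. A ledger whose ensemble spans both removed bookies is copied once per round:

```python
# Toy model: each ledger has an ensemble of bookies; decommissioning a
# bookie re-replicates every ledger whose ensemble contains it.
ledgers = {
    "L1": {"b1", "b2", "b3"},   # spans both bookies being removed
    "L2": {"b1", "b3", "b4"},
    "L3": {"b2", "b3", "b4"},
}
spares = iter(["b5", "b6", "b7", "b8"])  # replacement bookies

copies = {lid: 0 for lid in ledgers}

def decommission(bookie):
    """Remove one bookie, copying each ledger that lived on it."""
    for lid, ensemble in ledgers.items():
        if bookie in ensemble:
            ensemble.discard(bookie)
            ensemble.add(next(spares))  # re-replicate to a new bookie
            copies[lid] += 1            # one full ledger copy performed

# Removing b1 then b2 one after another:
decommission("b1")
decommission("b2")

print(copies)  # → {'L1': 2, 'L2': 1, 'L3': 1} — L1 is copied twice
```

Draining both bookies in parallel would only need one copy of L1's data.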

I think it would be better to have an improved process that re-replicates
the ledgers from the bookies to be removed in parallel, while those bookies
are still active in the cluster. That would make it much easier and safer
to operate a large-scale BookKeeper cluster. I found a proposal draft named
"BP-4 - BookKeeper Lifecycle Management
<https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management>"
that tried to address this issue, but it has not been accepted or
implemented yet:
- add a new bookie state called `draining`, similar to the `readonly`
state: the bookie can still serve read requests, but no new ledgers can be
allocated to it, while the auditor treats it as 'lost' and generates
re-replication tasks for all ledgers on it.
- once all ledgers on the `draining` bookie are fully replicated, the
bookie is safe to remove from the cluster.
- REST APIs should be added
  - to update the bookie state dynamically
  - to query whether all ledgers on the bookie have been drained

I'm not sure whether this proposal or similar issues have been discussed
before. It seems to me that the required code change is not large, while
the benefits would be significant. Any comments or suggestions are welcome,
and I could spend some time working on it if it is viable. Thanks!

Best regards,
Yang Yang
