For the short term, you can check whether the `recover` command helps: https://bookkeeper.apache.org/docs/reference/cli#bookkeeper-shell-recover
For the long term, I have proposed a solution that marks a bookie to be decommissioned with a `draining` state and lets the autorecovery mechanism replicate its ledgers. Please take a look and see whether it covers your use case: https://lists.apache.org/thread/1l9kzb1l0vok105gj2ody3g8nyv7s9l8

Best regards,
Yang Yang

On Thu, Aug 25, 2022 at 6:32 PM steven lu <lushiji2...@gmail.com> wrote:
> I think this feature is somewhat custom and not very generic, and there are
> risks:
> 1. If you want to take node A offline but mistakenly specify node B, this
> feature will take node B offline directly, which is likely to cause an
> online failure.
> 2. If the node being taken offline suddenly receives traffic, how should
> that be handled? It could easily lead to the loss of replicas in the
> cluster.
>
> Please help explain how these two problems can be avoided.
>
> lordcheng10 <1572139...@qq.com.invalid> wrote on Thu, Aug 25, 2022 at 17:12:
>
> > Hi Bookkeeper Community,
> >
> > This is a BP discussion on supporting non-stop bookie data migration and
> > bookie offlining. The issue can be found at:
> > https://github.com/apache/bookkeeper/issues/3456
> >
> > I copy the content here for convenience; any suggestions are welcome and
> > appreciated.
> >
> > ### Motivation
> > Current bookie offline steps:
> > 1. Log on to the bookie node and check whether there are underreplicated
> > ledgers. If there are, the decommission command will force them to be
> > replicated: bin/bookkeeper shell listunderreplicated
> > 2. Stop the bookie: bin/bookkeeper-daemon.sh stop bookie
> > 3. Run the decommission command.
> > If you have logged onto the node you wish to decommission, you don't
> > need to provide -bookieid. If you are running the decommission command
> > for the target bookie node from another bookie node, you should pass the
> > target bookie id via the -bookieid argument:
> > bin/bookkeeper shell decommissionbookie
> > or
> > bin/bookkeeper shell decommissionbookie -bookieid <target bookieid>
> > 4. Validate that there are no ledgers on the decommissioned bookie:
> > bin/bookkeeper shell listledgers -bookieid <target bookieid>
> >
> > With the current bookie offline solution, we need to stop the bookie
> > first, execute the decommission command, and wait for the migration of
> > the ledgers on that bookie to complete.
> >
> > This makes taking a bookie node offline very time-consuming. When we
> > need to take many bookie nodes offline, the time this solution requires
> > is not acceptable.
> >
> > Therefore, we need a solution that can migrate data without stopping the
> > bookie, so that bookie nodes can be taken offline in batches.
> >
> > ### Proposal
> > To address this problem, we propose a solution that replicates the data
> > without stopping the bookie. The process is as follows:
> > 1. Submit the bookie node to be taken offline;
> > 2. Traverse the ledgers on the offline bookie, and persist each ledger
> > and its corresponding offline bookie nodes to the zookeeper directory
> > ledgers/offline_ledgers/ledgerId;
> > 3. Get a ledger to be migrated;
> > 4. Traverse all fragments of the ledger, and filter out the fragments
> > that contain a copy on the offline bookie;
> > 5. Copy the data of each such fragment;
> > 6. When all of a ledger's fragments have been copied, delete the
> > corresponding ledgers/offline_ledgers/ledgerId;
> > 7. When all ledgerId directories under ledgers/offline_ledgers have been
> > deleted, the data migration is complete, and the bookies can be stopped
> > and taken offline in batches.
> >
> > To achieve our goal, we need to implement two things:
> > 1.
Implement a command that submits the bookie to be taken offline and its
> > corresponding ledgers, for example:
> > bin/bookkeeper shell decommissionbookie -offline_bookieids
> > bookieId1,bookieId2,bookieId3,...bookieIdN
> > This command writes all ledgers on the offline bookie nodes to the
> > zookeeper directory, for example:
> > put ledgers/offline_ledgers/ledgerId bookId1,bookId2,...bookIdn;
> > 2. Design a ReassignLedgerWorker class to perform the actual ledger
> > replication:
> > this class obtains a ledger from the zookeeper directory
> > ledgers/offline_ledgers for replication. It first filters out all the
> > fragments of that ledger that contain an offline bookieId, then copies
> > those fragments;
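The fragment-filtering step that the proposed worker would perform can be sketched as below. This is a minimal illustration of the selection logic only, not BookKeeper code: the class and method names (`ReassignSketch`, `fragmentsNeedingCopy`) are hypothetical, and each fragment is reduced to just the set of bookie ids in its ensemble.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the fragment-selection step of the proposed
// ReassignLedgerWorker; names and shapes are illustrative, not BookKeeper API.
public class ReassignSketch {

    // Each fragment is modeled by the set of bookie ids in its ensemble.
    // Returns the indexes of fragments that must be re-replicated because
    // their ensemble contains at least one bookie being taken offline.
    static List<Integer> fragmentsNeedingCopy(List<Set<String>> fragmentEnsembles,
                                              Set<String> offlineBookies) {
        List<Integer> toCopy = new ArrayList<>();
        for (int i = 0; i < fragmentEnsembles.size(); i++) {
            if (!Collections.disjoint(fragmentEnsembles.get(i), offlineBookies)) {
                toCopy.add(i);
            }
        }
        return toCopy;
    }

    public static void main(String[] args) {
        // A ledger with three fragments; "bookie-3:3181" is being drained.
        List<Set<String>> fragments = List.of(
                Set.of("bookie-1:3181", "bookie-2:3181"),
                Set.of("bookie-2:3181", "bookie-3:3181"),
                Set.of("bookie-4:3181", "bookie-5:3181"));
        Set<String> offline = Set.of("bookie-3:3181");
        System.out.println(fragmentsNeedingCopy(fragments, offline)); // prints [1]
    }
}
```

Only the middle fragment overlaps the drained bookie, so only it would be copied; the other fragments keep their existing replicas untouched, which is what makes the migration cheap relative to a full decommission.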