Yang,

On Sat, Aug 27, 2022, 11:05 Yang Yang <fantaps...@gmail.com> wrote:
> For the short term, you can see if the `recover` command is able to help
> you:
> https://bookkeeper.apache.org/docs/reference/cli#bookkeeper-shell-recover
>
> For the long term, I have proposed a solution to mark the bookie to be
> decommissioned in a `draining` state and let the autorecovery mechanism
> replicate the ledgers, please take a look and see if it could solve your
> use case: https://lists.apache.org/thread/1l9kzb1l0vok105gj2ody3g8nyv7s9l8

Would you be able to continue that proposal?

Enrico

>
> Best regards,
> Yang Yang
>
>
> On Thu, Aug 25, 2022 at 6:32 PM steven lu <lushiji2...@gmail.com> wrote:
>
> > I think this feature is somewhat custom and not very generic, and there
> > are risks:
> > 1. If you want to take node A offline (which has already been
> > extracted) but specify node B by mistake, this function will take node B
> > offline directly, which is likely to cause online failures.
> > 2. If the node being taken offline suddenly receives traffic, how should
> > that be handled? It could easily cause the loss of cluster replicas.
> >
> > How can these two problems be avoided? Please help explain.
> >
> > lordcheng10 <1572139...@qq.com.invalid> wrote on Thu, Aug 25, 2022 at
> > 17:12:
> >
> > > Hi Bookkeeper Community,
> > >
> > > This is a BP discussion on "Support non-stop bookie data migration and
> > > bookie offline".
> > > The issue can be found at:
> > > https://github.com/apache/bookkeeper/issues/3456
> > >
> > > I copy the content here for convenience; any suggestions are welcome
> > > and appreciated.
> > >
> > >
> > > ### Motivation
> > > The current steps to take a bookie offline are:
> > > 1. Log on to the bookie node and check whether there are
> > > underreplicated ledgers. If there are, the decommission command will
> > > force them to be replicated:
> > >     $ bin/bookkeeper shell listunderreplicated
> > > 2. Stop the bookie:
> > >     $ bin/bookkeeper-daemon.sh stop bookie
> > > 3. Run the decommission command. If you have logged on to the node you
> > > wish to decommission, you don't need to provide -bookieid. If you are
> > > running the decommission command for the target bookie node from
> > > another bookie node, you should pass the target bookie id in the
> > > -bookieid argument:
> > >     $ bin/bookkeeper shell decommissionbookie
> > > or
> > >     $ bin/bookkeeper shell decommissionbookie -bookieid <target bookieid>
> > > 4. Validate that there are no ledgers on the decommissioned bookie:
> > >     $ bin/bookkeeper shell listledgers -bookieid <target bookieid>
> > >
> > > With the current solution, we need to stop the bookie first, execute
> > > the decommission command, and wait for the ledger migration on the
> > > bookie to complete.
> > >
> > > Taking a bookie node offline this way is very time-consuming. When we
> > > need to take many bookie nodes offline, the time required is not
> > > acceptable.
> > >
> > > Therefore, we need a solution that can migrate data without stopping
> > > the bookie, so that bookie nodes can be taken offline in batches.
> > >
> > >
> > > ### Proposal
> > > To solve this problem, we propose a solution that can replicate data
> > > without stopping the bookie.
> > > The process is as follows:
> > > 1. Submit the bookie nodes to be taken offline;
> > > 2. Traverse each ledger on the offline bookie, and persist these
> > > ledgers and the corresponding offline bookie nodes to the zookeeper
> > > directory: ledgers/offline_ledgers/ledgerId;
> > > 3. Get a ledger to be migrated;
> > > 4. Traverse all fragments of that ledger, and select the fragments that
> > > have a replica on the offline bookie;
> > > 5. Copy the data for each of these fragments;
> > > 6. When the ledger's fragments have been copied, delete the
> > > corresponding ledgers/offline_ledgers/ledgerId;
> > > 7. When all ledgerId directories under ledgers/offline_ledgers are
> > > deleted, the data has been migrated, and you can stop the bookies in
> > > batches and take them offline.
> > >
> > >
> > > To achieve our goal, we need two things:
> > > 1. Implement a command to submit the bookies to be taken offline and
> > > the corresponding ledgers, for example:
> > >     bin/bookkeeper shell decommissionbookie -offline_bookieids bookieId1,bookieId2,bookieId3,...bookieIdN
> > > This command will write all ledgers on the offline bookie nodes to the
> > > zookeeper directory, for example:
> > >     put ledgers/offline_ledgers/ledgerId bookId1,bookId2,...bookIdn;
> > > 2. Design a ReassignLedgerWorker class to perform the actual ledger
> > > replication:
> > > this class will obtain a ledger from the zookeeper directory
> > > ledgers/offline_ledgers for replication.
> > > It will first select all the fragments of the ledger that contain the
> > > offline bookieId, then copy these fragments;
> > >
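
To make the quoted proposal easier to discuss, here is a rough in-memory sketch of the flow it describes. It is not BookKeeper code, only an illustration in Python under several assumptions: the names Fragment, Cluster, submit_offline, reassign_one and drain are hypothetical, a plain dict stands in for the ledgers/offline_ledgers znodes, each marker is assumed to store the full list of submitted bookie ids, and "copying" a fragment is reduced to swapping the offline bookie for a spare one in the fragment's ensemble.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    first_entry: int
    last_entry: int
    ensemble: list  # bookie ids that hold a replica of this fragment


@dataclass
class Cluster:
    ledgers: dict   # ledger id -> list of Fragment
    bookies: set    # ids of all bookies currently in the cluster
    # Stand-in for the znodes ledgers/offline_ledgers/<ledgerId> (assumption:
    # the value is the full list of bookies submitted for offlining).
    offline_ledgers: dict = field(default_factory=dict)


def submit_offline(cluster, offline_bookies):
    """Record a marker for every ledger with a replica on an offline bookie."""
    offline = set(offline_bookies)
    for ledger_id, fragments in cluster.ledgers.items():
        if any(not offline.isdisjoint(f.ensemble) for f in fragments):
            cluster.offline_ledgers[ledger_id] = sorted(offline)


def reassign_one(cluster):
    """Migrate one pending ledger; return its id, or None if nothing is left."""
    if not cluster.offline_ledgers:
        return None
    ledger_id = min(cluster.offline_ledgers)
    offline = set(cluster.offline_ledgers[ledger_id])
    for frag in cluster.ledgers[ledger_id]:
        if offline.isdisjoint(frag.ensemble):
            continue  # this fragment has no replica on an offline bookie
        # Spare bookies: in the cluster, not being offlined, not already used.
        spares = sorted(cluster.bookies - offline - set(frag.ensemble))
        for i, bookie in enumerate(frag.ensemble):
            if bookie in offline:
                if not spares:
                    raise RuntimeError("no spare bookie for ledger %d" % ledger_id)
                # A real worker would copy the entries before this swap.
                frag.ensemble[i] = spares.pop(0)
    # The ledger is done, so its marker is deleted.
    del cluster.offline_ledgers[ledger_id]
    return ledger_id


def drain(cluster, offline_bookies):
    """Run the whole flow; True means the bookies can be stopped safely."""
    submit_offline(cluster, offline_bookies)
    while reassign_one(cluster) is not None:
        pass
    return not cluster.offline_ledgers
```

Calling drain(cluster, ["b1"]) on a small Cluster returns True once no marker is left, which corresponds to the point where the proposal says the bookies can be stopped in batches. The sketch ignores the two risks raised earlier in the thread (a mistyped bookie id, and new writes landing on a bookie while it is being drained); a real implementation would need to address both.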