I think this feature is somewhat specialized rather than generic, and there are
risks:
1.  If you intend to take node A offline (whose data has already been migrated)
but mistakenly submit node B, this function will directly take node B offline,
which is likely to cause online failures.
2.  If the node being taken offline suddenly starts receiving traffic, how
should that be handled? It could easily cause the loss of cluster replicas.

Please help explain how these two problems can be avoided.

lordcheng10 <1572139...@qq.com.invalid> wrote on Thu, Aug 25, 2022 at 17:12:

> Hi Bookkeeper Community,
>
>
> This is a BP discussion on Support non-stop bookie data migration and
> bookie offline.
> The issue can be found at:
> https://github.com/apache/bookkeeper/issues/3456
>
>
> I copy the content here for convenience, any suggestions are welcome and
> appreciated.
>
>
>
>
> ### Motivation
> The current bookie offline steps are:
> 1. Log on to the bookie node and check whether there are underreplicated
> ledgers. If there are, the decommission command will force them to be
> replicated: bin/bookkeeper shell listunderreplicated
> 2. Stop the bookie: bin/bookkeeper-daemon.sh stop bookie
> 3. Run the decommission command. If you have logged onto the node you wish
> to decommission, you don't need to provide -bookieid. If you are running
> the decommission command for a target bookie node from another bookie
> node, you should pass the target bookie id in the arguments to -bookieid:
> bin/bookkeeper shell decommissionbookie or bin/bookkeeper shell
> decommissionbookie -bookieid <target bookieid>
> 4. Validate that there are no ledgers on the decommissioned bookie:
> bin/bookkeeper shell listledgers -bookieid <target bookieid>
>
>
> With the current bookie offline solution, we need to stop the bookie
> first, execute the decommission command, and wait for the ledger migration
> on that bookie to complete.
>
>
> This makes it very time-consuming to offline a single bookie node. When we
> need to offline many bookie nodes, the time cost of this solution is not
> acceptable.
>
>
> Therefore, we need a solution that can migrate data without stopping the
> bookie, so that bookie nodes can be offlined in batches.
>
>
> ### Proposal
> To solve this problem, we propose a solution that replicates the data
> without stopping the bookie.
> The process is as follows:
> 1. Submit the bookie nodes to be offlined;
> 2. Traverse the ledgers on each offline bookie, and persist these ledgers
> together with the corresponding offline bookie nodes to the zookeeper
> directory ledgers/offline_ledgers/ledgerId;
> 3. Get a ledger to be migrated;
> 4. Traverse all fragments of that ledger, and filter out the fragments
> whose ensemble contains an offline bookie;
> 5. Copy the data of each such fragment;
> 6. Once all of a ledger's fragments are copied, delete the corresponding
> ledgers/offline_ledgers/ledgerId node;
> 7. When all ledgerId nodes under ledgers/offline_ledgers have been
> deleted, the data migration is complete, and the bookies can be stopped
> and taken offline in batches;
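
To make the workflow above concrete, here is a rough Python simulation of the
znode lifecycle it describes. A plain dict stands in for zookeeper, and all
names (submit_offline_bookies, migration_done, and so on) are illustrative
assumptions, not the actual BookKeeper implementation:

```python
# Sketch of the proposed zookeeper-based migration bookkeeping.
# A dict stands in for the ledgers/offline_ledgers zk directory.

def submit_offline_bookies(zk, ledger_index, offline_bookies):
    """Persist every ledger that has a copy on an offline bookie, mapped
    to the offline bookies it must be migrated away from."""
    for ledger_id, bookies in ledger_index.items():
        hit = sorted(set(bookies) & set(offline_bookies))
        if hit:
            zk[f"ledgers/offline_ledgers/{ledger_id}"] = ",".join(hit)

def replicate_ledger(zk, ledger_id):
    """Once all fragments of the ledger are copied, delete its znode so
    progress is visible under ledgers/offline_ledgers."""
    zk.pop(f"ledgers/offline_ledgers/{ledger_id}", None)

def migration_done(zk):
    """Migration is complete once every ledgerId znode is gone."""
    return not any(k.startswith("ledgers/offline_ledgers/") for k in zk)

# Example: ledger 7 has a copy on b2 (the bookie being offlined), ledger 9 does not.
zk = {}
ledger_index = {7: ["b1", "b2", "b3"], 9: ["b1", "b3", "b4"]}
submit_offline_bookies(zk, ledger_index, ["b2"])
assert "ledgers/offline_ledgers/7" in zk
assert "ledgers/offline_ledgers/9" not in zk
replicate_ledger(zk, 7)
assert migration_done(zk)  # safe to stop b2 now
```

Tracking progress as per-ledger znodes means the check in the final step is
simply "is the directory empty", and a crashed worker leaves the remaining
ledgers behind for a retry.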
>
>
> To achieve our goal, we need to implement two things:
> 1. A command to submit the bookies to be offlined and their corresponding
> ledgers, for example:
> bin/bookkeeper shell decommissionbookie -offline_bookieids
> bookieId1,bookieId2,bookieId3,...bookieIdN
> This command writes all ledgers on the offline bookie nodes to the
> zookeeper directory, for example: put ledgers/offline_ledgers/ledgerId
> bookId1,bookId2,...bookIdn;
> 2. A ReassignLedgerWorker class to perform the actual ledger replication:
> this class obtains a ledger from the zookeeper directory
> ledgers/offline_ledgers and replicates it.
> It first filters out all the fragments of that ledger whose ensemble
> contains an offline bookieId, then copies those fragments;
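
As a sketch of what the worker's filtering step might look like: the fragment
representation below (a start entry plus an ensemble list) is a simplified
assumption for illustration, not the real BookKeeper LedgerFragment API, and
replacement_ensemble is a hypothetical helper showing where the copy target
would come from:

```python
# Illustrative sketch: select the fragments whose ensemble contains an
# offline bookie, then pick a replacement target for each.

def fragments_to_copy(fragments, offline_bookies):
    """fragments: list of (start_entry, ensemble) pairs; return those
    that hold a copy on any offline bookie."""
    offline = set(offline_bookies)
    return [f for f in fragments if offline & set(f[1])]

def replacement_ensemble(ensemble, offline_bookies, spare_bookies):
    """Swap each offline bookie in the ensemble for a spare bookie that
    is not already in the ensemble."""
    offline = set(offline_bookies)
    spares = [b for b in spare_bookies if b not in ensemble]
    return [spares.pop(0) if b in offline else b for b in ensemble]

fragments = [(0, ["b1", "b2", "b3"]), (100, ["b1", "b3", "b4"])]
assert fragments_to_copy(fragments, ["b2"]) == [(0, ["b1", "b2", "b3"])]
assert replacement_ensemble(["b1", "b2", "b3"], ["b2"], ["b5", "b6"]) == ["b1", "b5", "b3"]
```

Only the first fragment is touched, since the second never placed a copy on
the offline bookie; the untouched fragments keep their original ensembles.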
