Hi Hang,

The BookKeeper replication protocol does not tolerate a bookie losing entries that it has acknowledged as safely stored. Ledger recovery can end up truncating ledgers, which leads to data loss that not even the auditor check can repair. So shrinking or expanding the directories this way is fundamentally unsafe.

Ivan Kelly and I have worked on making BK run without the journal, which can also lead to a bookie losing entries it has acknowledged as safely stored. This required some changes to make it safe from ledger truncation during ledger recovery and to allow bookies to repair themselves. I will start submitting PRs for this work this week. Once those changes are in, we could look at using it to make the expand/shrink operations safe.

The alternative is to rewrite the ledgers so that existing ledgers are placed in the correct directories before the bookie completes its boot process.

Jack

On Tue, Nov 2, 2021 at 5:17 AM Hang Chen <chenh...@apache.org> wrote:

> Hi Jack,
> Currently, if we use multi directories for journal or ledger in
> one bookie, it will store specific ledger into target directory by
> `ledgerId % numberOfLedgers`. If we expand or shrink the ledgers or
> journal directories, it will break hash result value, which will lead
> to some ledgers can't find the target storage directory instance and
> read ledger failed. The case can be addressed by auditor check.
> In production BookKeeper cluster, if we use multi directories for
> journal or ledger in one bookie, and disk errors occur, it will lead
> to bookie shut down and can't startup unless we shrink the error disk
> for configuration. After the error disk came back, we should expand
> the disk to the bookie.
>
> Thanks,
> Hang
>
> Jack Vanlightly <jvanligh...@splunk.com.invalid> wrote on Mon, Nov 1, 2021 at 6:15 PM:
> >
> > Hi all,
> >
> > I thought I'd test the PR https://github.com/apache/bookkeeper/pull/2871 as
> > I hadn't used storage expansion at all. It seemed to work but I ran a
> > correctness test just in case and found that it "lost" 50% of my ledgers.
> >
> > Looking at the code to my surprise it does not repartition the data across
> > the directories, which explained why 50% of the ledgers were "gone". I
> > expanded from one to two ledger dirs, so all the even ledger ids were fine,
> > but the odd ledger id read operations got routed to the new directory which
> > of course was empty. All the ledger data was still all in the original
> > ledger directory.
> >
> > So either I am not understanding the use case for storage expansion (i.e.
> > you can only do it on an empty bookie) or this feature is majorly flawed.
> >
> > Please confirm either way. I'll create an issue, if it is indeed flawed.
> >
> > Jack
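
For reference, here is a minimal sketch of the routing problem discussed in this thread. It assumes only the `ledgerId % numberOfLedgers` rule quoted above. The class and method names (`DirRoutingSketch`, `dirIndex`, `misrouted`) are hypothetical and are not taken from the BookKeeper code base.

```java
import java.util.ArrayList;
import java.util.List;

public class DirRoutingSketch {

    // Hypothetical helper: picks a directory index with the modulo rule
    // quoted in this thread. Not the actual BookKeeper implementation.
    public static int dirIndex(long ledgerId, int numDirs) {
        return (int) Math.floorMod(ledgerId, (long) numDirs);
    }

    // Returns the ledger ids whose target directory changes when the
    // number of directories goes from oldDirs to newDirs. Reads for these
    // ledgers would be routed to a directory that does not hold their data.
    public static List<Long> misrouted(List<Long> ledgerIds, int oldDirs, int newDirs) {
        List<Long> moved = new ArrayList<>();
        for (long id : ledgerIds) {
            if (dirIndex(id, oldDirs) != dirIndex(id, newDirs)) {
                moved.add(id);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 10; id++) {
            ids.add(id);
        }
        // Expanding from one to two directories: every odd ledger id moves.
        System.out.println(misrouted(ids, 1, 2)); // prints [1, 3, 5, 7, 9]
    }
}
```

With one directory every ledger maps to index 0. With two, the odd ids map to index 1, which is the empty new directory, so half of the ledgers appear to be missing even though their data is still in the original directory.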