On Mon, May 15, 2023 at 10:50 Hang Chen <chenh...@apache.org> wrote:
>
> > When flushing, only the lastMark value will be persisted to the file,
> > but the lastMark value will not be updated.
> > The lastMark value is updated only when the ForceWriteRequest
> > completes. So when the flush is triggered here, the position of
> > lastMark is not the 100MB offset.
>
> Yes, you are right. But the persisted lastMark value will be close to
> the latest lastMark value. Even though it is not the 100MB offset, it
> may be the (100MB - 1) offset.
>
> The root cause of this bug is that EntryLogger1 triggers a checkpoint
> when its write cache is full, updating both EntryLogger1's and
> EntryLogger2's `lastMark` positions. However, EntryLogger2's data may
> still be in the WriteCache, which can lead to data loss when the
> bookie is shut down with `kill -9`.
>
> There are two solutions for this bug.
>
> #### Update `lastMark` position individually
> - When EntryLogger1 triggers the checkpoint, we only update
>   EntryLogger1's `lastMark` position instead of also updating
>   EntryLogger2's `lastMark` position at the same time.
> - When SyncThread triggers the checkpoint, we update all the
>   EntryLoggers' `lastMark` positions.
> - When determining whether a journal file can be deleted, we should
>   take the smallest `lastMark` position among all the writable
>   EntryLoggers and delete only the journal files below that position.
> - When replaying the journal on bookie startup, we need to take the
>   smallest `lastMark` position and replay the journal from that
>   position; otherwise, we will lose data.
>
> However, one case is hard to handle during journal replay. When a
> ledger disk transitions from ReadOnly to Writable mode, its `lastMark`
> position is an old value. Using that old position to replay the
> journal can lead to a "target journal file not found" exception.
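To make solution #1 concrete, here is a minimal sketch of per-EntryLogger marks. All class and method names (`EntryLoggerMarks`, `safeJournalPosition`, etc.) are invented for illustration and are not BookKeeper's actual API:

```java
import java.util.Arrays;

public class EntryLoggerMarks {
    // Journal offset persisted in each EntryLogger's lastMark file.
    private final long[] lastMark;
    // Whether each EntryLogger's ledger disk is currently writable.
    private final boolean[] writable;

    public EntryLoggerMarks(int numLoggers) {
        this.lastMark = new long[numLoggers];
        this.writable = new boolean[numLoggers];
        Arrays.fill(writable, true);
    }

    // A write-cache-full flush checkpoints ONLY the flushing logger.
    public void onWriteCacheFlush(int loggerId, long journalOffset) {
        lastMark[loggerId] = journalOffset;
    }

    // SyncThread's periodic checkpoint advances every logger's mark.
    public void onSyncThreadCheckpoint(long journalOffset) {
        Arrays.fill(lastMark, journalOffset);
    }

    public void setWritable(int loggerId, boolean isWritable) {
        writable[loggerId] = isWritable;
    }

    // Journal files strictly below this offset are safe to delete, and
    // startup replay must begin here: the smallest lastMark among the
    // writable loggers.
    public long safeJournalPosition() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < lastMark.length; i++) {
            if (writable[i] && lastMark[i] < min) {
                min = lastMark[i];
            }
        }
        return min == Long.MAX_VALUE ? 0L : min;
    }
}
```

With this scheme, a flush of ledger1 alone never lets the journal advance past ledger2's unflushed data, because `safeJournalPosition()` stays at ledger2's older mark until SyncThread checkpoints everything.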
>
> #### Only update `lastMark` position in SyncThread
> Two places can trigger a checkpoint:
> - The scheduled task in SyncThread.doCheckpoint
> - A ledger directory's write cache filling up and flushing
>
> The second path is the root cause of the data loss when the bookie is
> configured with multiple ledger directories.
> We can disable the `lastMark` update on the second path and let only
> SyncThread update the `lastMark` position during its checkpoint when
> multiple ledger directories are configured.
>
> This is the simplest way to fix this bug, but it has two drawbacks.
> - The `lastMark` position updates depend on SyncThread's checkpoint
>   interval. In Pulsar, the default interval is 60s, which means a
>   journal file can only expire after at least 60s.
> - On startup, the bookie replays the journal from the `lastMark`
>   position, so it will replay at least 60s worth of journal data
>   before startup completes. This may slow down bookie startup.
>
> IMO, the above two drawbacks are acceptable compared to data loss.
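A minimal sketch of solution #2's gating rule, under the assumption that every checkpoint call can tell which path triggered it (class and method names here are hypothetical, not BookKeeper code):

```java
public class CheckpointPolicy {
    public enum Trigger { SYNC_THREAD, WRITE_CACHE_FULL }

    private final int numLedgerDirs;
    private long persistedLastMark;

    public CheckpointPolicy(int numLedgerDirs) {
        this.numLedgerDirs = numLedgerDirs;
    }

    // With a single ledger dir the old behavior is safe; with multiple
    // dirs, only SyncThread's checkpoint may advance lastMark.
    public boolean mayAdvanceLastMark(Trigger trigger) {
        return numLedgerDirs == 1 || trigger == Trigger.SYNC_THREAD;
    }

    public void checkpoint(Trigger trigger, long journalOffset) {
        // The flush of entry-log data itself still happens on both
        // paths; only the lastMark update is gated.
        if (mayAdvanceLastMark(trigger)) {
            persistedLastMark = journalOffset;
        }
    }

    public long persistedLastMark() {
        return persistedLastMark;
    }
}
```

The 60s worst case in the drawbacks above falls directly out of this rule: `persistedLastMark` only moves when the SyncThread's scheduled checkpoint fires.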
I agree

Thanks
Enrico

>
> Thanks,
> Hang
>
> On Fri, Jun 10, 2022 at 13:31, Gavin Gao <zhangmin...@apache.org> wrote:
> >
> > The problem is: the flush when the writeCache is full happens per
> > ledger disk independently, but the disks share the same journal
> > lastMark value.
> >
> > On 2022/06/07 04:16:58 lordcheng10 wrote:
> > > > In flushing the write cache, it will trigger a checkpoint to
> > > > mark the journal's lastMark position (the 100MB offset)
> > >
> > > When flushing, only the lastMark value will be persisted to the
> > > file, but the lastMark value will not be updated.
> > > The lastMark value is updated only when the ForceWriteRequest
> > > completes. So when the flush is triggered here, the position of
> > > lastMark is not the 100MB offset.
> > >
> > > I'm not sure whether I missed some logic.
> > >
> > > ------------------ Original message ------------------
> > > From: "Hang Chen" <chenh...@apache.org>
> > > Sent: Monday, May 30, 2022, 9:21 AM
> > > To: "dev" <dev@bookkeeper.apache.org>
> > > Subject: [Discuss] Bookie may lose data even though we turn on
> > > fsync for the journal
> > >
> > > We found one place where the bookie may lose data even though we
> > > turn on fsync for the journal.
> > > Condition:
> > > - One journal disk, with fsync turned on for the journal
> > > - Two ledger disks configured, ledger1 and ledger2
> > >
> > > Assume we write 100MB of data into one bookie: 70MB goes into
> > > ledger1's write cache and 30MB into ledger2's write cache.
> > > Ledger1's write cache becomes full and triggers a flush. While
> > > flushing the write cache, it triggers a checkpoint that marks the
> > > journal's lastMark position (the 100MB offset) and writes that
> > > position into both ledger1's and ledger2's lastMark files.
> > >
> > > At this time, the bookie shuts down without flushing the write
> > > cache, e.g. killed by `kill -9`, and ledger2's write cache (30MB)
> > > is not flushed to the ledger disk.
> > > But ledger2's lastMark position, persisted in its lastMark file,
> > > has already been updated to the 100MB offset.
> > >
> > > When the bookie starts up, the journal replay position will be
> > > `min(ledger1's lastMark, ledger2's lastMark)`, which is the 100MB
> > > offset. Ledger2's 30MB of data won't be replayed, and that data
> > > will be lost.
> > >
> > > Please help take a look. I'm not sure whether I missed some logic.
> > >
> > > Thanks,
> > > Hang
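The loss scenario quoted above can be reduced to a toy computation of the replay start position (illustrative numbers in MB, not real BookKeeper code):

```java
public class ReplayScenario {
    // Startup replays the journal starting from min(lastMark) across
    // all ledger directories.
    public static long replayStart(long... lastMarks) {
        long min = Long.MAX_VALUE;
        for (long mark : lastMarks) {
            if (mark < min) {
                min = mark;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // Buggy behavior: ledger1's flush stamps BOTH dirs at the
        // 100MB offset, although ledger2's 30MB is still only in its
        // write cache. Replay starts at 100MB and the 30MB is lost.
        long buggy = replayStart(100L, 100L);
        System.out.println("buggy replay start = " + buggy + "MB");

        // With per-logger marks (solution #1), ledger2's mark stays
        // at its last real flush (0 here), so replay recovers the
        // 30MB from the journal.
        long fixed = replayStart(100L, 0L);
        System.out.println("fixed replay start = " + fixed + "MB");
    }
}
```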