On Mon, May 15, 2023 at 10:50 Hang Chen <chenh...@apache.org> wrote:
>
> > When flushing, only the lastMark value will be persisted to the file,
> > but the lastMark value will not be updated.
> > The lastMark value is updated only when the ForceWriteRequest
> > completes. So when the flush is triggered here, the position of
> > lastMark is not the 100MB offset.
>
> Yes, you are right. But the persisted lastMark value will be close to
> the latest lastMark value. Even though it is not the 100MB offset, it
> may be the (100MB - 1) offset.
>
> The root cause of this bug is that EntryLogger1 triggers a checkpoint
> when its write cache is full, updating both EntryLogger1's and
> EntryLogger2's `lastMark` positions. However, EntryLogger2's data may
> still be in the WriteCache, which can lead to data loss when the
> bookie is shut down with `kill -9`.
>
> There are two solutions for this bug.
>
> #### Update `lastMark` position individually
> - When EntryLogger1 triggers the checkpoint, we only update
>   EntryLogger1's `lastMark` position instead of also updating
>   EntryLogger2's `lastMark` position at the same time.
> - When SyncThread triggers the checkpoint, we update all the
>   EntryLoggers' `lastMark` positions.
> - When determining whether a journal file can be deleted, we should
>   take the smallest `lastMark` position among all the writable
>   EntryLoggers and delete only the journal files below that position.
> - When replaying the journal on bookie startup, we need to take the
>   smallest `lastMark` position and replay the journal from that
>   position; otherwise, we will lose data.
>
> However, one case is hard to handle during journal replay. When a
> ledger disk transitions from ReadOnly to Writable mode, its `lastMark`
> position is an old value. Using that old position to replay the
> journal can lead to a "target journal file not found" exception.
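To make solution #1 concrete, here is a minimal sketch of per-EntryLogger marks. All class and method names (`EntryLoggerMarks`, `safeJournalPosition`, etc.) are invented for illustration and are not BookKeeper's actual API:

```java
import java.util.Arrays;

public class EntryLoggerMarks {
    // Journal offset persisted in each EntryLogger's lastMark file.
    private final long[] lastMark;
    // Whether each EntryLogger's ledger disk is currently writable.
    private final boolean[] writable;

    public EntryLoggerMarks(int numLoggers) {
        this.lastMark = new long[numLoggers];
        this.writable = new boolean[numLoggers];
        Arrays.fill(writable, true);
    }

    // A write-cache-full flush checkpoints ONLY the flushing logger.
    public void onWriteCacheFlush(int loggerId, long journalOffset) {
        lastMark[loggerId] = journalOffset;
    }

    // SyncThread's periodic checkpoint advances every logger's mark.
    public void onSyncThreadCheckpoint(long journalOffset) {
        Arrays.fill(lastMark, journalOffset);
    }

    public void setWritable(int loggerId, boolean isWritable) {
        writable[loggerId] = isWritable;
    }

    // Journal files strictly below this offset are safe to delete, and
    // startup replay must begin here: the smallest lastMark among the
    // writable loggers.
    public long safeJournalPosition() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < lastMark.length; i++) {
            if (writable[i] && lastMark[i] < min) {
                min = lastMark[i];
            }
        }
        return min == Long.MAX_VALUE ? 0L : min;
    }
}
```

With this scheme, a flush of ledger1 alone never lets the journal advance past ledger2's unflushed data, because `safeJournalPosition()` stays at ledger2's older mark until SyncThread checkpoints everything.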
>
> #### Only update `lastMark` position in SyncThread
> Two places can trigger a checkpoint:
> - The scheduled task in SyncThread.doCheckpoint
> - A ledger directory's write cache filling up and flushing
>
> The second path is the root cause of the data loss when the bookie is
> configured with multiple ledger directories.
> We can disable the `lastMark` update on the second path and let only
> SyncThread update the `lastMark` position during its checkpoint when
> multiple ledger directories are configured.
>
> This is the simplest way to fix this bug, but it has two drawbacks.
> - The `lastMark` position updates depend on SyncThread's checkpoint
>   interval. In Pulsar, the default interval is 60s, which means a
>   journal file can only expire after at least 60s.
> - On startup, the bookie replays the journal from the `lastMark`
>   position, so it will replay at least 60s worth of journal data
>   before startup completes. This may slow down bookie startup.
>
> IMO, the above two drawbacks are acceptable compared to data loss.
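A minimal sketch of solution #2's gating rule, under the assumption that every checkpoint call can tell which path triggered it (class and method names here are hypothetical, not BookKeeper code):

```java
public class CheckpointPolicy {
    public enum Trigger { SYNC_THREAD, WRITE_CACHE_FULL }

    private final int numLedgerDirs;
    private long persistedLastMark;

    public CheckpointPolicy(int numLedgerDirs) {
        this.numLedgerDirs = numLedgerDirs;
    }

    // With a single ledger dir the old behavior is safe; with multiple
    // dirs, only SyncThread's checkpoint may advance lastMark.
    public boolean mayAdvanceLastMark(Trigger trigger) {
        return numLedgerDirs == 1 || trigger == Trigger.SYNC_THREAD;
    }

    public void checkpoint(Trigger trigger, long journalOffset) {
        // The flush of entry-log data itself still happens on both
        // paths; only the lastMark update is gated.
        if (mayAdvanceLastMark(trigger)) {
            persistedLastMark = journalOffset;
        }
    }

    public long persistedLastMark() {
        return persistedLastMark;
    }
}
```

The 60s worst case in the drawbacks above falls directly out of this rule: `persistedLastMark` only moves when the SyncThread's scheduled checkpoint fires.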
I agree

Thanks
Enrico

>
> Thanks,
> Hang
>
> On Fri, Jun 10, 2022 at 13:31, Gavin Gao <zhangmin...@apache.org> wrote:
> >
> > The problem is: the flush when the writeCache is full happens per
> > ledger disk independently, but the disks share the same journal
> > lastMark value.
> >
> > On 2022/06/07 04:16:58 lordcheng10 wrote:
> > > > In flushing the write cache, it will trigger a checkpoint to
> > > > mark the journal's lastMark position (the 100MB offset)
> > >
> > > When flushing, only the lastMark value will be persisted to the
> > > file, but the lastMark value will not be updated.
> > > The lastMark value is updated only when the ForceWriteRequest
> > > completes. So when the flush is triggered here, the position of
> > > lastMark is not the 100MB offset.
> > >
> > > I'm not sure whether I missed some logic.
> > >
> > > ------------------ Original message ------------------
> > > From: "Hang Chen" <chenh...@apache.org>
> > > Sent: Monday, May 30, 2022, 9:21 AM
> > > To: "dev" <dev@bookkeeper.apache.org>
> > > Subject: [Discuss] Bookie may lose data even though we turn on
> > > fsync for the journal
> > >
> > > We found one place where the bookie may lose data even though we
> > > turn on fsync for the journal.
> > > Condition:
> > > - One journal disk, with fsync turned on for the journal
> > > - Two ledger disks configured, ledger1 and ledger2
> > >
> > > Assume we write 100MB of data into one bookie: 70MB goes into
> > > ledger1's write cache and 30MB into ledger2's write cache.
> > > Ledger1's write cache becomes full and triggers a flush. While
> > > flushing the write cache, it triggers a checkpoint that marks the
> > > journal's lastMark position (the 100MB offset) and writes that
> > > position into both ledger1's and ledger2's lastMark files.
> > >
> > > At this time, the bookie shuts down without flushing the write
> > > cache, e.g. killed by `kill -9`, and ledger2's write cache (30MB)
> > > is not flushed to the ledger disk.
> > > But ledger2's lastMark position, persisted in its lastMark file,
> > > has already been updated to the 100MB offset.
> > >
> > > When the bookie starts up, the journal replay position will be
> > > `min(ledger1's lastMark, ledger2's lastMark)`, which is the 100MB
> > > offset. Ledger2's 30MB of data won't be replayed, and that data
> > > will be lost.
> > >
> > > Please help take a look. I'm not sure whether I missed some logic.
> > >
> > > Thanks,
> > > Hang
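The loss scenario quoted above can be reduced to a toy computation of the replay start position (illustrative numbers in MB, not real BookKeeper code):

```java
public class ReplayScenario {
    // Startup replays the journal starting from min(lastMark) across
    // all ledger directories.
    public static long replayStart(long... lastMarks) {
        long min = Long.MAX_VALUE;
        for (long mark : lastMarks) {
            if (mark < min) {
                min = mark;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // Buggy behavior: ledger1's flush stamps BOTH dirs at the
        // 100MB offset, although ledger2's 30MB is still only in its
        // write cache. Replay starts at 100MB and the 30MB is lost.
        long buggy = replayStart(100L, 100L);
        System.out.println("buggy replay start = " + buggy + "MB");

        // With per-logger marks (solution #1), ledger2's mark stays
        // at its last real flush (0 here), so replay recovers the
        // 30MB from the journal.
        long fixed = replayStart(100L, 0L);
        System.out.println("fixed replay start = " + fixed + "MB");
    }
}
```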