> When flushing, only the lastMark value will be persisted to the file, but the 
> lastMark value will not be updated.
> The lastMark value is updated only when the ForceWriteRequest completes. So 
> when the flush is triggered here, the position of lastMark is not 100MB's 
> offset
Yes, you are right. But the persisted lastMark value will be close to
the latest lastMark value. Even though it is not exactly the 100MB
offset, it may be, say, the (100MB - 1) offset.
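
To make that timing concrete, here is a minimal sketch of the behavior
the quoted comment describes. All names here are hypothetical
simplifications, not the actual BookKeeper classes:

```java
// The in-memory lastMark only advances when a ForceWriteRequest
// completes; a checkpoint persists whatever value is current at that
// moment, so the persisted offset can trail the latest journal write.
class LastMarkTiming {
    private volatile long lastMark; // advanced on force-write completion

    void onForceWriteComplete(long syncedOffset) {
        lastMark = syncedOffset; // e.g. just below the 100MB offset
    }

    long snapshotForCheckpoint() {
        return lastMark; // may lag the newest journal entry slightly
    }
}
```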


The root cause of this bug is that EntryLogger1 triggers a checkpoint
when its write cache is full, updating both EntryLogger1's and
EntryLogger2's `lastMark` positions. However, EntryLogger2's data may
still be in the write cache, which may lead to data loss if the bookie
is shut down with `kill -9`.
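
To illustrate the flow, here is a minimal sketch of the buggy
checkpoint path, again with hypothetical simplified names rather than
the actual BookKeeper classes:

```java
import java.io.IOException;
import java.util.List;

interface LedgerDir {
    void flushWriteCache() throws IOException;      // flush this dir's cache
    void persistLastMark(long position) throws IOException; // lastMark file
}

class CheckpointOnCacheFull {
    private volatile long journalLastMark;          // shared journal position
    private final List<LedgerDir> ledgerDirs;       // ledger1, ledger2, ...

    CheckpointOnCacheFull(List<LedgerDir> ledgerDirs) {
        this.ledgerDirs = ledgerDirs;
    }

    void onForceWriteComplete(long syncedOffset) {
        journalLastMark = syncedOffset;
    }

    void onWriteCacheFull(LedgerDir fullDir) throws IOException {
        fullDir.flushWriteCache();                  // only this dir hits disk
        for (LedgerDir dir : ledgerDirs) {
            // BUG: the other dirs' write caches are still unflushed, yet
            // their lastMark files now claim the shared journal position;
            // a `kill -9` here makes replay skip their cached entries.
            dir.persistLastMark(journalLastMark);
        }
    }
}
```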

There are two solutions for this bug.
#### Update `lastMark` position individually.
- When EntryLogger1 triggers the checkpoint, we only update
EntryLogger1's `lastMark` position instead of also updating
EntryLogger2's `lastMark` position at the same time.
- When SyncThread triggers the checkpoint, we update all the
EntryLoggers' `lastMark` positions.
- When determining whether a journal file can be deleted, we should
take the smallest `lastMark` position among all the writable
EntryLoggers, and delete only the journal files below that smallest
`lastMark` position.
- When replaying the journal on bookie startup, we need to get the
smallest `lastMark` position and replay the journal files from that
position; otherwise, we will lose data. (A sketch of the last two
points follows this list.)
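
Here is a minimal sketch of those two points, assuming a hypothetical
per-logger state holder (not BookKeeper's actual API):

```java
import java.util.Collection;

// Per-EntryLogger state: whether its disk is writable and the last
// `lastMark` position it has persisted (Java 16+ record).
record EntryLoggerState(boolean isWritable, long lastMark) {}

class PerLoggerLastMark {
    // Journal files strictly below this position are safe to delete,
    // and journal replay on startup must begin at this position.
    static long minWritableLastMark(Collection<EntryLoggerState> loggers) {
        return loggers.stream()
                .filter(EntryLoggerState::isWritable)
                .mapToLong(EntryLoggerState::lastMark)
                .min()
                .orElse(0L); // no writable logger: replay from the start
    }
}
```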

However, one case is hard to handle at the journal replay stage.
When a ledger disk transitions from ReadOnly to Writable mode, its
`lastMark` position is an old value. Using the old position to replay
the journal files will lead to a target-journal-file-not-found
exception.

#### Only update `lastMark` position in SyncThread
Two places can trigger a checkpoint.
- The scheduled task in SyncThread.doCheckpoint
- A ledgerDir's write cache filling up and triggering a flush

The second path is the root cause of the data loss when the bookie is
configured with multiple ledger directories.
We can turn off the `lastMark` position update in the second path and
let only SyncThread update the `lastMark` position during its
checkpoint when multiple ledger directories are configured, as
sketched below.
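
A minimal sketch of that gating, again with hypothetical names:

```java
// Keep the flush-on-full path, but skip its lastMark update when more
// than one ledger directory is configured, leaving the SyncThread
// checkpoint as the only writer of the lastMark position.
class CheckpointPolicy {
    private final int numLedgerDirs;

    CheckpointPolicy(int numLedgerDirs) {
        this.numLedgerDirs = numLedgerDirs;
    }

    // Called from the write-cache-full flush path.
    boolean shouldUpdateLastMarkOnCacheFullFlush() {
        // With a single ledger dir the old behavior is safe; with
        // multiple dirs, defer lastMark updates to SyncThread.
        return numLedgerDirs == 1;
    }
}
```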

This is the simplest way to fix this bug, but it has two drawbacks.
- The `lastMark` position updates depend on SyncThread's checkpoint
interval. In Pulsar, the default interval is 60s, which means a
journal file can only expire after at least 60s.
- Journal replay on bookie startup depends on the `lastMark`
position, which means the bookie will replay at least 60s of journal
data before startup completes. It may slow down bookie startup.

IMO, the above two drawbacks are acceptable compared to data loss.

Thanks,
Hang

Gavin Gao <zhangmin...@apache.org> wrote on Fri, Jun 10, 2022 at 13:31:
>
> The problem is:
> flush when the writeCache is full is independent per ledger disk, but they
> share the same journal disk's lastMark value.
>
> On 2022/06/07 04:16:58 lordcheng10 wrote:
> > > In flushing
> > > the write cache, it will trigger a checkpoint to mark the journal’s
> > > lastMark position (100MB’s offset)
> >
> >
> > When flushing, only the lastMark value will be persisted to the file, but 
> > the lastMark value will not be updated.
> > The lastMark value is updated only when the ForceWriteRequest completes. So 
> > when the flush is triggered here, the position of lastMark is not 100MB's 
> > offset
> >
> >
> > I’m not sure whether I missed some logic.
> >
> > ------------------ Original Message ------------------
> > From: "Hang Chen" <chenh...@apache.org>;
> > Sent: Monday, May 30, 2022, 9:21 AM
> > To: "dev" <dev@bookkeeper.apache.org>;
> > Subject: [Discuss] Bookie may lose data even though we turn on fsync for the
> > journal
> >
> >
> >
> > We found one place where the bookie may lose data even though we turn
> > on fsync for the journal.
> > Condition:
> > - One journal disk, and turn on fsync for the journal
> > - Configure two ledger disks, ledger1, and ledger2
> >
> > Assume we write 100MB data into one bookie, 70MB data written into
> > ledger1's write cache, and 30 MB data written into ledger2's write
> > cache. Ledger1's write cache is full and triggers flush. In flushing
> > the write cache, it will trigger a checkpoint to mark the journal’s
> > lastMark position (100MB’s offset) and write the lastMark position
> > into both ledger1 and ledger2's lastMark file.
> >
> > At this time, the bookie shuts down without flushing the write
> > cache, for example killed by the `kill -9` command, and ledger2's
> > write cache (30MB) doesn’t get flushed to the ledger disk. But
> > ledger2's lastMark position, which was persisted into the lastMark
> > file, has been updated to 100MB’s offset.
> >
> > When the bookie starts up, the journal replay position will be
> > `min(ledger1's lastMark, ledger2's lastMark)`, and it will be 100MB’s
> > offset. The ledger2's 30MB data won’t be replayed and that data will
> > be lost.
> >
> > Please help take a look. I’m not sure whether I missed some logic.
> >
> > Thanks,
> > Hang
