[
https://issues.apache.org/jira/browse/BOOKKEEPER-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265519#comment-15265519
]
ASF GitHub Bot commented on BOOKKEEPER-926:
-------------------------------------------
GitHub user merlimat opened a pull request:
https://github.com/apache/bookkeeper/pull/41
BOOKKEEPER-926: Compacted entries are not properly synced before upda…
…ting index
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/merlimat/bookkeeper bk-926
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/bookkeeper/pull/41.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #41
----
commit 42dff817b5e85a12101d9674b323f237ddeaba78
Author: Matteo Merli <[email protected]>
Date: 2016-04-29T21:30:25Z
BOOKKEEPER-926: Compacted entries are not properly synced before updating
index
----
> Compacted entries are not properly synced before updating index
> ---------------------------------------------------------------
>
> Key: BOOKKEEPER-926
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-926
> Project: Bookkeeper
> Issue Type: Bug
> Affects Versions: 4.3.1
> Reporter: Matteo Merli
> Assignee: Matteo Merli
> Fix For: 4.4.0
>
>
> I have identified a couple of issues in Bookie compaction code.
> h4. Compacted entries are not properly synced when index is updated
> When compacting, we read "active" entries from an entry log and we re-append
> to current entry log. After compacting a number of entries, by default 100K,
> or at the very end, we need to update the index pointing to the new entry log
> and offset.
> Before updating the index, we need to wait for this entries to be flushed and
> fsynced, otherwise a bookie crash might leave the index updated, pointing to
> an invalid offset.
> The current code that is supposed to wait until flushed is:
> {code:java}
> // GarbageCollectorThread.java:178
> EntryLocation lastOffset = offsets.get(offsets.size()-1);
> long lastOffsetLogId = EntryLogger.logIdForOffset(lastOffset.location);
> while (lastOffsetLogId < entryLogger.getLeastUnflushedLogId() && running) {
> synchronized (flushLock) {
> flushLock.wait(1000);
> }
> lastOffset = offsets.get(offsets.size()-1);
> lastOffsetLogId = EntryLogger.logIdForOffset(lastOffset.location);
> }
> // update the index
> {code}
> The condition {{lastOffsetLogId < entryLogger.getLeastUnflushedLogId()}} is
> wrong, because if the last compacted entry was written in an earlier entry
> log than the least unflushed log, it means that the entries are already
> flushed and thus we don't need to wait.
> In the normal case what happens is that {{lastOffsetLogId} is actually the
> current entryLog and it's equal to {{entryLogger.getLeastUnflushedLogId()}},
> so we don't wait. But, in this case the entries appended to the current
> entrylog are not flushed nor synced, hence the problem.
> h4. Exception during index flush
> Having an exception when updating the index, combined with the above issue,
> makes the bookie GC to stop indefinitely.
> What happens is that the offset list is not cleared, and at the next bookie
> GC iteration it will find the old compacted entries in that list, for which
> now the entryLogId is less than the current log id, and that makes the while
> loop to never exit.
> Another problem is that, any IOException during the index flush, will make
> the GC thread to bail out and it will not remove even the entry logs that
> were compacted and flushed. Next time, these entry logs will be compacted
> again.
> h4. Proposed solution
> I think the best solution is to trigger the {{entryLogger.flush()}} form the
> bookie GC thread before updating the indexes. That would simplify the code
> and I don't see any disadvantages in doing that.
> Another improvement would be to delete compacted entry logs individually, as
> soon as the compacted data is flush, without waiting the end of the whole
> compaction process.
> The advantages are :
> * If compaction stop halfway, at least we don't have to re-compact what we
> just compacted
> * Compaction won't require significant space overhead. Today a major
> compaction can end up reappending a large amount of data and then deleting
> all the entry logs at the very end, requiring twice the size of the active
> data set to be stored on disk at some point in time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)