Re: FSFS commit failure should release txn proto-rev lock

Julian Foad Tue, 28 Jul 2020 06:44:55 -0700

Ping: can anyone comment on this proposed patch?

- Julian



Julian Foad wrote:

CC'ing more possible experts Dmitry, Evgeny: any thoughts about whether this change makes sense for FSFS commit?
- Julian


Julian Foad wrote:
TL;DR: I propose a change to the FSFS commit-transaction function, to release the proto-rev write lock if an error occurs while it has this lock.
The practical applications of this change are rather obscure, which perhaps explains why it has not been needed before. In particular, it apparently is not needed for the way the rest of standard Subversion drives FSFS, but may be needed for other users of FSFS. I have come across this case in WANdisco's replicator, but as there are other peculiarities in how that drives FSFS, let us not confuse the issue by looking too closely at it. It appears the issue would apply to other users of FSFS too.
In the FSFS commit-transaction code path (in svn_fs_fs__commit) there is a region where it acquires an exclusive write lock on the prototype revision (proto-rev). There are cases where code in this region can fail, and there is no release of the lock in the error return path. That means if the calling process re-tries, the "writing" flag is still set in the transaction object in memory, and this causes an "already locked" error.
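To illustrate the failure mode described above, here is a minimal toy model, not the FSFS code itself. Every name in it (toy_txn, toy_lock, toy_commit_leaky and the error codes) is hypothetical and exists only for this sketch; the real code uses svn_error_t and the transaction object in transaction.c.

```c
#include <assert.h>
#include <stddef.h>

/* Toy error codes; hypothetical, not Subversion's svn_error_t. */
enum toy_err { TOY_OK = 0, TOY_ERR_CORRUPT, TOY_ERR_ALREADY_LOCKED };

/* Toy stand-in for the in-memory transaction object. */
struct toy_txn { int being_written; };

/* Take the in-memory lock, or fail if the flag is already set. */
static enum toy_err toy_lock(struct toy_txn *txn)
{
  assert(txn != NULL);
  if (txn->being_written)
    return TOY_ERR_ALREADY_LOCKED;
  txn->being_written = 1;
  return TOY_OK;
}

/* Clear the in-memory lock flag. */
static void toy_unlock(struct toy_txn *txn)
{
  assert(txn != NULL);
  txn->being_written = 0;
}

/* Commit with the leak: the error path returns without unlocking. */
static enum toy_err toy_commit_leaky(struct toy_txn *txn, int corrupt)
{
  enum toy_err err = toy_lock(txn);
  if (err != TOY_OK)
    return err;
  if (corrupt)
    return TOY_ERR_CORRUPT;  /* the lock flag stays set here */
  toy_unlock(txn);
  return TOY_OK;
}
```

In this model a first failing call returns the corruption error, and a re-try on the same toy_txn returns the "already locked" error, which mirrors the behaviour reported.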
In regular Subversion we abandon a transaction if it fails at this stage, and so never hit the problem. There are failure modes where a re-try could not succeed, notably after we move the proto-rev file into its final location, breaking the transaction; this case is called out in comments in the function and will remain after this change. Abandoning the transaction is a safe and effective way to use FSFS. However, other users of FSFS may prefer to re-try in certain other cases.
The case WANdisco encountered is where some old repository corruption (SVN-4858) was detected and reported at some point in this code region. Although the commit would not be able to succeed, it was important to them that the same error should be reported again during a re-try, and what was observed was that the "already locked" error was thrown instead.
I suppose disk being temporarily inaccessible is one class of error where a re-try might be successful.
The attached test and patch demonstrate and fix the problem.
This patch does not attempt to make it possible to re-try a failed commit in all cases. Some remaining cases are noted in the patch log message which is repeated here:
[[[
Roll back the transaction lock state so re-trying a failing commit results in the same error again instead of an "already locked" error.
The problem was that when a commit returned a failure from within the code region where it held a transaction proto-rev lock, it did not unlock it. Although the FSFSWD replicator replaces the transaction files on disk, the lock status remained on the transaction object in memory and a subsequent retry then failed with "already locked" error instead of the same failure mode as the first attempt.
The solution here is to reset the lock status of the in-memory transaction object before returning a commit failure.
This implementation addresses cases where the commit fails and returns an error (e.g. due to detecting repository corruption at this point, as in the case reported in NV-7983), and the lock can be successfully unlocked using the regular unlock code path.
Cases not addressed:
* There are conceivable failure modes where this regular unlocking would not succeed, e.g. disk files becoming inaccessible, and this patch would not address those cases. These could perhaps be addressed by adding a lock clean-up function that ignores errors in clean-up, and using that instead of the regular unlocking code.
* This implementation does not address cases where the process crashes in this code region. (In such cases the in-memory 'is writable' flag would not be preserved anyway so that is out of scope.)
### NOT YET TESTED:
For FSFSWD, this implementation should also work where a failure occurs after moving the proto-rev file to the final revision-file location. FSFSWD re-copies the transaction directory before re-trying, and so this should succeed. For regular FSFS, it does not address this case: a re-try in this case will fail to open the transaction proto-rev file.
### This patch includes debugging code.

* subversion/libsvn_fs_fs/transaction.c
   (TEST_COMMIT_FAIL): Debug code.
   (ERR_PROTO_REV_LOCKED): New macro.
   (commit_body): Use the new macro to handle errors in the region where a proto-rev lock is held, and unlock it in those cases.
]]]
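The idea in the log message, unlocking before returning an error from the locked region, can be sketched roughly as follows. This is a self-contained toy, not the patch: TOY_ERR_UNLOCK is a hypothetical stand-in that only illustrates the pattern (it is not the definition of ERR_PROTO_REV_LOCKED), and toy_txn, toy_step and the error codes are likewise invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy error codes; hypothetical, not Subversion's svn_error_t. */
enum toy_err { TOY_OK = 0, TOY_ERR_CORRUPT, TOY_ERR_ALREADY_LOCKED };

/* Toy stand-in for the in-memory transaction object. */
struct toy_txn { int being_written; };

static enum toy_err toy_lock(struct toy_txn *txn)
{
  assert(txn != NULL);
  if (txn->being_written)
    return TOY_ERR_ALREADY_LOCKED;
  txn->being_written = 1;
  return TOY_OK;
}

static void toy_unlock(struct toy_txn *txn)
{
  assert(txn != NULL);
  txn->being_written = 0;
}

/* Hypothetical macro: if EXPR fails, release the lock on TXN and
   return the original error from the enclosing function. */
#define TOY_ERR_UNLOCK(expr, txn)      \
  do {                                 \
    enum toy_err toy_e = (expr);       \
    if (toy_e != TOY_OK) {             \
      toy_unlock(txn);                 \
      return toy_e;                    \
    }                                  \
  } while (0)

/* A step inside the locked region that may detect corruption. */
static enum toy_err toy_step(int corrupt)
{
  return corrupt ? TOY_ERR_CORRUPT : TOY_OK;
}

/* Commit that unlocks on both the error path and the success path. */
static enum toy_err toy_commit_fixed(struct toy_txn *txn, int corrupt)
{
  enum toy_err err = toy_lock(txn);
  if (err != TOY_OK)
    return err;
  TOY_ERR_UNLOCK(toy_step(corrupt), txn);
  toy_unlock(txn);
  return TOY_OK;
}
```

With this shape a re-try after a failure reports the same error as the first attempt. The toy unlock cannot fail; the case where unlocking itself fails is the first "not addressed" item in the log message.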
My question is, does this change (without the debugging code) make sense as an improvement to FSFS?
- Julian
