I have opened issue #3696, <http://subversion.tigris.org/issues/show_bug.cgi?id=3696>, to track this.
- Julian On Fri, 2010-07-30 at 16:17 +0100, Julian Foad wrote: > On Tue, 2010-06-22 at 13:57 +0100, Julian Foad wrote: > > I am investigating the following DAV + FSFS server error message which > > we at WANdisco have seen rarely, but in at least two real installations, > > and am sharing it here in case anyone can shed light on it. It happens > > during a DAV 'MERGE': > > > > Can't open file '[...]/db/transactions/props': No such file or directory > > > > FSFS is trying to open an invalid path; it should be > > '[...]/db/transactions/<txn-id>/props'. From the source code, I can see > > how this must happen at a low level: the txn-id path component will be > > omitted if and only if something passes a NULL transaction-id string to > > fs_fs.c:get_txn_proplist(). > > > > But why and from where? > > A status update: > > I've traced the calls to get_txn_proplist(), a few call levels up, and I > believe one of these must have supplied txn_id = NULL. > > Callees of get_txn_proplist(), with txn-id parameter identified by name > and by param-list position: > > get_txn_proplist(,,txn_id) > svn_fs_fs__change_txn_props(txn->id) # DOESN'T CAUSE THE ERROR > txn_vtable.change_props(txn->id) > svn_fs_fs__change_txn_prop(txn->id) > commit_body() > svn_fs_fs__begin_txn(using svn_fs_fs__create_txn()) > svn_fs_fs__begin_obliteration_txn(using ...) > svn_fs_fs__get_txn(txn_id) # CAN CAUSE THE ERROR > svn_fs_fs__open_txn(,,name) > svn_fs_t.vtable.open_txn(,,name) > svn_fs_open_txn(,,name) > dav_svn__abort_txn(,txn_name) > prep_working() # NO, txn_name != NULL > prep_private(comb->priv.root.txn_name) > DAV_RESOURCE_TYPE_PRIVATE > is_our_resource() # NO, txn name != NULL > open_txn(,,txn_name) > dav_svn__checkout() > merge() > dav_svn__hooks_vsn.merge > fs_merge(,,,target_root->txn) > svn_fs_root_t.vtable.merge(,,,target_root->txn) > svn_fs_merge(,,target_root->txn) > # no callers? > svn_fs_fs__get_txn_ids(,,,txn_name) > svn_fs_fs__dag_txn_root(,,txn_id) > root_node<-open_path<-get_dag(,root->txn) > svn_fs_fs__dag_txn_base_root(,,txn_id) > merge_changes(,,txn->id) > svn_fs_fs__commit_txn(,,txn->id) > svn_fs_txn_t.vtable.commit(,,txn->id) > svn_fs_commit_txn(,,txn->id) > fs_merge() > svn_fs_fs__dag_clone_root(,,txn_id) > mutable_root_node(,root->txn) > make_path_mutable(root->txn) > commit_body(baton->txn->id) # DOESN'T CAUSE THE ERROR > svn_fs_fs__commit(,,txn->id) > svn_fs_fs__commit_txn(,,txn->id) > svn_fs_fs__txn_proplist(,txn->id) # CAN CAUSE THE ERROR > svn_fs_txn_t.vtable.get_proplist(,txn->id) > svn_fs_txn_proplist(,,txn->id) > commit_body() > svn_fs_fs__txn_prop(,txn->id) > svn_fs_txn_t.vtable.get_prop(,txn->id) > svn_fs_txn_prop(,txn->id) > svn_fs_fs__txn_root(,txn->id) > svn_fs_txn_t.vtable.root(,txn->id) > svn_fs_txn_root(,txn->id) > > The ones annotated "CAN CAUSE THE ERROR" did result in the "Can't open > 'db/transactions/props'" error when I forced txn_id=NULL at these places > in a trivial test scenario, and the ones marked "DOESN'T", didn't. > > >From the fact that a DAV MERGE was the operation under way, I would > hazard a guess that dav_svn__hooks_vsn.merge() might be the call that's > going wrong. I haven't tried to trace that back to its caller (in > mod_dav perhaps?). > > I don't know if this line of attack has any mileage left in it. I'm > thinking of leaving it at that for now, unless anyone comes up with > suggestions for how to proceed. I don't feel capable of setting up a > test system that would replicate the sequence of DAV messages that > resulted in the error condition. > > (The issue of the current assertion leaving a DOS attack possible is > still in my sights - email thread "Re: svn commit: r980046".) > > - Julian > > > > The error might be triggered by the WANdisco software sending commands > > in the wrong sequence, or something like that, and so to that extent it > > may not be interesting to the general community. In one case, detailed > > below, it happened after a DELETE was sent, more than an hour into a > > very long commit. In another case, it happened after a start-commit > > hook failed. The server might be receiving a DAV MERGE request when it > > isn't expecting one. Can anyone comment on that, given this limited > > information? > > > > Here are some Apache access log entries from one occurrence, with paths > > anonymized and the client Id string (always "SVN/1.6.6 (r40053) > > neon/0.28.3") omitted for brevity: > > [[[ > > 127.0.0.1 - UserK [16/Dec/2009:20:25:17 -0800] > > "MKACTIVITY /svn/RepoJ/!svn/act/ActivityA HTTP/1.1" 201 312 "-" > > 127.0.0.1 - UserK [16/Dec/2009:23:20:45 -0800] > > "PROPPATCH /svn/RepoJ/!svn/wrk/ActivityA/branches/BranchD/PathG HTTP/1.1" > > 207 582 "-" > > 127.0.0.1 - UserM [17/Dec/2009:00:12:45 -0800] > > "MERGE /svn/RepoJ/branches/BranchH/PathC HTTP/1.1" 200 953 "-" > > 127.0.0.1 - UserM [17/Dec/2009:00:12:59 -0800] > > "MERGE /svn/RepoJ/branches/BranchP/PathC HTTP/1.1" 200 965 "-" > > 127.0.0.1 - UserK [17/Dec/2009:00:21:19 -0800] > > "DELETE /svn/RepoJ/!svn/act/ActivityA HTTP/1.1" 204 - "-" > > 127.0.0.1 - UserK [17/Dec/2009:00:32:45 -0800] > > "MKACTIVITY /svn/RepoJ/!svn/act/ActivityB HTTP/1.1" 201 312 "-" > > 127.0.0.1 - UserK [17/Dec/2009:00:39:58 -0800] > > "MERGE /svn/RepoJ/branches/BranchD/PathJ HTTP/1.1" 500 279 "-" > > 127.0.0.1 - UserK [17/Dec/2009:00:39:59 -0800] > > "PROPPATCH /svn/RepoJ/!svn/wrk/ActivityB/branches/BranchD/PathE HTTP/1.1" > > 207 569 "-" > > 127.0.0.1 - UserK [17/Dec/2009:00:39:59 -0800] > > "MERGE /svn/RepoJ/branches/BranchD/PathJ HTTP/1.1" 500 279 "-" > > ]]] > > > > and the corresponding error log: > > [[[ > > [Thu Dec 17 00:39:58 2009] [error] [client 127.0.0.1] > > Could not MERGE resource "/svn/RepoJ/!svn/act/ActivityA" into > > "/svn/RepoJ/branches/BranchD/PathJ". [500, #0] > > [Thu Dec 17 00:39:58 2009] [error] [client 127.0.0.1] > > There was a problem opening the transaction specified by this activity. > > [500, #2] > > [Thu Dec 17 00:39:58 2009] [error] [client 127.0.0.1] > > Can't open file '/svn/repos/RepoJ/db/transactions/props': No such file or > > directory [500, #2] > > ]]] > > > > The error appears to match the second-last MERGE in the access log. I > > don't have a copy of the body of the MERGE request in this case. > > > > The error refers to an activity Id that was the subject of the DELETE > > that was logged 18 minutes earlier. A comment from one of our guys > > looking at the logs: "We are seeing 22,600 changes for Activity A and no > > merge at the end (just a DELETE). Then we see about 931 commands for > > Activity B and a MERGE that fails for Activity A." > > > > Is this a case where the client side shouldn't be sending a MERGE at > > this point? Even if that is the primary problem, I'm still interested > > in Subversion's response to this MERGE request. > > > > I want to assert(txn-id != NULL) in the relevant FSFS functions, as in > > the attached patch. I determined the comments in the attached patch > > after manually passing NULL instead of the txn-id. However, that won't > > get me far on its own. > > > > To try to reproduce the error, do you think I should try sending a > > sequence of DAV commands similar to the sequence shown above? > > > > Any other ideas on what might be happening or how to diagnose this? > > > > - Julian > > > >