Hi Mike and Julian, Sorry for the delayed response; been working on/thinking about other mergey stuff lately...
A few inline comments and then a proposed course of action at the end. On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cmpil...@collab.net> wrote: > > The question then becomes, "How do users deal with legitimate partial dumps > that will be loaded atop something other than loads from previous > incremental windows?" I think they do this the same way they handle the > copy case: either hand-hacking the dumpstream or using 'svndumpfilter'. So > maybe 'svndumpfilter' grows the logic and options required to pare away > mergeinfo that doesn't meet some criteria (such as "ranges outside the dump > stream range")? I vote for svndumpfilter growing the logic to filter ranges predating the start of the dump stream in the same way it strips mergeinfo with merge sources filtered from the dump stream. Basically this means moving the r927243 functionality to svndumpfilter. No really sure we need a new option, seems we could overload --skip-missing-merge-sources to include this behavior. Thoughts? > Julian Foad wrote: >> On Wed, 2010-04-14, C. Michael Pilato wrote: >>> Assuming similar behavior for mergeinfo handling, we have, at a minimum: >>> >>> 1. 'svnadmin dump' warns when it is in incremental mode and must generate >>> mergeinfo from a merge source that predates the beginning of the dump >>> window, but it's only a warning and the dump continues. Why only --incremental mode? We should warn for both --incremental dumps *and* non-incremental dumps because in the latter case we might be specifying a dump range -rX:Y where rX>0* and there might be mergeinfo that refers to revs < X. And while we might want to be clever about skipping the warning when no range is specified (i.e. defaulting to -r0:HEAD), even then we should warn because issue #3020 might already have resulted in a r1 with mergeinfo in the starting repository. * Admittedly we could probably choose some other minimum rev, maybe r3 (which is the minimum numbers of revisions I can think to create mergeinfo; r1 - populate the repos, r2 - create a branch and make some changes on the branch parent, r3 - merge r2 to the branch. >> +1 on that, both for consistency with copy-source info and because it's >> a good policy anyway. >> >> A detail: I haven't tried it, but would want it to print this warning >> (and the copy-source one) at the end of the dump, and not just in the >> middle of the long list of "Dumping rX" messages. > > Currently the warnings are all "inline". But I agree that it would be > beneficial to also just track boolean flags (had_suspect_copies, > had_suspect_mergeinfo) throughout the dump and repeat warnings at the end: > > if had_suspect_copies: > print "WARNING: The range of revisions dumped contained references " > "to copy sources outside that range." > if had_suspect_mergeinfo: > print "WARNING: The range of revisions dumped contained references " > "to mergeinfo outside that range." > >>> 2. 'svnadmin load' does nothing smart, trusting that the dump it's being >>> fed is a sensible one. >> >> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a >> warning if it refers to a non-existent source. The admin then gets to >> figure out whether it was bad in the first place or because of his/her >> partial-dump/partial-load scenario. >> >> However, it depends how efficient the checking is. If that would make >> the 'load' really slow, I can see that not being wanted. I created a test repository with 1000 text changes, then 1000 merges of each change, one at a time, to a branch. Then I dumped (non-incrementally) the first 1000 revs to a dump file and the second 1000 revs to a second dump file. I loaded the first dump file into two empty repositories. Then I used trunk build of svnadmin to load the second dump file into the first repos. This took 00:03:28.890. Then I used svnadmin with the attached patch to do the same with the second repository. The patch uses svn_fs_check_path() to check if each merge source-revision pair in the load stream which predates the first rev of the load stream, already exists in the target repository (which in this example they all do, there will be no filtering). No unsurprisingly, this took significantly (62%) longer, 00:05:39.609. This patch is fairly inefficient since it checks the same path-rev pairs multiple times. E.g.: We load a dump file with orginignal revisions start at r50. At r100 we encounter our first mergeinfo, '/trunk:r50' on some path. We check that tr...@50 exists, it does, and we allow the mergeinfo. Then we load r101, the mergeinfo on the same path is now '/trunk:50-51'. We check that tr...@50 and tr...@51 both exist...and so on. We need to leverage that first check of tr...@50 and not make it again (and again and again). Keep a mergeinfo catalog of checked paths? That would work, but memory use might be a problem with big repositories... ...So it can be made faster, but before I go further with it, does svn_fs_check_path() seem the right way to go? I'm no FS guru, so maybe there is a better way that is not obvious to me. > Agreed. > > -- > C. Michael Pilato <cmpil...@collab.net> > CollabNet <> www.collab.net <> Distributed Development On Demand Taking what you've had to say into consideration, I propose the following: 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243. This stops svnadmin load from automatically filtering mergeinfo referring to revisions outside of the load stream. 2) Reapply or reimplement the part of r327243 that accounts for the case where the first loaded revision in a load stream has mergeinfo, see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10 item #1. 3) Keep the inline copy-source warnings as they exist today in svnadmin dump, but add a new end-of-process summary warning e.g.: * Dumped revision 504. * Dumped revision 505. * Dumped revision 506. * Dumped revision 507. WARNING: Referencing data in revision 250, which is older than the oldest WARNING: dumped revision (504). Loading this dump into an empty repository WARNING: will fail. * Dumped revision 58. * Dumped revision 59. * Dumped revision 510. * Dumped revision 511. . . . WARNING: The range of revisions dumped contained references to copy sources outside that range. 4) Implement inline and summary functionality analogous to 3) but for mergeinfo ranges outside of the dump stream, e.g.: * Dumped revision 504. * Dumped revision 505. * Dumped revision 506. * Dumped revision 507. WARNING: Mergeinfo referencing revision(s) prior to revision 250, which is older WARNING: than the oldest dumped revision (504). Loading this dump may result WARNING: in invalid mergeinfo. * Dumped revision 58. * Dumped revision 59. * Dumped revision 510. * Dumped revision 511. . . . WARNING: The range of revisions dumped contained references to mergeinfo outside that range. Also, as mentioned earlier, I think this warning should kick in regardless of whether --incremental is used. 5) Overload svndumpfilter's --skip-missing-merge-sources option to also remove mergeinfo that predates the starting revision of the dump stream. 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist in the load steam or the target repos: A) It doesn't. B) Only checks that the revs are valid (fast, but can allow mergeinfo that describes invalid mergesource:rev pairs) C) Find an efficient way to test for valid mergesource:rev pairs Paul
Index: subversion/libsvn_repos/load.c =================================================================== --- subversion/libsvn_repos/load.c (revision 936046) +++ subversion/libsvn_repos/load.c (working copy) @@ -277,20 +277,98 @@ { apr_pool_t *subpool = svn_pool_create(pool); apr_hash_t *mergeinfo; + svn_mergeinfo_t outside_mergeinfo; apr_hash_t *final_mergeinfo = apr_hash_make(subpool); apr_hash_index_t *hi; SVN_ERR(svn_mergeinfo_parse(&mergeinfo, initial_val->data, subpool)); - /* Issue #3020: If the dump stream represents only part of a repository, - then mergeinfo in the stream may refer to revisions outside of the - stream. Remove any such invalid ranges before renumbering. */ - SVN_ERR(svn_mergeinfo__filter_mergeinfo_by_ranges( - &mergeinfo, mergeinfo, - rb->pb->youngest_old_rev, rb->pb->oldest_old_rev - 1, - TRUE, /* Allow only references to revs that exist in the load stream. */ - subpool, subpool)); + if (apr_hash_count(rb->pb->rev_map) > 0) + { + /* Find any mergeinfo which predates the start of the dump load stream. */ + SVN_ERR(svn_mergeinfo__filter_mergeinfo_by_ranges( + &outside_mergeinfo, mergeinfo, + rb->pb->youngest_old_rev, rb->pb->oldest_old_rev - 1, + FALSE, /* Allow only references to revs that exist in the load stream. */ + subpool, subpool)); + /* Issue #3020: If the dump stream represents only part of a repository, + then mergeinfo in the stream may refer to revisions outside of the + stream. Remove any such invalid ranges before renumbering. */ + SVN_ERR(svn_mergeinfo__filter_mergeinfo_by_ranges( + &mergeinfo, mergeinfo, + rb->pb->youngest_old_rev, rb->pb->oldest_old_rev - 1, + TRUE, /* Allow only references to revs that exist in the load stream. */ + subpool, subpool)); + } + else /* No mapping, so this is the first rev loaded. */ + { + outside_mergeinfo = mergeinfo; + mergeinfo = apr_hash_make(subpool); + } + /* Check the mergeinfo that predates the load stream. If it refers + to path-revs that exist in the target repository keep it, + otherwise drop it. */ + if (apr_hash_count(outside_mergeinfo)) + { + svn_mergeinfo_t all_keeper_mergeinfo = apr_hash_make(subpool); + + for (hi = apr_hash_first(subpool, outside_mergeinfo); + hi; + hi = apr_hash_next(hi)) + { + const char *merge_source = svn__apr_hash_index_key(hi); + apr_array_header_t *rangelist = svn__apr_hash_index_val(hi); + svn_node_kind_t merge_source_kind; + struct parse_baton *pb = rb->pb; + int i; + svn_fs_root_t *root_p; + svn_merge_range_t *range; + + for (i = 0; i < rangelist->nelts; i++) + { + int y; + + range = APR_ARRAY_IDX(rangelist, i, svn_merge_range_t *); + + for (y = range->start + 1; y <= range->end; y++) + { + svn_error_t *err = svn_fs_revision_root(&root_p, pb->fs, + y, subpool); + if (err) + { + if (err->apr_err == SVN_ERR_FS_NO_SUCH_REVISION) + { + svn_error_clear(err); + continue; + } + return err; + } + + SVN_ERR(svn_fs_check_path(&merge_source_kind, + rb->txn_root, + merge_source, subpool)); + if (merge_source_kind != svn_node_none) + { + apr_array_header_t *keeper_rangelist = + svn_rangelist__initialize(range->start, range->end, + range->inheritable, + subpool); + svn_mergeinfo_t keeper_mergeinfo = + apr_hash_make(subpool); + apr_hash_set(keeper_mergeinfo, merge_source, + APR_HASH_KEY_STRING, keeper_rangelist); + SVN_ERR(svn_mergeinfo_merge(all_keeper_mergeinfo, + keeper_mergeinfo, + subpool)); + } + } + } + } + if (apr_hash_count(all_keeper_mergeinfo)) + SVN_ERR(svn_mergeinfo_merge(mergeinfo, all_keeper_mergeinfo, subpool)); + } + for (hi = apr_hash_first(subpool, mergeinfo); hi; hi = apr_hash_next(hi)) { const char *merge_source; @@ -1250,32 +1328,22 @@ if (strcmp(name, SVN_PROP_MERGEINFO) == 0) { - /* If we haven't yet committed any revisions then *any* mergeinfo - will refer to the wrong revisions or revisions that don't exist. - Either way none of this mergeinfo can be valid, so remove it all. */ - if (apr_hash_count(rb->pb->rev_map) == 0) + /* Renumber mergeinfo as appropriate. */ + svn_string_t *renumbered_mergeinfo; + + SVN_ERR(renumber_mergeinfo_revs(&renumbered_mergeinfo, value, rb, + nb->pool)); + value = renumbered_mergeinfo; + if (parent_dir) { - value = NULL; + /* Prefix the merge source paths with PARENT_DIR. */ + /* ASSUMPTION: All source paths are included in the dump + stream. */ + svn_string_t *mergeinfo_val; + SVN_ERR(prefix_mergeinfo_paths(&mergeinfo_val, value, + parent_dir, nb->pool)); + value = mergeinfo_val; } - else - { - /* Renumber mergeinfo as appropriate. */ - svn_string_t *renumbered_mergeinfo; - - SVN_ERR(renumber_mergeinfo_revs(&renumbered_mergeinfo, value, rb, - nb->pool)); - value = renumbered_mergeinfo; - if (parent_dir) - { - /* Prefix the merge source paths with PARENT_DIR. */ - /* ASSUMPTION: All source paths are included in the dump - stream. */ - svn_string_t *mergeinfo_val; - SVN_ERR(prefix_mergeinfo_paths(&mergeinfo_val, value, - parent_dir, nb->pool)); - value = mergeinfo_val; - } - } } return svn_fs_change_node_prop(rb->txn_root, nb->path,