Alexandre Oliva <ol...@gnu.org>:
> I know very little about reposurgeon, but I'm concerned that, should we
> make the conversion with it, and later identify e.g. missed branches, we
> might be unable to make such an incremental recovery.  Can anyone
> alleviate my concerns and let me know we could indeed make such an
> incremental recovery of a branch missed in the initial conversion, in
> such a way that its commit history would be shared with that of the
> already-converted branch it branched from?

Reposurgeon has a reparent command.  If you have determined that a
branch is detached or has an incorrect attachment point, patching the
metadata of the root node to fix that is very easy.

> Now, would it be too much of a burden to insist that the commit graphs
> out of both conversions be isomorphic, and maybe mappings between the
> commit ids (if they can't be made identical to begin with, that is) be
> generated and shared, so that the results of both conversions can be
> efficiently and mechanically compared (disregarding expected
> differences) not only in terms of branch and tag names and commit
> graphs, but also tree contents, commit messages and any other metadata?
> Has anything like this been done yet?

On the GCC repository, no.  There are very serious practical problems
with full verification of git against SVN, stemming mainly from the
fact that Subversion checkout on a repository of this size is extremely
slow.  IIRC Joseph at one point estimated a check time on the order of
months due to that overhead alone.

If you're talking about a commit-by-commit comparison between two
conversions that assumes one or the other is correct, that is
theoretically possible and - because git retrieval is much faster -
could theoretically be done in a reasonable amount of time.  But there
is a lot of devil in the practical details.

The reposurgeon suite once included a tool for such comparisons.  Last
year this happened:

commit b8a609925ba70a6b68f9eda1d748eb667ad2fa59
Author: Eric S. Raymond <e...@thyrsus.com>
Date:   Fri Aug 24 12:40:46 2018 -0400

    Retire repodiffer. Its only use case was checks against git-svn...

    ...which we now know to make such bad conversions that on larger than
    trivial repos the differ would be prohibitively noisy.

Maxim's scripts probably make a better conversion than bare git-svn,
because he uses git-svn only for linear basic blocks and thereby avoids
its worst failure modes.  In theory I could dust off repodiffer and
apply it.

That's in theory.

In practice, on a repository this size I am not greatly optimistic
about getting a result that could be interpreted by a Mark I brain.
The reasons go beyond git-svn's brain damage to the same
ontological-mismatch problems that make SVN-to-git conversion a
headache in general.

You might think at least there'd be a 1:1 correspondence between
commits in the two conversions, but that's not going to be true for a
couple of different reasons.

1. Split commits.  Reposurgeon decomposes these into pieces one per git
   branch.  I don't know what Maxim's scripts do.  I think Joseph turned
   up that there are over a thousand of these in the GCC history.

2. There are three classes of commits in Subversion that don't really
   fit the git data model, (1) directory creation/deletion commits, (2)
   directory copy commits, (3) property changes with no associated blob.

For each of these exceptional commits a converter to Git has a choice of
dropping the commit, turning it into some sort of annotated tag, or
leaving it in place as a zero-op commit (anomalous but not forbidden in
the git model).  It is pretty much guaranteed that different converters
will make different choices about these, which will make for huge
amounts of noise in your attempt at a diff.

Checking for DAG isomorphism: again, theoretically possible,
practically pretty daunting.  It could be worse - general graph
isomorphism is not even known to be polynomial-time - but in this case
we can label corresponding commits with matching legacy IDs, which
should make possible an isomorphism check in linear time with a trivial
algorithm.

Well, except for split commits.  That one would be solvable, albeit
painful.

The real problem here would be mergeinfo links.  It's not even obvious
what "correct" mapping of mergeinfo links is, in general, due to the
mismatch between Subversion's cherry-pick-based merge model and git's
branch merging.  Again, different converters will make different
choices.  Reconciling them would be not fun.

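To make the "trivial algorithm" for the legacy-ID-labelled check concrete, here is a minimal sketch in Python. It is an illustration only, not anything reposurgeon or repodiffer does. The dict-of-parents representation, the `compare_labelled_dags` name, and the `r123`-style labels are all hypothetical, and it assumes each legacy ID appears exactly once per conversion, which is precisely the assumption that split commits break.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: each conversion is a dict mapping a commit's
# legacy ID to the legacy IDs of its parents. Extracting this from a real
# repository is left out of the sketch.
Graph = Dict[str, List[str]]


def compare_labelled_dags(a: Graph, b: Graph) -> Tuple[Set[str], Set[str], List[str]]:
    """Compare two commit DAGs whose nodes carry matching legacy IDs.

    Returns (only_in_a, only_in_b, parent_mismatches). The graphs are
    isomorphic under the legacy-ID labelling when all three are empty.
    Parent order is ignored, which is itself an assumption: git does
    distinguish the first parent of a merge.
    """
    only_in_a = set(a) - set(b)
    only_in_b = set(b) - set(a)
    # One pass over the shared nodes, touching each parent edge once.
    parent_mismatches = [
        legacy_id
        for legacy_id in a
        if legacy_id in b and set(a[legacy_id]) != set(b[legacy_id])
    ]
    return only_in_a, only_in_b, parent_mismatches


if __name__ == "__main__":
    conv_a = {"r1": [], "r2": ["r1"], "r3": ["r2"], "r4": ["r2", "r3"]}
    conv_b = {"r1": [], "r2": ["r1"], "r3": ["r1"], "r5": ["r3"]}
    print(compare_labelled_dags(conv_a, conv_b))
```

Commits that one converter drops and the other keeps as tags or zero-op commits would all land in the first two sets, and merge parents derived from mergeinfo would land in the third, so the output is only as readable as the two conversions are similar.
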
There is another world of hurt lurking in "(disregarding expected
differences)".  How do you know what differences to expect?  How are
you going to specify them?  What will interpret that spec?

There are more months of work here - nasty, wearing toil, with no
guarantee of a result with a decent signal-to-noise ratio.  Even though
I'm quite literally the best-qualified person on earth to do it, I
flinch at the thought.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>