It's not true that nobody is working on this. I have been working on the
SVN dump in the meantime. You would not believe how complex it is to
process that (remote) dump. Let me highlight a few key issues:

1) There is no "one" Lucene SVN repository that can be transferred to git.
The history is a mess. Trunk, branches, tags -- all change paths at various
points in history. Entire projects are copied from *outside* the official
Lucene ASF path (when Solr, Nutch or Tika moved from the incubator, for
example).

2) Lucene's subpath of the SVN repository has ~50k commits. The ASF commit
history in which those 50k commits live has 1.8 *million* commits. I think
the git-svn sync crashes due to the sheer number of (empty) commits in
between actual changes.

3) There are a few commits that are gigantic. I mentioned Grant's 1.2G
patch, for example, but there are others (the second largest is 190 megs,
the third is 136 megs).

4) The size of JARs is really not an issue. The entire SVN repo I mirrored
locally (including empty interim commits to cater for svn:mergeinfos) is
4G. If you strip the stuff like javadocs and side projects (Nutch, Tika,
Mahout) then I bet the entire history can fit in 1G total. Of course
stripping JARs is also doable.

5) There is a lot of junk at the main SVN path, so you can't just version
the top-level folder. If you wanted to check out /asf/lucene, the resulting
folder would be enormous -- I terminated the checkout after it reached over
20 gigs. Well, technically you *could* do it, and it'd preserve perfect
history, but I wouldn't want to git co a past version that checks out all
the tags, branches, etc. This has to be mapped in a sensible way.

I think all of the above makes a (straightforward) conversion to git
problematic. Moving paths are a particular problem -- how to mark tags and
branches, where the main line of development is, etc. This conversion would
have to be guided and hand-tuned to make sense. The effort would only pay
for itself if we move to git; otherwise I don't see the benefit. Paul's
script is fine for keeping short-term history.

Dawid

P.S. Either the SVN repo at Apache is broken or SVN itself is broken, which
makes processing SVN history even more fun. This dump indicates Tika being
moved from the incubator to Lucene:

svnrdump dump -r 712381 --incremental https://svn.apache.org/repos/asf/ > out

But when you dump just Lucene's subpath, the output is broken (the last
changeset in the file is invalid: it carries no target):

svnrdump dump -r 712381 --incremental https://svn.apache.org/repos/asf/lucene > out
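
For comparing the two "out" files, a small sketch like the one below can list
the node records in each revision of a dump. It assumes the usual dumpfile
layout (header lines, a blank line, then an optional body of Content-length
bytes) and does not try to decide what is invalid; it only prints enough to
eyeball the difference. The function names are mine, not part of any SVN
tooling.

```python
def read_records(stream):
    """Yield (headers, body) for each record in a binary SVN dump stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line == b"\n":
            continue  # blank separator between records
        headers = {}
        while line not in (b"", b"\n"):
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
            line = stream.readline()
        # Records without a Content-length header (e.g. plain deletes) have no body.
        body = stream.read(int(headers.get("Content-length", 0)))
        yield headers, body


def summarize(stream):
    """Return {revision: [(action, path, copyfrom_path_or_None), ...]}."""
    revisions = {}
    current = None
    for headers, _body in read_records(stream):
        if "Revision-number" in headers:
            current = int(headers["Revision-number"])
            revisions[current] = []
        elif "Node-path" in headers and current is not None:
            revisions[current].append((
                headers.get("Node-action"),
                headers["Node-path"],
                headers.get("Node-copyfrom-path"),
            ))
    return revisions
```

Usage would be along the lines of summarize(open("out", "rb")), run once per
dump; the file has to be opened in binary mode because record bodies are
length-delimited raw bytes.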



On Tue, Dec 15, 2015 at 6:04 PM, Yonik Seeley <[email protected]> wrote:

> If we move to git, stripping out jars seems to be an independent decision?
> Can you even strip out jars and preserve history (i.e. not change
> hashes and invalidate everyone's forks/clones)?
> I did run across this:
>
> http://stackoverflow.com/questions/17470780/is-it-possible-to-slim-a-git-repository-without-rewriting-history
>
> -Yonik
