There are also some rather large '.dat' files in the history. I found
them by running a job to delete all blobs > 5MB from the history via:

$ java -jar ~/Downloads/bfg-1.12.3.jar --strip-blobs-bigger-than 5M
--protect-blobs-from trunk,branch_5x,branch_4x lucene-solr-mirror

> Deleted files
> -------------
> Filename                         Git id
> -------------------------------------------------------------------------------------------
> DoubleArrayTrie.dat            | 8babf9fa (16.8 MB), f3bfe15b (16.8 MB), ...
> TokenInfoDictionary$buffer.dat | 25938b37 (7.0 MB), 7f02420f (7.1 MB), ...
> TokenInfoDictionary$trie.dat   | 69e76d64 (16.8 MB)
> dat.dat                        | 7445d1c8 (16.0 MB), 79bd7c8b (16.8 MB), 37a215e5 (16.8 MB)
> europarl.lines.txt.gz          | e0366f10 (5.5 MB)
> tid.dat                        | 5a1e6199 (24.9 MB), 996d3fc5 (28.1 MB), ...
> tid_map.dat                    | 690fbea5 (6.3 MB), c1c01405 (6.3 MB), 7a8c1420 (6.4 MB)
> wiki_results.txt               | db9e9294 (19.8 MB), 52ff9357 (19.8 MB), ...
> wiki_sentence.txt              | 3a38f62e (19.0 MB)

Dropping just those files reduced the repo by 50MB; the overall size is now 131MB.

Note: one file larger than 5MB is still in trunk:

> * commit df1e3b32 (protected by 'trunk') - contains 1 dirty file :
> - lucene/test-framework/src/resources/org/apache/lucene/util/europarl.lines.txt.gz (5.5 MB)
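
A rough way to cross-check which blobs in the history exceed the threshold
without BFG, using only plain git and awk. This is a sketch, not what BFG
does internally: the function name `filter_large_blobs` is made up for
illustration, and the 5242880-byte threshold assumes "5M" means 5 MiB.

```shell
# Hypothetical helper: reads lines of "<type> <sha> <size-in-bytes> <path>"
# on stdin and prints "<size> <sha> <path>" for each blob strictly larger
# than the byte threshold given as $1, largest first.
filter_large_blobs() {
  awk -v min="$1" '
    $1 == "blob" && $3 + 0 > min + 0 {
      path = $0
      # Strip the first three fields so paths containing spaces survive.
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)
      print $3, $2, path
    }' | sort -rn
}
```

Run inside the mirror, it would be fed like this:

    $ git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        filter_large_blobs 5242880

Note that this reports uncompressed blob sizes, so the numbers will not
match the on-disk savings exactly.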


Also, I failed to provide the numbers on what `git reflog expire
--expire=now --all && git gc --prune=now --aggressive` does on a fresh mirror
checkout: it results in a repo size of 320MB. So, dropping the old jars
saves 120MB.
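
For anyone who wants to reproduce size numbers like these, one possible way
to measure the object store after the gc is to total what `git count-objects
-v` reports. This is only a sketch: the helper name `repo_size_kib` is made
up, and it is not necessarily how the figures above were obtained (a plain
`du` on the repo directory would give slightly different values).

```shell
# Hypothetical helper: reads `git count-objects -v` output on stdin and
# prints the combined size of loose objects ("size") and packs
# ("size-pack"), both of which git reports in KiB.
repo_size_kib() {
  awk -F': ' '
    $1 == "size" || $1 == "size-pack" { total += $2 }
    END { print total + 0 }'
}
```

Usage, from inside the repository:

    $ git count-objects -v | repo_size_kib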

-Steve

On Sun, May 31, 2015 at 4:39 PM, [email protected] <
[email protected]> wrote:

> I like where this is going!
>
> I also think history of source code is very important, but not history of
> ‘.jar’ files that shouldn’t have been in source control in the first
> place.  I’m fiercely negative about large binaries or ‘jar’ files that can
> be downloaded by the build system (e.g. ivy) in source control.  And it was
> already mentioned a full history (.jar’s & all) could be kept somewhere
> more for archival purposes — which is a good compromise, I think, since
> “build-ability” of history should be retained (assuming it’s even still
> possible, given Rob’s comments) but doesn’t have to be convenient (e.g. by
> it being in a separate repo).   +1 to that!
>
> If we were to come up with a new git repo that doesn’t have the ‘.jar’s,
> it’d be good to also streamline the history prior to the big Lucene + Solr
> merge due to the paths in source control as to where the trunk, branches,
> and tags lived.  It appears the current repo may have been a blind git
> import from subversion.  A hand-done process that is mindful of these
> things would result in a nice history.  I’ve done this sorta thing once (a
> project at my last job) and volunteer to do it here if we can get consensus
> on a move to git.
>
> ~ David
>
> On Sun, May 31, 2015 at 4:21 PM Dawid Weiss <[email protected]>
> wrote:
>
>> > I'd like to have full consolidated history, as much as possible,
>> > connect-the-dots across whatever CVS/SVN/etc repos to the extent
>> > maximally permitted by law, as Doug hints at. Just nuke the jars.
>>
>> I've done this (CVS->SVN->GIT) before. It wasn't that difficult.
>> Eventually (for git) you script it and it gets version after version
>> from CVS or SVN and appends it to git. I admit I didn't care much
>> about svn merging infos though. Any files can be removed/pruned by
>> rewriting git trees before they're published.
>>
>> Dawid
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
