There are also some rather large '.dat' files in the history. I found them by running a job to delete all blobs > 5MB from the history via:

$ java -jar ~/Downloads/bfg-1.12.3.jar --strip-blobs-bigger-than 5M --protect-blobs-from trunk,branch_5x,branch_4x lucene-solr-mirror

> Deleted files
> -------------
>
> Filename                         Git id
> -------------------------------------------------------------------------------------------
> DoubleArrayTrie.dat            | 8babf9fa (16.8 MB), f3bfe15b (16.8 MB), ...
> TokenInfoDictionary$buffer.dat | 25938b37 (7.0 MB), 7f02420f (7.1 MB), ...
> TokenInfoDictionary$trie.dat   | 69e76d64 (16.8 MB)
> dat.dat                        | 7445d1c8 (16.0 MB), 79bd7c8b (16.8 MB), 37a215e5 (16.8 MB)
> europarl.lines.txt.gz          | e0366f10 (5.5 MB)
> tid.dat                        | 5a1e6199 (24.9 MB), 996d3fc5 (28.1 MB), ...
> tid_map.dat                    | 690fbea5 (6.3 MB), c1c01405 (6.3 MB), 7a8c1420 (6.4 MB)
> wiki_results.txt               | db9e9294 (19.8 MB), 52ff9357 (19.8 MB), ...
> wiki_sentence.txt              | 3a38f62e (19.0 MB)

Dropping just those files reduced the repo by 50MB; the overall size is 131MB.

Note: one file larger than 5MB is still in trunk:

> * commit df1e3b32 (protected by 'trunk') - contains 1 dirty file :
>     - lucene/test-framework/src/resources/org/apache/lucene/util/europarl.lines.txt.gz (5.5 MB)

Also, I failed to provide the numbers for what `git reflog expire --expire=now --all && git gc --prune=now --aggressive` does on a fresh mirror checkout: it results in a repo size of 320MB. So, dropping the old jars saves 120MB.

-Steve

On Sun, May 31, 2015 at 4:39 PM, [email protected] <[email protected]> wrote:

> I like where this is going!
>
> I also think history of source code is very important, but not history of
> ‘.jar’ files that shouldn’t have been in source control in the first
> place. I’m fiercely negative about large binaries or ‘jar’ files that can
> be downloaded by the build system (e.g. ivy) in source control.
> And it was already mentioned a full history (.jar’s & all) could be kept
> somewhere more for archival purposes, which is a good compromise, I think,
> since “build-ability” of history should be retained (assuming it’s even
> still possible, given Rob’s comments) but doesn’t have to be convenient
> (e.g. by it being in a separate repo). +1 to that!
>
> If we were to come up with a new git repo that doesn’t have the ‘.jar’s,
> it’d be good to also streamline the history prior to the big Lucene + Solr
> merge due to the paths in source control as to where the trunk, branches,
> and tags lived. It appears the current repo may have been a blind git
> import from subversion. A hand-done process that is mindful of these
> things would result in a nice history. I’ve done this sorta thing once (a
> project at my last job) and volunteer to do it here if we can get consensus
> on a move to git.
>
> ~ David
>
> On Sun, May 31, 2015 at 4:21 PM Dawid Weiss <[email protected]> wrote:
>
>> > I'd like to have full consolidated history, as much as possible,
>> > connect-the-dots across whatever CVS/SVN/etc repos to the extent
>> > maximally permitted by law, as Doug hints at. Just nuke the jars.
>>
>> I've done this (CVS->SVN->GIT) before. It wasn't that difficult.
>> Eventually (for git) you script it and it gets version after version
>> from CVS or SVN and appends it to git. I admit I didn't care much
>> about svn merging infos though. Any files can be removed/pruned by
>> rewriting git trees before they're published.
>>
>> Dawid
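
For anyone who wants to see which blobs exceed a size threshold before rewriting anything, here is a minimal, read-only sketch in Python. It is not how BFG works internally; it pipes `git rev-list --objects --all` into `git cat-file --batch-check`. The function names `parse_batch_check` and `large_blobs` are made up for illustration, and reading "5M" as 5 * 1024 * 1024 bytes is an assumption.

```python
import subprocess


def parse_batch_check(output, min_bytes):
    """Return (size, sha, path) tuples for blobs larger than min_bytes, largest first.

    Each input line is expected to look like "<type> <sha> <size> <path>",
    which is the format requested from `git cat-file --batch-check` below.
    """
    hits = []
    for line in output.splitlines():
        # The path is last so that it may contain spaces; it is empty for commits.
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue
        size = int(parts[2])
        if size > min_bytes:
            path = parts[3] if len(parts) > 3 else ""
            hits.append((size, parts[1], path))
    return sorted(hits, reverse=True)


def large_blobs(repo, min_bytes=5 * 1024 * 1024):
    """List large blobs reachable from any ref in `repo` (hypothetical helper).

    Assumes `git` is on PATH and that path names decode in the default encoding.
    """
    # Every object reachable from any ref, one "<sha> <path>" per line.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # %(rest) echoes whatever followed the sha on the input line, i.e. the path.
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked, min_bytes)
```

Calling `large_blobs("lucene-solr-mirror")` against a mirror clone should give a list comparable to the BFG report above, without modifying the repository.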
