> The Python job finished successfully here after 10 hours.

6h40 mins here as I ported your improved logic to the python2 version :).

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

The tree-filter blows up the .git/objects store to 13G though. But nothing a git gc can't fix.

> I did some tests on the new git repository. Cloning the repository from
> scratch takes around 2 minutes (the original repo: 21 minutes).

Confirmed.

> So that's about it. I have not done a thorough job at checking the
> actual *integrity* of the results. It's difficult, considering CVE
> identifiers are not sequential in the data/CVE/list file, so a naive
> diff like this will fail:
>
>     $ diff -u <(cat
> ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999}
> ) data/CVE/list | diffstat
>      list |106562
> +++++++++++++++++++++++++++++++++----------------------------------
>      1 file changed, 53281 insertions(+), 53281 deletions(-)
>
> But at least the numbers add up: it looks like no line is lost. And
> indeed, it looks like all CVEs add up:
>
>     $ diff -u <(cat
> ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999}
> | grep ^CVE | sort -n ) <( grep ^CVE data/CVE/list | sort -n ) | diffstat
>      0 files changed
>
> A cursory look at the diff seems to indicate it is clean, however.

I uploaded "my" version to https://people.debian.org/~dlange/ so people can poke the log and diffs and see whether there are any issues left.

> I looked at splitting that file per CVE. That did not scale and just
> created new problems. But splitting by *year* seems like a very
> efficient switch, and I think it would be worth pursuing that idea
> forward.

The tools in bin/ would need a brush through. I.e. throw away the unused ones and amend the ones that are used on data/CVE/* to learn about the split files.
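
For anyone who wants to go a bit further than the grep/sort comparison quoted above, here is a rough sketch of an order-independent check. It counts every line (not just the ^CVE headers) in the original list and in the per-year files and reports anything that went missing or got duplicated. The function names (line_counts, compare_split) are made up for this mail, this is not what split-by-year.py does, and the script has not been run against the real repository.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, Tuple


def line_counts(paths: Iterable[Path]) -> Counter:
    """Count how often each line occurs across all given files."""
    counts: Counter = Counter()
    for path in paths:
        # errors="replace" is an assumption: it avoids aborting on stray
        # non-UTF-8 bytes, at the cost of not distinguishing them.
        with open(path, encoding="utf-8", errors="replace") as handle:
            counts.update(handle)
    return counts


def compare_split(original: Path, parts: Iterable[Path]) -> Tuple[Counter, Counter]:
    """Return (missing, extra) line multisets, ignoring line order.

    missing: lines in the original that the split files lack.
    extra:   lines in the split files that the original lacks.
    """
    before = line_counts([original])
    after = line_counts(parts)
    # Counter subtraction keeps only positive counts, so each result
    # lists exactly the surplus on one side.
    return before - after, after - before
```

Calling compare_split(Path("data/CVE/list"), sorted(Path("../security-tracker-full-test-filtered-bis/data/CVE").glob("list.*"))) should return two empty Counters if nothing was lost. Since order is ignored, this only shows that no line was dropped or duplicated; it does not prove that each annotation line stayed under its own CVE entry.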