0. Introduction
===============

The first incarnation of onegit.sh, despite all the tuning effort, was still taking 36 hours to run. That was within the 'we should be able to do the conversion over a week-end' criterion, but it was still painfully long.

So, plan B: I tried to use git fast-export/fast-import instead of git filter-branch. That plan proved successful, and the conversion itself now takes about 30 _minutes_ (add another 15 for a final git gc of the resulting core.git, and a couple of hours to upload it all).

The core of plan B is lo_git_rewrite, a small C program that massages the data stream between git fast-export and git fast-import. It is available in the dev-tools git repo.

1. Usage Notes
==============

If you want to try it for yourself, here are a few things to know:

1.1 Pre-requisites

1.1.1 Platform

This has only been tested on Linux. Other platforms may work, but use them at your own risk.

1.1.2 Git

This has been tested with git 1.7.3.4. Any recent git should work, but lo_git_rewrite makes a lot of implicit assumptions about the data stream provided by git fast-export, so any version of git that alters that stream, even in a way compatible with the git fast-import specification, may cause trouble.

1.1.3 Source git repos

You need to have a 'source' bootstrap tree, including clone/translation. Make sure that master is checked out and that you are up-to-date and clean.

1.1.4 dev-tools

You need to clone the contrib/dev-tools repo and run make in dev-tools/lo_git_rewrite.

1.1.5 Temp space

Most of the work is done in a temporary directory. You need 5+GB of space there (I don't know the exact amount for sure, but 5GB should be enough).

1.1.6 Target repos

The onegit.sh script will create a target repository, with clone/* populated with the remaining separate git repos (help, translations, dictionaries and binfilter). The core repo is initially not properly compacted, and since you probably want to build it, you need enough space. As a rule of thumb, count the same amount of space you would normally reserve for a regular bootstrap build.
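
The pre-requisites above can be checked before starting a long run. The following is a minimal sketch of such checks and is not part of onegit.sh; the function names are made up for illustration, and the checks only cover what is stated above (free space, master checked out, clean tree).

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of onegit.sh.

# Print the free space, in kB, of the filesystem holding directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if directory $1 has at least $2 GB free.
has_free_gb() {
    [ "$(free_kb "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

# Succeed if the git work tree in $1 is on master with no local changes.
# Uses only options that old git versions (1.7.x) already understand.
repo_is_ready() {
    (
        cd "$1" 2>/dev/null || exit 1
        [ "$(git symbolic-ref HEAD 2>/dev/null)" = "refs/heads/master" ] &&
            [ -z "$(git status --porcelain 2>/dev/null)" ]
    )
}
```

For instance, `has_free_gb /some/tempdir 5 || echo "not enough temp space"` would flag a temporary directory that is too small, and `repo_is_ready /path/to/source` would fail on a tree with local changes or untracked files.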

1.2 Running

Assuming that your source bootstrap repo is at /lo/libo, that the target will be /lo/core, that the dev-tools repo is at /lo/dev-tools and that your temporary workspace is /fast, then run:

    cd /lo
    time ./dev-tools/onegit/onegit.sh -f -g /lo/libo -n core -t /fast/gittemp 2>&1 | tee onegit.log

While it is running you can look at /lo/onegit.msgs, which contains a high-level log of what is going on. Lines in onegit.msgs normally start with "==="; any line that starts with "***" indicates that something went wrong.

Note: onegit.sh has been tuned to work optimally on an Intel Xeon X3360 @ 2.83GHz (quad-core) with 8GB of memory and pretty good disks. For optimal results on a different machine you may need to tweak the number of batches that run in parallel and their composition (see section 2.2.2 for a gotcha).

1.3 Known issue

As a final step, the onegit.sh script tries to apply a set of patches that fix issues related to the migration. Unfortunately, since master is a moving target, these patches may fail to apply. At this stage the conversion is done and core is usable. You can try to fix the patches that failed to apply (and apply the rest of the patches). The patches are in dev-tools/onegit/patches/*

1.4 Testing

Once all the patches are applied, you can start using the 'core' repo as if it was bootstrap.

2. Reviewer Notes
=================

Reviews are of course welcome. In order to help with the review, here are a few pointers.

2.1 Review of lo_git_rewrite

lo_git_rewrite is a fairly small C program that sits between git fast-export and git fast-import. Its goal is to fix the trailing-space and tab issues, and optionally to exclude or filter out a specific module and/or filter out files with a specific extension.

2.1.1 Arguments

lo_git_rewrite understands the following command line arguments (note that the syntax is --foo bar and NOT --foo=bar). All these arguments are optional.

  --prefix "string"
    This is used to prefix output messages to stderr with the specified string.
    This is used in onegit.sh because more than one instance of lo_git_rewrite is running in parallel, and it allows a message to be linked to a specific lo_git_rewrite instance. The default is an empty string.

  --exclude-module "module_name"
    This tells lo_git_rewrite to filter out any file whose name starts with module_name/. This is used in onegit.sh to filter a module out of a given repo, like binfilter or dictionaries.

  --filter-module "module_name"
    This tells lo_git_rewrite to filter out any file whose name does _not_ start with module_name/. This is used in onegit.sh to extract a given module from an existing repo, like binfilter or dictionaries.

  --exclude-suffix "string"
    This tells lo_git_rewrite to exclude any file whose name ends with "string". This is used in onegit.sh to eliminate obsolete .tar.gz files from the libs-extern-sys and libs-extern history.

  --buffer-size nnn
    This tells lo_git_rewrite to allocate a working buffer of nnn MB. nnn must be a number between 10 and 1024. By default lo_git_rewrite allocates a 30MB buffer and an additional 45MB (nnn * 1.5) temp buffer to do file content conversion. The buffer needs to be big enough to contain the biggest blob that can be encountered in the stream. onegit.sh uses this option for 2 of the repos that have particularly large blobs (libs-extern-sys and extension).

2.1.2 Operation

git fast-export creates a stream of 'objects'. Objects are identified and referenced using an id of the form :<number>. For our purpose there are 3 types of objects: blob, commit and tag.

Blobs come in the stream before they are used, and at that point there is no indication of what filename(s) will be associated with them. So we need to 'clean' every blob and re-inject two copies of the blob in the stream, so that later (when we have a filename) we can decide which copy we need to use. The problem, of course, is that we need to assign a unique id to the new blobs we create 'on-the-fly'.
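
To make the id problem concrete: a second, distinct id can be derived from an existing one purely as text, by appending a digit, which is the scheme described next. The sketch below only illustrates that string manipulation in shell. lo_git_rewrite itself is a C program working on the raw stream, and the function names here are hypothetical.

```shell
#!/bin/sh
# Illustration only: hypothetical helpers, not taken from lo_git_rewrite.

# Id used for a mark that passes through from the original stream.
passthrough_id() {
    printf '%s0\n' "$1"
}

# Id given to the extra blob created on the fly for the same original mark.
extra_blob_id() {
    printf '%s1\n' "$1"
}

passthrough_id :42   # prints :420
extra_blob_id :42    # prints :421
```

Since every original id yields one id ending in 0 and one ending in 1, two different original ids can never produce the same new id, and no text-to-integer conversion is needed.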

The technique lo_git_rewrite uses is to intercept every id of the form :<number> and transform it into :<number>0, except for the extra blob we create on the fly, which gets assigned :<number>1, where :<number> is the id of the original version of the blob.

Note: we could have used :2*<number> and :2*<number>+1, but that would have required converting text to integer and vice versa for each occurrence of such an id, and for libs-core, for instance, git fast-import reports more than 1 billion such ids in the stream (yes, billion as in 10^9, 1,073,741,824 to be precise :-) ).

Commits are where all the meat is. lo_git_rewrite analyzes <filemodify> and <filedelete> entries. Depending on the filename of the entry and the filtering rules, each entry is either modified to use :<number>0 or :<number>1, depending on whether we want the 'sanitized' version of the blob or not, or simply removed if that filename needs to be filtered out. There is no attempt to eliminate 'empty' commits that result from every filemodify/filedelete entry being removed.

Tags are only modified to substitute ids with :<number>0.

The code that 'sanitizes' blobs is essentially the same as the one that existed in clean_spaces. The main difference is that the 'copy-on-write' optimisation that exists in clean_spaces has been removed, since we always want an actual copy.

lo_git_rewrite trusts git fast-export to produce a sane, predictable stream. There is very little code/cpu expended to check that the input stream is sane. The goal was speed and simplicity, not robustness against misuse.

2.2 Review of onegit.sh

2.2.1 Arguments

onegit.sh --help displays a short summary of the supported arguments and their default values.

2.2.2 Operation

The script is organized in 3 sections. First we check the arguments and verify that the environment is sane and that we have all we need. Then we run 4 parallel batch sections that are balanced so that they should finish at about the same time.
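
The parallel phase can be pictured as shell background jobs joined by `wait`. This is a minimal sketch of that pattern and not the actual onegit.sh code: `process_repo` is a stand-in stub, and the repo names and batch composition are placeholders.

```shell
#!/bin/sh
# Sketch of the batching pattern only; all names below are placeholders.

# Stand-in for the real per-repo conversion step.
process_repo() {
    echo "=== converted $1"
}

# A batch converts its repos one after the other.
run_batch() {
    for repo in "$@"; do
        process_repo "$repo"
    done
}

# Start four batches in the background and wait for all of them.
# $1 is the file that collects the progress lines.
run_all_batches() {
    : > "$1"
    run_batch repo-a1 repo-a2 >> "$1" &
    run_batch repo-b1 repo-b2 >> "$1" &
    run_batch repo-c1 repo-c2 >> "$1" &
    run_batch repo-d1 repo-d2 >> "$1" &
    wait
}
```

Balancing then amounts to choosing the four lists so that the jobs take about the same wall-clock time; the order of repos inside each list decides what starts first in that batch.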

One implicit requirement is that the 'processing' of bootstrap needs to finish before that of any other repo. That is why the first task of each of the other batches is a 'big' repo that is guaranteed to take significantly longer than bootstrap to finish.

Finally, a tag is applied on the target repos, and patches are applied to make the resulting repos 'buildable'.

Norbert