Elazar Leibovich <elaz...@gmail.com> writes:

> On Wed, May 8, 2013 at 11:11 PM, Tzafrir Cohen <tzaf...@cohens.org.il> wrote:
>
> Git stores files. It should do handle such deduping by design. But this
> is in Git's storage, and not in the actual filesystem:
>
> git packs them in a pack file.
>
> Use git gc to make it aware of changes, or just look at my reply to Oleg.
Well, I have a horrible suspicion that I did not make myself quite clear. At least that is my impression from your and Tzafrir's replies. Let me try and rephrase.

I am going to make some assumptions about what is going on as you build. I obviously think these assumptions are reasonably close to reality; you'll tell us which of them break down. Then I will review the procedure I suggested yesterday, and discuss when it (or something like it) may be needed.

I would suggest - assuming my explanation below is clear enough - that you take a few of your actual builds that differ a bit, commit them as sequential revisions of the same file (as I describe below) into svn/git/hg/whatever, and see how much space the repository takes in each case (a minimal sketch of such an experiment follows the assumptions below).

It is not clear to me how what you did with git is related to my suggestion, and it does look to me that Tzafrir misread my intent. Of course, it may be me who is missing something.

The assumptions:

A1. You have a build procedure that updates from a source version control repository, does a build - full or incremental, I'll touch upon that a bit later - and creates a tar.gz file with jars inside. The build itself is unattended, even if it is triggered manually (or on schedule, or on an event - whatever). At present, all the builds are kept as independent files, with names like build-<NNN>.tar.gz (where <NNN> is a build number that is incremented), or build-<yyyymmdd-HHMM>.tar.gz (where <yyyymmdd-HHMM> is a timestamp), or according to some other naming scheme whose particulars are not very important.

A2. You have an install procedure that takes one of the build-*.tar.gz files as input. Whether it is a command, a script, or whatever is not important. The actual argument - functionally, not by implementation - is the build number, or the timestamp, or some other identifier that allows the install procedure (maybe the user, manually) to find the right archive.

A3. Your problem or concern is that all the numerous builds together take too much space even though they don't differ that much. This is what you want to alleviate.

A4. Your development team is used to the current procedure and you want to keep things as transparent as possible. You are willing to change some things (e.g., keep the built archives in a different place, on a different partition with a different file system, and make the install tool - or the users - aware of the change in pathnames, etc.) as long as it doesn't change or complicate the procedures too much from the users' point of view.

A5. Of all the builds, only a few are useful often: possibly a few recent ones to facilitate rollbacks when a problem occurs, plus some known good old ones corresponding to production releases, versions for specific customers, etc. I expect them to be identifiable, even if only by some Excel spreadsheet that says that build 437 is production version 3.1 in the field (hopefully something more automatic). I do not expect the whole team to know what a random build-328.tar.gz from 4 months ago corresponds to, or to use it regularly.
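Just to make the experiment I suggested above concrete: here is a minimal sketch, assuming git (svn or hg would be analogous) and a hypothetical /path/to/builds directory holding a few of your real archives. It commits them as successive revisions of a single build.tar.gz and compares the space the repository takes with the space the original archives take:

    #!/bin/sh
    # Hypothetical location of a few real build archives to test with.
    SRC=/path/to/builds

    git init -q build-size-test
    cd build-size-test || exit 1

    for f in "$SRC"/build-*.tar.gz; do
        # Always the same filename: each build becomes a new revision of build.tar.gz.
        cp "$f" build.tar.gz
        git add build.tar.gz
        git commit -q -m "import $(basename "$f")"
    done

    git gc --quiet                             # let git delta-compress the revisions
    du -sh .git                                # space taken by the repository
    du -ch "$SRC"/build-*.tar.gz | tail -n 1   # space taken by the originals

If the .git figure comes out close to one archive plus small deltas, the procedure below should pay off; if the deltas turn out to be large, it won't.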
So my suggestion yesterday was as follows:

1. Create a new repository (or module, if your version control system supports it) that will only ever hold a single file - build.tar.gz. The version control system should be chosen to handle binary diffs efficiently (in space, and in time as a secondary consideration). Assuming svn/git/hg/whatever are all good at it, choose the tool your dev team is most comfortable with.

2. Add an extra step to the build procedure - either modify the build process itself or add an external step after the build. The step moves or copies build-<NNN>.tar.gz to results/build.tar.gz - always the same filename for all the different builds - which is under version control. NB: *clobber* the file, do not *add* another one (which is how Tzafrir seems to have understood me). The version control system will recognize the file as locally modified. Commit the change. The repository will hold binary deltas, i.e., if you do this 10 times your repository will not hold 10*X bytes (where X is the typical build size), but X+9*dX bytes, where the binary delta dX is much smaller than X. The file's version may be made to correspond to the build number (e.g., SVN increments revision numbers on commit, though git/hg don't), or it may be tagged symbolically as part of the process (see the first sketch after this list). By assumption A1 above the build is unattended, so you should be able to do this without anyone noticing anything.

3. Once step 2 above is done, you can decide which of the original build-<NNN>.tar.gz files you can remove. By assumption A5 you can remove many/most of them, e.g., all older than a week/month/whatever except a few marked as useful in the long run. Assuming that the "useful" marks are detectable, this step can be automated and incorporated into the build.

4. If you can modify the install procedure to get the right version from the version control system and then use the checked-out file (see assumption A2), then you can remove *all* the build results after they have been committed (and tagged) as revisions; the second sketch after this list shows both this and step 3. If your current system is sane, then such interference (steps 2-4 here) will not lead to a horrible disruption of your team's workflow.

5. If the installation procedure cannot be tampered with at all, you still end up with a lot of space savings. According to step 3 you keep the likely useful builds exactly as they were before. In those rare cases when someone needs a random build 328 from 4 months ago that no one has touched since it was created, it can be checked out manually; I am pretty sure your team can handle this task once in a blue moon.

6. As a bonus, your version control system will allow you to get not only build 576 but also the last build before May 9, etc.
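To make step 2 above concrete, here is a minimal sketch of the extra build step, again assuming git; the paths (/builds for the archives, /srv/build-archive for a working copy of the dedicated repository) and the way the build number is passed in are hypothetical, to be adjusted to your layout:

    #!/bin/sh
    # Hypothetical paths and names -- adjust to your build layout.
    BUILD_NUM="$1"                              # e.g. 437, passed in by the build
    ARCHIVE="/builds/build-${BUILD_NUM}.tar.gz"
    REPO="/srv/build-archive"                   # working copy of the dedicated repository

    cd "$REPO" || exit 1

    # Clobber the single tracked file -- do not add a new file per build.
    cp "$ARCHIVE" results/build.tar.gz
    git add results/build.tar.gz
    git commit -q -m "build ${BUILD_NUM}"

    # Symbolic name so this exact build can be found later by its number.
    git tag "build-${BUILD_NUM}"

With SVN you would get an increasing revision number for free instead of the tag; with git or hg the tag plays that role.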
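And a corresponding sketch of steps 3 and 4, under the same assumptions plus a hypothetical /builds/keep-list.txt naming the archives that must stay around: the first part pulls the archive for a given build number straight out of the repository for the install procedure, the second prunes the old loose archives:

    #!/bin/sh
    REPO="/srv/build-archive"

    # Step 4: extract the archive for build number $1 straight out of the
    # repository and hand it to the existing install procedure.
    BUILD_NUM="$1"
    TMP=$(mktemp -d)
    ( cd "$REPO" && git show "build-${BUILD_NUM}:results/build.tar.gz" ) \
        > "$TMP/build-${BUILD_NUM}.tar.gz"
    # ... run the usual install against "$TMP/build-${BUILD_NUM}.tar.gz" here ...

    # Step 3: remove loose archives older than 30 days, except those listed
    # (one filename per line) in the hypothetical keep list.
    find /builds -name 'build-*.tar.gz' -mtime +30 \
        | grep -v -F -f /builds/keep-list.txt \
        | xargs -r rm -f

The keep-list part is only there to honour assumption A5; once the install side is changed as in step 4, the find/rm part can simply delete everything that has been committed.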
Now, when is this exercise worthwhile? Only when the build procedure itself is prohibitively expensive or lengthy. If it is not, then I'd say don't store your built archives forever and just rebuild from version control when needed (you do tag your source snapshots with build numbers, etc., right?). One should not store build results for long unless it is necessary. So, what are the use cases that may justify storing build results?

C1. The full build is waaaay too long, say, many hours. Your "continuous integration" process would not allow you to build many times a day if you did full builds, but you track dependencies intelligently and build incrementally (cf. assumption A1 above).

C2. You find out that checking an archive out of version control is much more efficient than checking out the source and building - say, 10 seconds vs. 10 minutes - and it is needed often enough.

C3. Your SOP is to link/test your changes against a large number of builds, corresponding to supported production releases, custom versions for specific clients, etc., and you find that checking out that many revisions from the repository and building them takes too long. Maybe even checking out the numerous archives is too sluggish. So keep those target builds as described in step 3 of the suggested procedure, but remove everything else. There can't be too many of those - if there are, then your support matrix is so huge that you have bigger problems than buying a terabyte disk.

There may be other use cases that I am missing now, but the point is that you need to really think about your procedures and needs and understand whether or not a neat deduping trick (or a functional equivalent) is really needed. My guess is that in most cases it is not.

--
Oleg Goldshmidt | p...@goldshmidt.org

_______________________________________________
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il