Another thought would be to pursue a FUSE-like approach similar to scord [1][2], which implements a lightweight file system adapter that knows just enough about the pristine store and the working-copy files to maintain a single copy of the pristine contents for the overwhelming majority of files: those whose working-copy contents are identical to the pristine base contents. It would also need to perform a transparent copy-on-write if/when any modifications (e.g. write, truncate, etc.) are made to a working-copy file. Note: this would not be a full file system, just an adaptation layer on top of the underlying file system that would manage the wc/pristine pairings and trigger copy-on-write as necessary. Each file managed by this layer would be either a direct pass-through to a file in the underlying file system, or a reference to a pristine file.
If the full goal is to reduce pressure on the underlying file system in the presence of many large working copies (e.g. one per branch), then duplicate pristine contents, even with super-awesome compression, would not match the space savings of a de-duplicated, pristine-aware, copy-on-write file system.

[1] http://svn.haxx.se/dev/archive-2007-05/0486.shtml
[2] http://scord.sourceforge.net/

regards,
markt

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Mon, 2 Apr 2012 11:16:07 +0100 (BST)

Hi Ashod.

1. Filesystem compression. Would you like to assess the feasibility of compressing the pristine store by re-mounting the "pristines" subdirectory as a compressed subtree in the operating system's file system? This can be done (I believe) under Windows with NTFS <http://support.microsoft.com/kb/307987> and under Linux with FUSE-compress <http://code.google.com/p/fusecompress/>. Certainly the trade-offs are different, compared with implementing compression inside Subversion, but delegating the task to a third-party subsystem could give us a huge advantage in terms of reducing the ongoing maintenance cost.

2. Uncompressed copies. There has been a lot of discussion about achieving maximal compression by exploiting properties of similarity, ordering, and so on. That is an interesting topic. However, compression is not the only thing the pristine store needs to do. The pristine store implementation also needs to provide *uncompressed* copies of the files. Some of the API consumers can and should read the data through svn_stream_t; this is the easy part. Other API consumers -- primarily those that invoke an external 'diff' tool -- need to be given access to a complete uncompressed file on disk. At the moment, we just pass them the path to the file in the pristine store.
When the pristine file is compressed, I imagine we will need to implement a cache of uncompressed copies of the pristine files. The lifetimes of those uncompressed copies will need to be managed, and this may require some changes to the interface that is used to access them. A typical problem is: the user runs "svn diff", svn starts up a GUI diff tool and passes it two paths: the path to an uncompressed copy of a pristine file, and the path of a working-copy file. The GUI tool runs as a separate process and the "svn" process finishes. Now the GUI diff is still running, accessing a file in our uncompressed-pristines cache. How do we manage this so that we don't immediately delete the uncompressed file while the GUI diff is still displaying it, and yet also know when to clean up our cache later?

We could of course declare that the "pristine store" software layer is only responsible for providing streamy read access, and that the management of uncompressed copies is the responsibility of higher-level code. But no matter where we draw the boundary, that functionality has to be designed and implemented before we can successfully use any kind of compression.

- Julian

From: Branko Čibej <brane_at_apache.org>
Date: Sun, 01 Apr 2012 09:23:58 +0200

On 31.03.2012 23:30, Ashod Nakashian wrote:
>>> Git can keep deleted items until git-gc is invoked, should we support
>>> something similar, we need to be consistent and probably support arbitrary
>>> revision history, which is out of scope.
>>
>> I'm confused: how does revision history affect the pristine store?
>
> If the pristine store also keeps multiple revisions, then it's a whole different set of features than what we are aiming for (at least for compressed pristines).

Certainly the pristine store keeps multiple revisions of files.
After all, it's just a SHA-1 => contents dictionary, so every time you "svn update", you'll get new revisions of files in the pristine store. What the store doesn't do is /know/ about the revisions. Neither does the wc.db, which only tracks reference counts for the SHA-1 keys. Every time a file changes, its hash will change, too, a new key will be inserted in the pristine store, and the reference count for the old key will be decremented. I'm not sure what happens when the count reaches zero; it used to be that only "svn cleanup" would delete unreferenced pristines, but ISTR this changed a while ago.

In any case, the pristine store shouldn't worry about revisions, only about efficiently storing the contents. It doesn't even have to worry about reference counting, since wc.db already does that.

-- Brane

P.S.: If we ever implement local snapshots and/or local branches, it /still/ won't be the pristine store's problem to track whole-tree info. This is why I like the clear separation between the pristine store, which is a simple dictionary, and wc.db, which is moderately complex.

P.P.S.: When we transition from a pristine store per working copy to a pristine store per ~/.subversion directory, the pristine store will have to track how many working copies are using it. But that's way in the future -- and another good reason to use a proper database for the indexing.

Received on 2012-04-01 09:24:15 CEST
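Brane's description of the store (a plain SHA-1 => contents dictionary, with reference counting kept separately as wc.db does) can be sketched as follows. This is an illustrative model with hypothetical names, not Subversion's actual pristine-store code, and it folds the refcount index into the same class purely to keep the example short:

```python
import hashlib

class PristineStore:
    """Illustrative model: the store itself is just a content-addressed
    dictionary keyed by SHA-1; reference counts are a separate concern
    (in Subversion they live in wc.db, not in the store)."""

    def __init__(self):
        self.contents = {}  # sha1 hex -> bytes   (the pristine store)
        self.refcount = {}  # sha1 hex -> int     (wc.db's job, modeled here)

    def install(self, data):
        key = hashlib.sha1(data).hexdigest()
        self.contents[key] = data  # idempotent: same contents, same key
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def release(self, key):
        self.refcount[key] -= 1
        # Whether an unreferenced pristine is deleted immediately or only
        # on "svn cleanup" is a policy question outside the store proper;
        # here we delete eagerly for simplicity.
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.contents[key]
```

An "svn update" that changes a file then amounts to `install`ing the new text and `release`ing the old key; identical contents in many working-copy files share one entry, which is exactly why the store never needs to know about revisions.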