On Sun, Dec 12, 2010 at 9:23 AM, <stef...@apache.org> wrote: > Author: stefan2 > Date: Sun Dec 12 15:23:18 2010 > New Revision: 1044833 > > URL: http://svn.apache.org/viewvc?rev=1044833&view=rev > Log: > Get some ideas for FSFS format improvements written down. > > * notes/fsfs-improvements.txt: new file > > Added: > subversion/trunk/notes/fsfs-improvements.txt (with props) > > Added: subversion/trunk/notes/fsfs-improvements.txt > URL: > http://svn.apache.org/viewvc/subversion/trunk/notes/fsfs-improvements.txt?rev=1044833&view=auto > ============================================================================== > --- subversion/trunk/notes/fsfs-improvements.txt (added) > +++ subversion/trunk/notes/fsfs-improvements.txt Sun Dec 12 15:23:18 2010 > @@ -0,0 +1,224 @@ > +Introduction > +------------ > + > +The FSFS external data format can be changed to allow for > +significantly reduced I/O overhead without changing the > +fundamental ideas behind FSFS. > + > +The whole idea is to rearrange packed revision info in a > +new FSFS format "6". A two-way conversion between format > +"5" and "6" should be possible as well as supporting both > +formats for different repositories on the same server.
Generally, I like these ideas. As I understand it, they would be implemented as part of the packing process, or the pack files would be rewritten as part of the repos format upgrade. This could get a bit tricky, since FSFS depends on static file offsets for referencing data within revision files. Not impossible, just tricky. > + > + > +Revision Order > +-------------- > + > +FSFS format "5" packs revisions by putting revision 0 at > +the beginning of the file and concatenating the others in > +ascending order. This intuitive design is counter-productive > +because we always trace data HEAD -> r0 and scanning a file > +backwards is expensive. > + > ++-------+ > +| rev 0 | > ++-------+ > +| rev 1 | > ++-------+ > + > +: ... : > + > ++-------+ > +| rev N | > ++-------+ > +<EOF> > + > +A counter-intuitive but more efficient reversed order (de- > +scending revision order in a packed file) results in always > +reading forward through a file. > + > ++-------+ > +| rev N | > ++-------+ > + > +: ... : > + > ++-------+ > +| rev 1 | > ++-------+ > +| rev 0 | > ++-------+ > +<EOF> > + > + > +Don't Interleave Revision and Content Data > +------------------------------------------ > + > +Format "5" keeps the each revision as a single block. When > +re-constructing a node content from the diffs, we jump > +through a number of distant revision blocks. For each node. > + > ++--------------+ > +| rev N Reps | > ++--------------+ > +| rev N Header | > ++--------------+ > + > +: ... : > + > ++--------------+ > +| rev 1 Reps | > ++--------------+ > +| rev 1 Header | > ++--------------+ > +| rev 0 Reps | > ++--------------+ > +| rev 0 Header | > ++--------------+ > +<EOF> > + > +Instead, delta-info should be removed from the revision > +blocks and put at the end of the file. This speeds up > +header-only operations like "log" without impairing node > +content lookup. And it lays the foundation for further > +improvements. > + > ++--------------+ > +| rev N Header | > ++--------------+ > + > +: ... : > + > ++--------------+ > +| rev 1 Header | > ++--------------+ > +| rev 0 Header | > ++--------------+ > +| rev N Reps | > ++--------------+ > + > +: ... : > + > ++--------------+ > +| rev 1 Reps | > ++--------------+ > +| rev 0 Reps | > ++--------------+ > +<EOF> > + > +The only downside is that putting headers first is somewhat > +more computationally expensive since offsets needs to be > +calculated in advance. This is made even more difficult by > +using a variable length encoding for those offsets. Does the statement about computation cost apply to the initial placement, or subsequent accesses? (I would guess the former.) > + > + > +Group the Representations by Delta Tree > +--------------------------------------- > + > +All diffs that are based on the same node within the packed > +revision file are put in one place. Re-constructing a node > +would then involve reading the revision headers (relatively > +close together) and the content deltas after that (again > +with high locality). > + > ++------------------+ > +| rev N Header | > ++------------------+ > + > +: ... : > + > ++------------------+ > +| rev 1 Header | > ++------------------+ > +| rev 0 Header | > ++------------------+ > +| node tree A Reps | > ++------------------+ > +| node tree B Reps | > ++------------------+ > + > +: ... : > + > ++------------------+ > +| node tree Z Reps | > ++------------------+ > +<EOF> > + > +We can optimize that even further by exploiting the skip- > +delta information: the "walks" through the delta tree for a > +given node will merge bit by bit into a main "trunk". The > +nodes on these paths can be re-ordered such that very few > +seek() operations are required on average to reconstruct > +the node content in this file. > + > +Example of 8 changes in revs r0 .. r7 (for simplicity), > +looking at the delta-info only ("->" means "stored as diff > +against"): > + > + r0 > + r1->r0 > + r2->r0 > + r3->r2 > + r4->r0 > + r5->r4 > + r6->r4 > + r7->r6 > + > +Default ordering r7,r6,r5,r4,r3,r2,r1,r0 requires > +3/3/2/2/2/2/1/1 seeks. For 2^N changes (N integer > 0), > +we need (N+1)/2 seeks on average. > + > +A path-optimized ordering r1,r3,r2,r5,r7,r6,r4,r0 > +requires 2/2/2/2/1/1/1/1 seeks. It's (N+3)/4 on average. > +That optimized ordering can easily be constructed by > + > + * select the highest rev not covered, yet > + * prepend its diff path to the existing result > + (until we meet a revision at already is in > + in the result) > + * repeat until all revs are covered > + > + > +Move the Manifest Info into the Packed File > +------------------------------------------- > + > +Once we mastered the offset pre-calculation issue (see > +above), we may as well put the manifest info at the be- > +ginning of the file. This will benefit "log" in particular > +as only one consecutive chunk of data needs to be read > +from only one file. I don't completely understand the benefits we get from this, since the manifest is read one and then cached. I would think that the accesses into the pack file would dominate I/O, and the placement of the manifest wouldn't matter much (though I could certainly be wrong). > + > ++------------------+ > +| rev 0 Offset | > ++------------------+ > +| rev 1 Offset | > ++------------------+ > + > +: ... : > + > ++------------------+ > +| rev N Offset | > ++------------------+ > +| rev N Header | > ++------------------+ > + > +: ... : > + > ++------------------+ > +| rev 1 Header | > ++------------------+ > +| rev 0 Header | > ++------------------+ > +| node tree A Reps | > ++------------------+ > +| node tree B Reps | > ++------------------+ > + > +: ... : > + > ++------------------+ > +| node tree Z Reps | > ++------------------+ > +<EOF> > + > + > > Propchange: subversion/trunk/notes/fsfs-improvements.txt > ------------------------------------------------------------------------------ > svn:eol-style = native > > > Overall, it looks like a good plan. Thanks! -Hyrum