Daniel Shahaf wrote on Thu, May 26, 2011 at 00:36:48 +0300:
> I've not read the code, but would an array of 'svn_revnum_t[2]' be
> a better representation?
>
> Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
> by from_rev.  That's less overhead (and we could take advantage of the
> sorting to store less data), at the cost of O(log(n)) lookup.
>
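For concreteness, a minimal sketch of the kind of pair array being proposed
(hypothetical names, not actual Subversion code; a plain long stands in for
svn_revnum_t, and malloc/realloc stand in for whatever pool allocation the
real code would use):

#include <stdlib.h>

typedef long revnum_t;               /* stand-in for svn_revnum_t */

typedef struct rev_pair_t {
  revnum_t from_rev;                 /* revision in the dump stream */
  revnum_t to_rev;                   /* revision in the target repository */
} rev_pair_t;

typedef struct rev_map_t {
  rev_pair_t *pairs;                 /* kept sorted by from_rev */
  size_t count;
  size_t capacity;
} rev_map_t;

/* Append a mapping.  Assuming callers record dump-stream revisions in
   increasing order (as a loader would), the array stays sorted by
   from_rev with no extra work. */
static int
rev_map_append(rev_map_t *map, revnum_t from_rev, revnum_t to_rev)
{
  if (map->count == map->capacity)
    {
      size_t new_cap = map->capacity ? map->capacity * 2 : 1024;
      rev_pair_t *p = realloc(map->pairs, new_cap * sizeof(*p));

      if (!p)
        return -1;
      map->pairs = p;
      map->capacity = new_cap;
    }
  map->pairs[map->count].from_rev = from_rev;
  map->pairs[map->count].to_rev = to_rev;
  map->count++;
  return 0;
}

/* Binary search: return the target revision recorded for FROM_REV,
   or -1 if none was recorded.  O(log n). */
static revnum_t
rev_map_lookup(const rev_map_t *map, revnum_t from_rev)
{
  size_t lo = 0, hi = map->count;

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (map->pairs[mid].from_rev == from_rev)
        return map->pairs[mid].to_rev;
      else if (map->pairs[mid].from_rev < from_rev)
        lo = mid + 1;
      else
        hi = mid;
    }
  return -1;
}

Eight bytes per mapping on a 32-bit build (sixteen on 64-bit), no per-entry
hash node, and no bucket array; that is where the savings estimated below
come from.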
And the numbers... well, for a plain C array it would be 70% less than in
Greg's analysis of the hash case, since the additional 20 bytes per
mapping entry are gone.  (And that's before I suggest not storing
[N+1, M+1] if [N, M] is already stored.)

> Greg Stein wrote on Wed, May 25, 2011 at 17:21:44 -0400:
> > On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cmpil...@collab.net>
> > wrote:
> > > On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> > >> On 05/25/2011 03:49 PM, Greg Stein wrote:
> > >>> On Wed, May 25, 2011 at 15:33, <cmpil...@apache.org> wrote:
> > >>>> ...
> > >>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> > >>>> +     corresponding svn_revnum_t * target repository revisions. */
> > >>>> +  apr_hash_t *rev_map;
> > >>>
> > >>> How big can this grow?  ie. what happens when there are several million
> > >>> revisions.
> > >>
> > >> It gets big.  (This logic and approach are copied from 'svnadmin load',
> > >> which doesn't excuse it, but might explain it.)
> > >
> > > Actually, I don't really know for sure how big it gets.  It's a mapping
> > > of sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> > > internals.  Anybody have any guesses?
> >
> > struct apr_hash_entry_t is generally 20 bytes.  Add in the two revnums
> > (4 bytes each), and you get 28 bytes for each *used* entry.
> >
> > Now we also have to account for unused entries.  APR has a pretty poor
> > hash table implementation.  It allocates *upwards* to the nearest power
> > of two.  So the internal size will grow like:
> >
> > 1048576
> > 2097152
> > 4194304
> >
> > One saving grace is that APR only grows when the entry count matches
> > the internal table size.  It uses a "closed hash" algorithm with linked
> > lists at each bucket, so the actual load on the buckets is not
> > possible to compute.  The hand-wave means that you can put in 4 million
> > mappings before it grows the table to 8 million buckets.
> >
> > So... 4 million buckets (pointers) at 4 bytes each is 16 megabytes.
> > Each mapping will add another 28 bytes.  So: 4 million mappings is
> > about 134 megabytes.  But also recognize that *reaching* that point
> > will use and toss approx the same amount of memory.  So about 260 meg
> > total.
> >
> > On a 64-bit architecture, all these values are likely to be doubled.
> >
> > Not a machine crusher, in retrospect.  But not exactly a winner either.
> >
> > Cheers,
> > -g
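(For reference, the hash-based mapping those numbers describe would look
roughly like the sketch below.  The helper names are hypothetical, not the
actual svnadmin/svnrdump code, and a plain long again stands in for
svn_revnum_t.  Each recorded mapping pool-allocates a key and a value and
costs APR one apr_hash_entry_t, which is the roughly 28 bytes per entry
counted above, on top of the bucket array.)

#include <apr_pools.h>
#include <apr_hash.h>

typedef long revnum_t;   /* stand-in for svn_revnum_t */

/* Record that dump-stream revision FROM_REV became TO_REV in the target
   repository.  Each call pool-allocates a key and a value (4 bytes each
   on 32-bit), and APR adds an apr_hash_entry_t of its own. */
static void
set_revision_mapping(apr_hash_t *rev_map, apr_pool_t *pool,
                     revnum_t from_rev, revnum_t to_rev)
{
  revnum_t *key = apr_palloc(pool, sizeof(*key));
  revnum_t *val = apr_palloc(pool, sizeof(*val));

  *key = from_rev;
  *val = to_rev;
  apr_hash_set(rev_map, key, sizeof(*key), val);
}

/* Return a pointer to the target revision recorded for FROM_REV, or
   NULL if no mapping was recorded. */
static revnum_t *
get_revision_mapping(apr_hash_t *rev_map, revnum_t from_rev)
{
  return apr_hash_get(rev_map, &from_rev, sizeof(from_rev));
}

The table itself would be created once with apr_hash_make(pool).  APR
hashes and compares the key as sizeof(revnum_t) raw bytes and stores only
the key pointer, so each key must live at a stable address for the life of
the table, hence the per-entry allocations.)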