>>>>> "BE" == Ben Escoto <[EMAIL PROTECTED]> >>>>> wrote the following on Tue, 04 Jun 2002 10:02:58 -0700
KE> When I finally took the time to properly read Rusty's
KE> "gzip-rsyncable" patch[1] while writing this mail, I discovered
KE> that it appears to use this same general technique, although the
KE> heuristic he has chosen is "the sum of the previous 4096 bytes
KE> of source text is divisible by 4096".  I think my heuristic
KE> could allow the gzips to sync up faster (after 12 identical
KE> bytes of source text following a change, compared to 4096),
KE> whilst still having the same compression ratio hit (resets every
KE> 4096 bytes, on average), but I've only come up with this today,
KE> so I haven't done as much thinking on the subject as Rusty -
KE> likely there is some "gotcha" that I haven't seen.

BE> Anyway, that is just an example; the mod 4096 scheme seems
BE> to be "more random" and less data dependent than yours, but I
BE> don't have a good way of saying this clearly, or of proving that
BE> his scheme maximizes this property.

Sorry, how about this: suppose the data is generated by a Bernoulli
process, so each byte is independent of every other and drawn from the
same probability distribution.  Then it appears the probability of his
scheme generating a reset at any given byte is still 1/4096 (the sum of
the previous 4096 bytes, taken mod 4096, should be close to uniformly
distributed for any reasonably spread byte distribution).  Yours,
however, may have anywhere from a 0 chance up to a
(8/12)^8 * (4/12)^4 ≈ 1/2075 chance, the maximum being reached when the
distribution happens to favour your trigger in that same 8-to-4
proportion (because you included more 1's than 0's - different people
may have more equally distributed initials...).  So at least in this
sense, his scheme is constant across different probability
distributions, while yours can vary more dramatically.

Of course, real data generally isn't generated by a Bernoulli process,
but I think this is a pretty good model, and the results would hold for
similar source types (e.g. a low-order Markov process).

--
Ben Escoto
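P.S.  Here is a rough Python simulation of the argument, in case it
helps.  The exact form of your heuristic isn't spelled out above, so
the 12-byte trigger below is only a made-up stand-in (a fixed pattern
over the low bits of the last 12 bytes, with eight 1's and four 0's,
chosen so its best case matches the (8/12)^8 * (4/12)^4 figure); the
mod-4096 rule follows Rusty's description directly.

# Rough simulation: per-byte reset rate of two heuristics on i.i.d.
# ("Bernoulli process") byte sources.  The mod-4096 rule is Rusty's
# ("sum of the previous 4096 bytes is divisible by 4096"); the pattern
# rule is a hypothetical stand-in for Kevin's 12-byte heuristic.
import random

WINDOW = 4096
PATTERN = (1,) * 8 + (0,) * 4   # hypothetical 12-byte trigger (low bits)
N = 1_000_000                   # synthetic bytes per experiment

def reset_rates(weights):
    """Empirical reset rates for bytes drawn i.i.d. from 0..255 with
    the given weights."""
    data = random.choices(range(256), weights=weights, k=N)

    # Rusty's rule: rolling sum of the previous WINDOW bytes, reset
    # whenever it is divisible by WINDOW.
    rolling = sum(data[:WINDOW])
    mod_hits = 0
    for i in range(WINDOW, N):
        if rolling % WINDOW == 0:
            mod_hits += 1
        rolling += data[i] - data[i - WINDOW]

    # Pattern rule: reset when the low bits of the last 12 bytes match.
    bits = tuple(b & 1 for b in data)
    pat_hits = sum(1 for i in range(12, N) if bits[i - 12:i] == PATTERN)

    return mod_hits / (N - WINDOW), pat_hits / (N - 12)

sources = {
    "uniform bytes":         [1] * 256,                               # P(low bit=1) = 1/2
    "odd-heavy (8/12 ones)": [2 if b % 2 else 1 for b in range(256)], # P(low bit=1) = 2/3
    "odd-light (1/10 ones)": [1 if b % 2 else 9 for b in range(256)], # P(low bit=1) = 1/10
}

for name, weights in sources.items():
    mod_rate, pat_rate = reset_rates(weights)
    print(f"{name:<22}  mod 4096: {mod_rate:.6f} (1/4096 = {1/4096:.6f})"
          f"  pattern: {pat_rate:.6f}")

For each source the mod-4096 rate should come out near 1/4096, while
the pattern rate should swing from essentially 0 up to about 1/2075,
depending on how well the byte distribution lines up with the trigger.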