On 07/22/2010 10:11 AM, Stefan Sperling wrote:
> So we should not have added the additional checksums in the first place?
> You might as well point fingers at the old backends for not storing checksums,
> rather than at Hyrum trying to do his best with what he had.

You have to define "his best" here.  Hyrum could have done his best to avoid
adding to the cost of dumpfile creation by passing FALSE for the 'force'
parameter to svn_fs_file_checksum().  Hyrum could have done his best to carry
self-validating information for all file contents by passing TRUE instead,
and then changing the surrounding code to expect a valid checksum from those
function calls.  Given the state of the code today, Hyrum didn't do his best
at either of those approaches.  :-)
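(For anyone without svn_fs.h handy, the choice boils down to something like
the following.  This is only a sketch of the two call shapes -- the variable
names are assumed from context, and it's not a quote of the actual dump.c
code:)

   /* Cheap: ask only for a checksum the backend already has stored.  My
      recollection is that with force == FALSE the FS may hand back NULL
      if no SHA1 was ever recorded for this node. */
   SVN_ERR(svn_fs_file_checksum(&checksum, svn_checksum_sha1,
                                root, path, FALSE /* force */, pool));

   /* Expensive: force == TRUE makes the FS compute the SHA1 by reading
      the full file contents whenever one isn't already stored. */
   SVN_ERR(svn_fs_file_checksum(&checksum, svn_checksum_sha1,
                                root, path, TRUE /* force */, pool));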
>> (Some of those may reduce away if the
>> backend is able to store the SHA1's forcibly generated from previous
>> requests.)
>
> Is this already implemented? So if people upgrade the repository format,
> over time the dump performance will get better again?

It probably is for BDB repositories.  It probably is not for FSFS
repositories.  I probably won't use this opportunity to gripe about FSFS.
Again.  Or maybe I will.

> Let me explain why I did this:
>
> The idea for this change came to me in a conversation with Michael Diers.
> He said he saw no way of verifying that svnsync created proper duplicates
> of a repository. He had seen svnsync breaking repositories at a customer
> site like this:
>
>   Transmitting file data .svnsync: Base checksum mismatch on 'foobar':
>      expected:  a9c9eb4994b50f9667e0c1509782df26
>        actual:  ae3957e324c0ee2228d76732c9a89747
[...]
> Obviously svnsync did verification *after* something got broken.
> But why didn't svnsync detect the checksum mismatch before committing
> a broken transaction? I don't know. They could not reproduce the problem
> either. But they've seen it and they are afraid of it happening again.

svnsync doesn't do any checksum verification.  That work is done by deeper
levels.  But you provide no support for the implication that Subversion
didn't attempt to verify checksums at all the reasonable times.  The error
you show above happens deep inside the FS implementations' DAG layer, as a
sanity check on a text-base before applying a delta to it.  It's probably
comparing the checksum stored in the FS for a given text-base against the
checksum coming across the wire as the delta-base checksum.

The FS layer doesn't accept untrusted checksums.  That is, you can't tell
the FS what the checksum for a file's contents is -- you can only give it
the file's contents, and you can tell it what you expect the checksum to be
after you store new contents.  I suspect the customer had actual repository
corruption that was only detectable by svnsync after the fact, because
svnsync had nothing to do (directly, at least) with that corruption in the
first place.
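(To make that concrete: the FS lets callers state their *expectations*, and
then does the verifying itself.  Roughly the shape of it -- a sketch against
the public API, not the actual svnsync code path, with the variables assumed
from surrounding context:)

   /* Write new contents for PATH in the transaction root TXN_ROOT.  The
      FS checks BASE_HEX against its own record of the existing text-base
      before applying the delta -- that's the "Base checksum mismatch"
      error above -- and checks RESULT_HEX against the fulltext that
      actually results.  The caller never gets to simply assert "the
      checksum is X" and have the FS take its word for it. */
   SVN_ERR(svn_fs_apply_textdelta(&handler, &handler_baton,
                                  txn_root, path,
                                  base_hex /* expected base MD5 */,
                                  result_hex /* expected result MD5 */,
                                  pool));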
> They trust Subversion with their data, and such problems don't exactly
> encourage them to keep trusting it. So I think we shouldn't dismiss the
> value of internal integrity checks entirely just because most parts of
> Subversion don't care [yet]. Of course such checks need to be placed
> in the right spots.

That's fair.  I certainly don't want to play down the importance of
Subversion's ability to reliably version folks' data!  I see the error
above as proof that Subversion is doing exactly that, not that it has
failed to do so.

> Though I agree that relying on content checksums to make sure data copied
> from one computer to another has been copied correctly is not that
> important anymore these days. We're not copying data across noisy serial
> links anymore, and have checksumming on many different levels (e.g.
> ethernet CRC and disk controller CRC).
>
> But what additional verification shields us from is subtle bugs in our
> own code that cause data corruption. These cannot be caught by lower
> layers and evidently seem to exist (or maybe the customer had a hardware
> problem, but I doubt that).
>
> So what's nice is that the content checksums allow verification of data
> read back from a freshly loaded repository, or even from a transaction
> before commit. I.e. the loader can make sure that the data that it wrote
> into the repository matches what the dump file expects. This can catch
> bugs in the loader code. My plan was to make svnsync do similar checks later.
>
> As for integrity of dump files themselves, we should tell paranoid
> people to sign dump files after they've been created, and verify the
> signatures independently of Subversion.

Okay.  It seems (if I'm reading you correctly) that we agree we shouldn't
be using checksums to verify the integrity of the dump stream, and that
the value to be had here is in verifying that the target repository, once
loaded, has the right data.

So may I suggest that, when dealing with property lists, it's just a lot
simpler (especially with our APIs) to compare the before-and-after property
lists than to trade in the currency of checksums?  For file contents, it's
arguably cheaper to do a checksum comparison than a byte-for-byte comparison
(which isn't even possible in --deltas mode), mostly because the checksum is
pre-computed.  But for properties?  We don't have the checksums handy (so
there's extra cost to compute them), we hold those things en totale in
memory all the time, and we have handy routines for finding diffs.  I dunno.
The property checksums just seem unnecessary to me in the whole scheme of
things.

(And I will further suggest that we fix the dump.c code to pass FALSE for
the 'force' parameter of its svn_fs_file_checksum() calls, thus restoring
the pre-Hyrum's-change behavior.)
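(To sketch what I mean on the property side -- this is only the shape of
the check a loader could do, with the variable names invented for
illustration; 'expected_props' would be the property hash parsed from the
dump stream:)

   apr_hash_t *actual_props;
   apr_array_header_t *prop_diffs;

   /* Fetch what the repository actually has for PATH and diff the two
      hashes; a non-empty diff list means the load went wrong. */
   SVN_ERR(svn_fs_node_proplist(&actual_props, root, path, pool));
   SVN_ERR(svn_prop_diffs(&prop_diffs, expected_props, actual_props, pool));

   if (prop_diffs->nelts > 0)
     return svn_error_createf(SVN_ERR_FS_GENERAL, NULL,
                              "Property mismatch on '%s' after load", path);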
-- 
C. Michael Pilato <cmpil...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand