On Wed, Feb 17, 2010 at 18:32, Neels J Hofmeyr <ne...@elego.de> wrote:
>...
> +1 -- I was hoping it'd make sense to go that way.
> - "always" internally checksum streams returned by pristine_read()
>   (until proven that performance is affected significantly)
> - Allow a boolean flag to switch off this checksumming
yup...

> - _read() knows the size of the pristine from the pristine file. It
>   will skip checksum comparison on premature close.

Unless it does a stat(), it doesn't know. But that isn't needed. The
checksumming stream just reads to EOF. If it never hits EOF, then it
never attempts to validate. Very simple.

>...
>>>> [ ] If the checksum is already known and the data exists in the local
>>>>     file system, there is no need to verify the checksum.
>>
>> I'm assuming you mean copying from a source in the local file system,
>> and writing that into the pristine store.
>>
>> Yes, if we have a checksum for that file, then we can just use it
>> since we (invariably) *just* computed it.
>
> Are you talking specifically about a tempfile? Because below, we all seem to
> agree that it is dangerous to compute a checksum from an Actual working copy
> file and to only copy it later. ?

Yes.

>>> That depends. If we are going to read it anyway...
>>> In most cases I agree. The filesystem also has verifications on bit errors.
>>> But you also have sed -i <.....>
>>
>> Not all filesystems do. That is why we record checksums for repository
>> contents. That "one bit error every five years" needs to be
>> detectable.
>
> ^ Further support for rather doing checksums with every read, where certain
> situations in the code get "special permission" to switch off checksumming.

I doubt working copies will suffer degradation like a server repository,
but ... yeah. Sure.

>...
>> I think a flag to pristine_read() may be the way to go.
>
> And I'd prefer to term it from the other end, as I stated above, i.e. to say
> "do checksums unless we know in advance that it makes no sense, and unless
> we know that it hits performance really hard".

Sure. I would also have no problem with a separate function like
svn_wc__db_pristine_read_unsafely() :-)

>...
> To prevent two concurrently running clients from installing the same
> pristine twice, a file lock seems more appropriate (with one wc.db per WC,
> benefit of even that is debatable, but with one wc.db per user that might be
> good).

Eep. I hadn't considered this. Okay. Forget all that I've written about
allowing faulty files, and writing directly to the target location.
Tempfile-then-move is best. Thanks.

>...
>>>> [ ] It is ok to open a write stream to the final location of the
>>>>     pristine file for a given checksum, no need to store in a
>>>>     pristine-tempfile first.
>>> I think we should say no here, to guarantee a stable behavior. If the file
>>> is in the pristine store it should be OK. We can't trust the initial write
>>> to always provide the full file and a valid hash of this file.
>>> We should calculate ourselves if it is possible that we receive an
>>> incomplete file.
>>
>> The schema allows for this, so it can/should be just fine.
>
> This kinda contradicts the impression I get from you saying "_write() has to
> go" and "we always copy to temp before/while calculating the checksum".
> (s.b. ...)

_write() just can't function properly because of the
stream-close-means-WHAT? issue. That's independent of allowing faulty
files into the store.

>>> And calculating ourselves defines that we don't know the final location.
>>
>> Usually, we assemble a file in the temp area that is detranslated. At
>> that time, we can compute the checksum. And then we install the file
>> as a pristine.
>>
>> We might have a file incoming from the repository with a known
>> checksum. We can spool that directly into the pristine store.
>
> (...) But when we receive a pristine over the network, we should surely
> verify its checksum before writing it to the final pristine file location?
>
> I tend to favor doing all partial work in the pristine tempdir and OS-moving
> into place once the file is complete.

Agreed (now), for the multi-process issue.

>...
>>> ...
>>>> SHA1/MD5 compat.
>>>> [ ] When talking to an old server, we use only MD5 checksums to reference
>>>>     the pristine in the pristine store, nevermind hash collisions.
>>
>> If we don't have a SHA1 handy, then I don't mind this. But I think we
>> should store stuff using *only* MD5 or *only* SHA1. Mixing them could
>> end up with duplicates. Not sure if that is a problem, but it feels
>> like it could be.
>
> I was hoping to get a -1 on this. To me it feels preferable to make the
> pristine store *always* use SHA1 for the hash index. I'd be totally against
> having a mixed hash index. Given that streaming to create a checksum when
> receiving from the repos should be cheap, let's only use SHA1 indexes?

That's fine. Just saying I don't mind either way, as long as we stick to
*one* form of checksums.

>>> ...
>>>> [ ] When talking to an old server, we use SHA1 to index the pristine
>>>>     store and also store the MD5 checksum for server comm. We create the
>>>>     SHA1 checksum locally after having verified the MD5 checksum.
>>> +1. This is the original WC-NG plan
>>
>> I wouldn't go *that* far. There was definitely some hand-waving going
>> on. The MD5 in the editor API is a severely complicating item. I had
>> hoped to fix that particular problem with Ev2.
>
> ...so you disagree.

nono... was just disagreeing with Bert's assertion that it was "the
original WC-NG plan".

> What about md5 collisions? Are they unlikely enough?
> Also, with only SHA1 in use, we don't need to maintain code that figures out
> which kind of checksum is used in a pristine store.
>
> What say you?

My assumption is that we'll figure out how to get a SHA-1 for any file
we wish to install, despite the Editor API problems, and (thus) we'll
use SHA-1 for the index. And that further means that we'll store MD5's
into the metadata store.

But if you *can't* always get a SHA1 easily, and determine it would be
easier *today* to use MD5 for the indexes? Fine.
If we can get Ev2 into 1.7 and solve some of these problems? Awesomeness.

>>>> [ ] When talking to a 1.7 server, we will receive and use only SHA1
>>>>     checksums and don't store any MD5 checksums.
>>> There is currently no delta editor upgrade for 1.7, so we still have to send
>>> MD5 to our 1.7 server as it is now. And to wrap old editors with new editors
>>> (and reversed) we probably have to send both hashes whenever possible.
>>> (The repository filesystem already stores SHA1 hashes since 1.6 or ?)
>>
>> I think we need to keep MD5 checksums around, in addition to the SHA1
>> for compat with the Editor v1 APIs. Sigh.
>
> Group sigh. Bah.

You're just thinking too small. Work on Editor v2!! ... hehehe

Actually, I suspect that we *may* want to use Ev2 in one or two places
to make our lives manageable. Not a full conversion of all editors.
Just a targeted, surgical replacement of (say) the update editor.

Cheers,
-g