>If there were permanently bad memory locations, surely the diagnostics
>would reveal them. Here's an interesting paper on memory errors:
>http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf
>Given the inevitability of relatively frequent transient memory
>errors, I would think it behooves the file system to minimize the
>effects of such errors. But I won't belabor the point except to
>suggest that the cost of adding the suggested step would not be
>very expensive (either to implement or run).
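
As I read the suggestion above, the extra step is to keep a private copy of
the buffer and re-check its checksum just before the write is issued, so a
bit flip between checksum time and write time is caught instead of being
committed (with a matching checksum) to both sides of the mirror. A minimal
sketch of that idea, in Python for brevity; write_block() and submit_io()
are made-up names, and SHA-256 merely stands in for the file system's real
checksum. This is not ZFS code.

import hashlib

def write_block(buf: bytes, submit_io) -> None:
    checksum = hashlib.sha256(buf).digest()  # checksum to be recorded in metadata
    snapshot = bytes(buf)                    # the extra copy being debated below

    # ... time passes while the write sits in an I/O queue ...

    if hashlib.sha256(snapshot).digest() != checksum:
        # The copy no longer matches the checksum we were about to record;
        # the memory was disturbed after the checksum was computed.
        raise IOError("buffer changed between checksum and submission")
    submit_io(snapshot, checksum)            # data and checksum go out together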
I'm still not clear what you win. You copy the data (which isn't actually
that cheap, especially when running a load which uses a lot of memory
bandwidth). And now what? You can't write two different checksums; I mean,
we're mirroring the data, so it MUST BE THE SAME. (A different checksum
would be wrong: I don't think ZFS will allow different checksums for
different sides of a mirror.)

You are assuming that the error is the memory being modified after
computing the checksums; I would say that is unlikely. I think it's a bit
more likely that the data gets corrupted when it's handled by the disk
controller or the disk itself. (The data is continuously re-written by the
DRAM controller anyway.)

>Memory diagnostics ran for a full 12 hours with no errors. Same goes
>for both disks, using Solaris format/analyze/verify. So far, after
>creating 400,000 files, two files had permanent, apparently truly
>unrecoverable errors and could not be read by anything.

It would have been nice if we had been able to recover the contents of the
files; if you also know what was supposed to be there, you could diff the
two and then we could find out what went wrong.

>Now it gets really funky. I detached one of the disks, and then found
>it couldn't be reattached. Turns out there is a rounding problem with
>Solaris fdisk (run from format) that can cause identical partitions on
>identical disks to have different sizes. I used the Linux sfdisk
>utility to repair the MBR and fix the Solaris2 partition sizes. Then
>it was possible to reattach the disk. Unfortunately it wasn't possible
>to boot from the result, but a reinstall went perfectly with no ZFS
>errors being reported at all. So it appears that the problem may be
>with the OpenSolaris fdisk. Is this worth reporting as a bug? It is
>likely to be quite hard to reproduce...

There might be some skeletons buried in the IDE device drivers. I once had
a disk which broke (well, one or more sectors were bad), so I added the
"bad sectors" in format. But the disk still seemed to be bad, even after I
ran the "check disk" tool from Western Digital: the disk would hang when I
read certain parts of it. Then I copied the disk to an identical disk, and
the copy hung in the same way. Then I "zapped" the copy, relabeled it,
copied the data per slice (not the whole disk, but slice by slice), and the
new disk worked. So while the first disk was broken (the Western Digital
tool moved some sectors somewhere else), adding "bad sectors" in Solaris
broke "something else".

Casper
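
For the fdisk rounding problem quoted above, one quick way to see the size
mismatch is to compare the raw MBR partition entries of the two disks. A
small sketch, assuming Python is available; the device paths are only
examples of whole-disk p0 devices, it needs to run as root, and it reads
just the first 512 bytes of each device:

import struct
import sys

def mbr_partitions(dev):
    """Return (slot, type, start LBA, sector count) for the four MBR entries."""
    with open(dev, "rb") as f:
        mbr = f.read(512)                         # the MBR is the first sector
    entries = []
    for i in range(4):
        entry = mbr[446 + 16 * i : 446 + 16 * (i + 1)]
        ptype = entry[4]                          # 0xbf is a Solaris2 partition
        start, count = struct.unpack_from("<II", entry, 8)
        entries.append((i + 1, ptype, start, count))
    return entries

if __name__ == "__main__":
    # e.g. python mbr_compare.py /dev/rdsk/c0d0p0 /dev/rdsk/c1d0p0
    for dev in sys.argv[1:]:
        print(dev)
        for slot, ptype, start, count in mbr_partitions(dev):
            print("  part %d: type 0x%02x  start %10d  size %10d sectors"
                  % (slot, ptype, start, count))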