Re: [zfs-discuss] Errors on mirrored drive

2009-05-29 Thread Frank Middleton
On 05/26/09 13:07, Kjetil Torgrim Homme wrote: also thank you, all ZFS developers, for your great job :-) I'll second that! A great achievement - puts Solaris in a league of its own, so much so, you'd want to run it on all your hardware, however crappy the hardware might be ;-) There are too m

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Kjetil Torgrim Homme
Frank Middleton writes: > Exactly. My whole point. And without ECC there's no way of knowing. > But if the data is damaged /after/ checksum but /before/ write, then > you have a real problem... we can't do much to protect ourselves from damage to the data itself (an extra copy in RAM will help l

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Richard Elling
Frank brings up some interesting ideas, some of which might need some additional thoughts... Frank Middleton wrote: On 05/23/09 10:21, Richard Elling wrote: This forum is littered with claims of "zfs checksums are broken" where the root cause turned out to be faulty hardware or firmware in the

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Bob Friesenhahn
On Tue, 26 May 2009, Frank Middleton wrote: Just asking if an option for machines with no ecc and their inevitable memory errors is a reasonable thing to suggest in an RFE. Machines lacking ECC do not suffer from "inevitable memory errors". Memory errors are not like death and taxes. Exactl

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Darren J Moffat
Bob Friesenhahn wrote: On Tue, 26 May 2009, Frank Middleton wrote: Just asking if an option for machines with no ecc and their inevitable memory errors is a reasonable thing to suggest in an RFE. Machines lacking ECC do not suffer from "inevitable memory errors". Memory errors are not like de

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Toby Thain
On 26-May-09, at 10:21 AM, Frank Middleton wrote: On 05/26/09 03:23, casper@sun.com wrote: And where exactly do you get the second good copy of the data? From the first. And if it is already bad, as noted previously, this is no worse than the UFS/ext3 case. If you want total freedom fro

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Toby Thain
On 25-May-09, at 11:16 PM, Frank Middleton wrote: On 05/22/09 21:08, Toby Thain wrote: Yes, the important thing is to *detect* them, no system can run reliably with bad memory, and that includes any system with ZFS. Doing nutty things like calculating the checksum twice does not buy anything

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Bob Friesenhahn
On Tue, 26 May 2009, Frank Middleton wrote: 1) could be fixed in the documentation - "ZFS should be used with caution on machines with no ECC since random bit flips can cause unrecoverable checksum failures on mirrored drives". Or "ZFS isn't supported on machines with memory that has no ECC".

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Frank Middleton
On 05/26/09 03:23, casper@sun.com wrote: And where exactly do you get the second good copy of the data? From the first. And if it is already bad, as noted previously, this is no worse than the UFS/ext3 case. If you want total freedom from this class of errors, use ECC. If you copy the c

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Frank Middleton
On 05/23/09 10:21, Richard Elling wrote: This forum is littered with claims of "zfs checksums are broken" where the root cause turned out to be faulty hardware or firmware in the data path. I think that before you should speculate on a redesign, we should get to the root cause. The hardware

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Casper . Dik
>On 05/22/09 21:08, Toby Thain wrote: >> Yes, the important thing is to *detect* them, no system can run reliably >> with bad memory, and that includes any system with ZFS. Doing nutty >> things like calculating the checksum twice does not buy anything of >> value here. > >All memory is "bad" if i

Re: [zfs-discuss] Errors on mirrored drive

2009-05-25 Thread Frank Middleton
On 05/22/09 21:08, Toby Thain wrote: Yes, the important thing is to *detect* them, no system can run reliably with bad memory, and that includes any system with ZFS. Doing nutty things like calculating the checksum twice does not buy anything of value here. All memory is "bad" if it doesn't hav

Re: [zfs-discuss] Errors on mirrored drive

2009-05-23 Thread Richard Elling
This forum is littered with claims of "zfs checksums are broken" where the root cause turned out to be faulty hardware or firmware in the data path. I think that before you should speculate on a redesign, we should get to the root cause. Frank Middleton wrote: There have been a number of th

Re: [zfs-discuss] Errors on mirrored drive

2009-05-23 Thread Joerg Schilling
casper@sun.com wrote: > > > >> If a memory that can pass diagnostics for 24 hours at a > >> stretch can cause glitches in huge datastreams, then IMO it > >> behooves ZFS to defend itself against them. Buffering disk > >> i/o on machines with no ECC seems like reasonably cheap > >> insurance ag

Re: [zfs-discuss] Errors on mirrored drive

2009-05-23 Thread Casper . Dik
>> If a memory that can pass diagnostics for 24 hours at a >> stretch can cause glitches in huge datastreams, then IMO it >> behooves ZFS to defend itself against them. Buffering disk >> i/o on machines with no ECC seems like reasonably cheap >> insurance against a whole class of errors, and coul

Re: [zfs-discuss] Errors on mirrored drive

2009-05-22 Thread Toby Thain
On 22-May-09, at 5:24 PM, Frank Middleton wrote: There have been a number of threads here on the reliability of ZFS in the face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC) hardware, but isn't it reasonable to expect it to run well on something less well engineered?

Re: [zfs-discuss] Errors on mirrored drive

2009-05-22 Thread Frank Middleton
There have been a number of threads here on the reliability of ZFS in the face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC) hardware, but isn't it reasonable to expect it to run well on something less well engineered? I am a real ZFS fan, and I'd hate to see folks trash it be

Re: [zfs-discuss] Errors on mirrored drive

2009-04-21 Thread Casper . Dik
>If there were permanently bad memory locations, surely the diagnostics >would reveal them. Here's an interesting paper on memory errors: >http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf >Given the inevitability of relatively frequent transient memory >errors, I would think it behooves

Re: [zfs-discuss] Errors on mirrored drive

2009-04-21 Thread Frank Middleton
On 04/17/09 12:37, casper@sun.com wrote: I'd like to submit an RFE suggesting that data + checksum be copied for mirrored writes, but I won't waste anyone's time doing so unless you think there is a point. One might argue that a machine this flaky should be retired, but it is actually working

Re: [zfs-discuss] Errors on mirrored drive

2009-04-17 Thread Toby Thain
On 17-Apr-09, at 11:49 AM, Frank Middleton wrote: ... One might argue that a machine this flaky should be retired, but it is actually working quite well, If it has bad memory, you won't get much useful work done on it until the memory is replaced - unless you want to risk your data with r

Re: [zfs-discuss] Errors on mirrored drive

2009-04-17 Thread Casper . Dik
>I'd like to submit an RFE suggesting that data + checksum be copied for >mirrored writes, but I won't waste anyone's time doing so unless you >think there is a point. One might argue that a machine this flaky should >be retired, but it is actually working quite well, and perhaps represents >not e

Re: [zfs-discuss] Errors on mirrored drive

2009-04-17 Thread Frank Middleton
On 04/16/09 04:39, casper@sun.com wrote: You really believe that the copy was copied and checksummed twice before writing to the disk? Of course not. Copying the data doesn't help; both pieces of memory need to be good. It's checksummed once. If OpenSolaris succeeds in being significant

Re: [zfs-discuss] Errors on mirrored drive

2009-04-16 Thread Richard Elling
Frank Middleton wrote: Experimenting with OpenSolaris on an elderly PC with equally elderly drives, zpool status shows errors after a pkg image-update followed by a scrub. It is entirely possible that one of these drives is flaky, but surely the whole point of a zfs mirror is to avoid this? It se

Re: [zfs-discuss] Errors on mirrored drive

2009-04-16 Thread Casper . Dik
>Quite. Sounds like an architectural problem. This old machine probably >doesn't have ecc memory (AFAIK still rare on most PCs), but it is on >a serial UPS and isolated from shocks, and this has happened more >than once. These drives on this machine recently passed both the purge >and verify cycl

Re: [zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Toby Thain
On 15-Apr-09, at 8:31 PM, Frank Middleton wrote: On 04/15/09 14:30, Bob Friesenhahn wrote: On Wed, 15 Apr 2009, Frank Middleton wrote: zpool status shows errors after a pkg image-update followed by a scrub. If a corruption occurred in the main memory, the backplane, or the disk controller

Re: [zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Frank Middleton
On 04/15/09 14:30, Bob Friesenhahn wrote: On Wed, 15 Apr 2009, Frank Middleton wrote: zpool status shows errors after a pkg image-update followed by a scrub. If a corruption occurred in the main memory, the backplane, or the disk controller during the writes to these files, then the original d

Re: [zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Bob Friesenhahn
On Wed, 15 Apr 2009, Frank Middleton wrote: Experimenting with OpenSolaris on an elderly PC with equally elderly drives, zpool status shows errors after a pkg image-update followed by a scrub. It is entirely possible that one of these drives is flaky, but surely the whole point of a zfs mirror i

[zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Frank Middleton
Experimenting with OpenSolaris on an elderly PC with equally elderly drives, zpool status shows errors after a pkg image-update followed by a scrub. It is entirely possible that one of these drives is flaky, but surely the whole point of a zfs mirror is to avoid this? It seems unlikely that both d
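
The scenario debated throughout this thread (data damaged in RAM after the checksum is computed but before the mirrored write) can be illustrated with a toy model. This is only a sketch under stated assumptions, not ZFS's actual write path: SHA-256 stands in for whatever checksum the pool uses, a Python list stands in for the two sides of the mirror, and the names checksum, flip_bit, mirrored_write and scrub are hypothetical helpers invented for this illustration.

```python
import hashlib


def checksum(block: bytes) -> str:
    # Stand-in checksum (assumption): SHA-256 over the block contents.
    return hashlib.sha256(block).hexdigest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    # Return a copy of the block with exactly one bit inverted.
    buf = bytearray(block)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def mirrored_write(block: bytes, corrupt_after_checksum: bool):
    # The checksum is computed once, from the buffer in memory.
    stored_sum = checksum(block)
    if corrupt_after_checksum:
        # Simulated memory error between checksum and write.
        block = flip_bit(block, 0)
    # Both sides of the (toy) mirror receive the same in-memory buffer.
    return stored_sum, [block, block]


def scrub(stored_sum: str, mirror: list) -> list:
    # Re-read each side and compare against the stored checksum.
    return [checksum(copy) == stored_sum for copy in mirror]


if __name__ == "__main__":
    data = b"x" * 512

    # Case 1: bit flip in RAM after checksumming; both sides fail.
    print(scrub(*mirrored_write(data, corrupt_after_checksum=True)))

    # Case 2: clean write, then one disk goes bad; the other side
    # still verifies and could be used to repair the damaged copy.
    stored_sum, mirror = mirrored_write(data, corrupt_after_checksum=False)
    mirror[0] = flip_bit(mirror[0], 3)
    print(scrub(stored_sum, mirror))
```

In this model a fault on one drive leaves one verifiable copy, which is the case a mirror is meant to handle, whereas a fault in the shared buffer leaves both copies identical and both failing verification, with nothing good to repair from.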