On 05/26/09 13:07, Kjetil Torgrim Homme wrote:
also thank you, all ZFS developers, for your great job :-)
I'll second that! A great achievement - puts Solaris in a league of
its own, so much so, you'd want to run it on all your hardware,
however crappy the hardware might be ;-)
There are too m
Frank Middleton writes:
> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...
we can't do much to protect ourselves from damage to the data itself
(an extra copy in RAM will help l
Frank brings up some interesting ideas, some of which might
need some additional thoughts...
Frank Middleton wrote:
On 05/23/09 10:21, Richard Elling wrote:
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the
On Tue, 26 May 2009, Frank Middleton wrote:
Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.
Machines lacking ECC do not suffer from "inevitable memory errors".
Memory errors are not like death and taxes.
Exactl
Bob Friesenhahn wrote:
On Tue, 26 May 2009, Frank Middleton wrote:
Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.
Machines lacking ECC do not suffer from "inevitable memory errors".
Memory errors are not like de
On 26-May-09, at 10:21 AM, Frank Middleton wrote:
On 05/26/09 03:23, casper@sun.com wrote:
And where exactly do you get the second good copy of the data?
From the first. And if it is already bad, as noted previously, this
is no worse than the UFS/ext3 case. If you want total freedom fro
On 25-May-09, at 11:16 PM, Frank Middleton wrote:
On 05/22/09 21:08, Toby Thain wrote:
Yes, the important thing is to *detect* them, no system can run reliably
with bad memory, and that includes any system with ZFS. Doing nutty
things like calculating the checksum twice does not buy anything
On Tue, 26 May 2009, Frank Middleton wrote:
1) could be fixed in the documentation - "ZFS should be used with caution
on machines with no ECC since random bit flips can cause unrecoverable
checksum failures on mirrored drives". Or "ZFS isn't supported on
machines with memory that has no ECC".
On 05/26/09 03:23, casper@sun.com wrote:
And where exactly do you get the second good copy of the data?
From the first. And if it is already bad, as noted previously, this
is no worse than the UFS/ext3 case. If you want total freedom from
this class of errors, use ECC.
If you copy the c
On 05/23/09 10:21, Richard Elling wrote:
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
I think that before you speculate on a redesign, we should get to
the root cause.
The hardware
>On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them, no system can run reliably
>> with bad memory, and that includes any system with ZFS. Doing nutty
>> things like calculating the checksum twice does not buy anything of
>> value here.
>
>All memory is "bad" if i
On 05/22/09 21:08, Toby Thain wrote:
Yes, the important thing is to *detect* them, no system can run reliably
with bad memory, and that includes any system with ZFS. Doing nutty
things like calculating the checksum twice does not buy anything of
value here.
All memory is "bad" if it doesn't hav
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
I think that before you speculate on a redesign, we should get to
the root cause.
Frank Middleton wrote:
There have been a number of th
casper@sun.com wrote:
>
>
> >> If a memory that can pass diagnostics for 24 hours at a
> >> stretch can cause glitches in huge datastreams, then IMO it
> >> behooves ZFS to defend itself against them. Buffering disk
> >> i/o on machines with no ECC seems like reasonably cheap
> >> insurance ag
>> If a memory that can pass diagnostics for 24 hours at a
>> stretch can cause glitches in huge datastreams, then IMO it
>> behooves ZFS to defend itself against them. Buffering disk
>> i/o on machines with no ECC seems like reasonably cheap
>> insurance against a whole class of errors, and coul
On 22-May-09, at 5:24 PM, Frank Middleton wrote:
There have been a number of threads here on the reliability of ZFS in the
face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
hardware, but isn't it reasonable to expect it to run well on something
less well engineered?
There have been a number of threads here on the reliability of ZFS in the
face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
hardware, but isn't it reasonable to expect it to run well on something
less well engineered? I am a real ZFS fan, and I'd hate to see folks
trash it be
>If there were permanently bad memory locations, surely the diagnostics
>would reveal them. Here's an interesting paper on memory errors:
>http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf
>Given the inevitability of relatively frequent transient memory
>errors, I would think it behooves
On 04/17/09 12:37, casper@sun.com wrote:
I'd like to submit an RFE suggesting that data + checksum be copied for
mirrored writes, but I won't waste anyone's time doing so unless you
think there is a point. One might argue that a machine this flaky should
be retired, but it is actually working
On 17-Apr-09, at 11:49 AM, Frank Middleton wrote:
... One might argue that a machine this flaky should
be retired, but it is actually working quite well,
If it has bad memory, you won't get much useful work done on it until
the memory is replaced - unless you want to risk your data with
r
>I'd like to submit an RFE suggesting that data + checksum be copied for
>mirrored writes, but I won't waste anyone's time doing so unless you
>think there is a point. One might argue that a machine this flaky should
>be retired, but it is actually working quite well, and perhaps represents
>not e
On 04/16/09 04:39, casper@sun.com wrote:
You really believe that the copy was copied and checksummed twice before
writing to the disk? Of course not. Copying the data doesn't help;
both pieces of memory need to be good. It's checksummed once.
If OpenSolaris succeeds in being significant
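
The point above (the block is checksummed once, so copying it afterwards does not help) can be illustrated with a small sketch. This is a toy model, not ZFS code: SHA-256 stands in for whatever checksum the pool uses, and the names mirrored_write, readable_copies and flip_bit are hypothetical helpers invented for this illustration. It assumes a single bit flips in RAM after the checksum is computed but before either side of the mirror is written.

```python
import hashlib


def checksum(buf: bytes) -> bytes:
    # SHA-256 is a stand-in here; the thread does not say which checksum is in use.
    return hashlib.sha256(buf).digest()


def flip_bit(buf: bytes, bit: int) -> bytes:
    # Simulate a transient single-bit memory error in the buffer.
    out = bytearray(buf)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def mirrored_write(buf: bytes, corrupt_after_checksum: bool) -> dict:
    # The checksum is computed once, over the buffer as it is at that moment.
    stored_sum = checksum(buf)
    if corrupt_after_checksum:
        # Hypothetical fault: the buffer changes after checksumming, before the write.
        buf = flip_bit(buf, 3)
    # Both sides of the mirror are written from the same (possibly damaged) buffer.
    return {"checksum": stored_sum, "side_a": buf, "side_b": buf}


def readable_copies(block: dict) -> list:
    # On read, a side is usable only if it still matches the stored checksum.
    return [
        name
        for name in ("side_a", "side_b")
        if checksum(block[name]) == block["checksum"]
    ]


good = mirrored_write(b"payload" * 8, corrupt_after_checksum=False)
bad = mirrored_write(b"payload" * 8, corrupt_after_checksum=True)
print(readable_copies(good))
print(readable_copies(bad))
```

In this model the damaged block fails verification on both sides of the mirror, so there is no good copy left to repair from, which matches the symptom described in the thread. If the flip happened before the checksum was computed, the checksum would match the damaged data and nothing would be reported at all.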
Frank Middleton wrote:
Experimenting with OpenSolaris on an elderly PC with equally
elderly drives, zpool status shows errors after a pkg image-update
followed by a scrub. It is entirely possible that one of these
drives is flaky, but surely the whole point of a zfs mirror is
to avoid this? It se
>Quite. Sounds like an architectural problem. This old machine probably
>doesn't have ecc memory (AFAIK still rare on most PCs), but it is on
>a serial UPS and isolated from shocks, and this has happened more
>than once. These drives on this machine recently passed both the purge
>and verify cycl
On 15-Apr-09, at 8:31 PM, Frank Middleton wrote:
On 04/15/09 14:30, Bob Friesenhahn wrote:
On Wed, 15 Apr 2009, Frank Middleton wrote:
zpool status shows errors after a pkg image-update
followed by a scrub.
If a corruption occurred in the main memory, the backplane, or the disk
controller
On 04/15/09 14:30, Bob Friesenhahn wrote:
On Wed, 15 Apr 2009, Frank Middleton wrote:
zpool status shows errors after a pkg image-update
followed by a scrub.
If a corruption occurred in the main memory, the backplane, or the disk
controller during the writes to these files, then the original d
On Wed, 15 Apr 2009, Frank Middleton wrote:
Experimenting with OpenSolaris on an elderly PC with equally
elderly drives, zpool status shows errors after a pkg image-update
followed by a scrub. It is entirely possible that one of these
drives is flaky, but surely the whole point of a zfs mirror i
Experimenting with OpenSolaris on an elderly PC with equally
elderly drives, zpool status shows errors after a pkg image-update
followed by a scrub. It is entirely possible that one of these
drives is flaky, but surely the whole point of a zfs mirror is
to avoid this? It seems unlikely that both d