> Let's not be too quick to assign blame, or to think that perfecting
> the behaviour is straightforward or even possible.
>
> Start introducing random $20 components and you begin to dilute the
> quality and predictability of the composite system's behaviour.
>
> But this NEVER happens on Linux *g*
Many mid-range/high-end RAID controllers work by having a small timeout on
individual disk I/O operations. If the disk doesn't respond quickly, they'll
issue an I/O to the redundant disk(s) to get the data back to the host in a
reasonable time. Often they'll change parameters on the disk to limit how long
the drive spends on internal error recovery.
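A minimal sketch of that redundant-read-on-timeout behaviour, purely as an
illustration: the 0.5 s deadline is an arbitrary assumption, and real
controllers implement this in firmware, not in host-side Python.

import concurrent.futures as cf

DEADLINE_S = 0.5  # assumed per-I/O deadline before hedging to the mirror

def read_block(disk, lba):
    # Placeholder for a real disk read; returns the block's bytes.
    return disk.read(lba)

def hedged_read(primary, mirror, lba):
    # Ask the primary first; if it has not answered within the deadline,
    # issue the same read to the mirror and take whichever finishes first.
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(read_block, primary, lba)
        try:
            return first.result(timeout=DEADLINE_S)
        except cf.TimeoutError:
            second = pool.submit(read_block, mirror, lba)
            done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
            return done.pop().result()
    # Note: leaving the "with" block also waits for the slower request; a real
    # implementation would cancel or ignore it.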
> "jl" == Jonathan Loran <[EMAIL PROTECTED]> writes:
jl> Fe = 46% failures/month * 12 months = 5.52 failures
the original statistic wasn't of this kind. It was ``likelihood a
single drive will experience one or more failures within 12 months''.
so, you could say, ``If I have a thousand drives
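To make the distinction concrete (the numbers here are only an illustration,
not the original statistic): a per-drive probability of seeing at least one
failure in 12 months tells you how many drives in a population are affected,
not how often any one drive fails.

p_one_or_more = 0.46   # assumed: P(a drive sees >= 1 failure within 12 months)
n_drives = 1000

expected_affected = n_drives * p_one_or_more
print(f"~{expected_affected:.0f} of {n_drives} drives expected to see at least one failure")
# ~460 drives -- but this says nothing about how many failures each of them had.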
Miles Nordin wrote:
> What is a ``failure rate for a time interval''?
>
>
Failure rate => Failures/unit time
Failure rate for a time interval => (Failures/unit time) * time
For example, if we have a failure rate:
Fr = 46% failures/month
Then the expectation value of a failure in one year is:
Fe = Fr * 12 months = 46% failures/month * 12 months = 5.52 failures
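Worked out with the numbers above, plus the extra step the thread is arguing
about: an expected count of failures per drive-year is not the same number as
the probability of at least one failure in that year. The Poisson conversion
below is my assumption, not something either poster stated.

import math

rate_per_month = 0.46                          # failures/month, from Fr above
expected_failures_year = rate_per_month * 12   # Fe = 5.52 expected failures
p_at_least_one = 1 - math.exp(-expected_failures_year)  # assumes Poisson arrivals

print(f"expected failures per drive-year: {expected_failures_year:.2f}")
print(f"P(>=1 failure in a year), Poisson assumption: {p_at_least_one:.3f}")  # ~0.996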
> "rm" == Robert Milkowski <[EMAIL PROTECTED]> writes:
rm> Please look for slides 23-27 at
rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf
yeah, ok, ONCE AGAIN, I never said that checksums are worthless.
relling: some drives don't return errors on unrecoverable read even
> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
re> There is no error in my math. I presented a failure rate for
re> a time interval,
What is a ``failure rate for a time interval''?
AIUI, the failure rate for a time interval is 0.46% / yr, no matter how
many drives you have.
Robert Milkowski wrote:
> Hello Miles,
>
> Wednesday, August 27, 2008, 10:51:49 PM, you wrote:
>
> MN> It's not really enough for me, but what's more the case doesn't match
> MN> what we were looking for: a device which ``never returns error codes,
> MN> always returns silently bad data.'' I asked
Hello Miles,
Wednesday, August 27, 2008, 10:51:49 PM, you wrote:
MN> It's not really enough for me, but what's more the case doesn't match
MN> what we were looking for: a device which ``never returns error codes,
MN> always returns silently bad data.'' I asked for this because you said
MN> ``How
Miles Nordin wrote:
> re> Indeed. Intuitively, the AFR and population is more easily
> re> grokked by the masses.
>
> It's nothing to do with masses. There's an error in your math. It's
> not right under any circumstance.
>
There is no error in my math. I presented a failure rate for a time interval,
> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
re> I really don't know how to please you.
dd from the raw device instead of through ZFS would be better. If you
could show that you can write data to a sector, and read back
different data, without getting an error, over and over, I'd
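The test being asked for could be scripted roughly as below. This is only a
sketch of the idea: the device path is a made-up placeholder, it is destructive
to that sector, and it assumes a raw character device so the read-back is not
served from a cache.

import os

DEVICE = "/dev/rdsk/cXtYdZs0"   # hypothetical scratch device -- never a disk holding data
SECTOR = 512
OFFSET = 1000 * SECTOR          # arbitrary sector-aligned offset
pattern = b"\xA5" * SECTOR

fd = os.open(DEVICE, os.O_RDWR | os.O_SYNC)
try:
    for attempt in range(100):                    # "over and over"
        os.pwrite(fd, pattern, OFFSET)            # write a known pattern
        readback = os.pread(fd, SECTOR, OFFSET)   # read the same sector back
        if readback != pattern:
            # The failure mode under discussion: different data, no I/O error raised.
            print(f"attempt {attempt}: silently read back different data")
finally:
    os.close(fd)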
Richard Elling writes:
> Ian Collins wrote:
>> Richard Elling writes:
>>>
>>> I think the proposed timeouts here are too short, but the idea has
>>> merit. Note that such a preemptive read will have negative performance
>>> impacts for high-workload systems, so it will not be a given that people
>>> will want this enabled by default.
Ian Collins wrote:
> Richard Elling writes:
>>
>> I think the proposed timeouts here are too short, but the idea has
>> merit. Note that such a preemptive read will have negative performance
>> impacts for high-workload systems, so it will not be a given that people
>> will want this enabled by default.
Richard Elling writes:
>
> I think the proposed timeouts here are too short, but the idea has
> merit. Note that such a preemptive read will have negative performance
> impacts for high-workload systems, so it will not be a given that people
> will want this enabled by default. Designing such a
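A back-of-the-envelope way to see the performance concern (all numbers are
assumptions for illustration): if the preemptive-read deadline sits at the
p-th latency percentile of a healthy drive, roughly (1 - p) of reads will fire
an extra, redundant backend read.

read_iops = 20_000            # assumed backend read rate on a busy system
deadline_percentile = 0.99    # deadline chosen at a healthy drive's p99 latency

extra_reads_per_sec = read_iops * (1 - deadline_percentile)
print(f"~{extra_reads_per_sec:.0f} extra backend reads/s just from hedging")  # ~200 here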
Miles Nordin wrote:
>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>
>
> >> If you really mean there are devices out there which never
> >> return error codes, and always silently return bad data, please
> >> tell us which one and the story of when you encountered it,
Toby Thain writes:
>
> On 27-Aug-08, at 7:21 PM, Ian Collins wrote:
>
>> Miles Nordin writes:
>>
>>>
>>> In addition, I'm repeating myself like crazy at this point, but ZFS
>>> tools used for all pools like 'zpool status' need to not freeze when a
>>> single pool, or single device within a pool, is unavailable or slow,
> "t" == Tim <[EMAIL PROTECTED]> writes:
t> Except he was, and is referring to a non-root disk.
wait, what? his root disk isn't plugged into the pci-ide controller?
t> LVM hardly changes the way devices move around in Linux,
fine, be pedantic. It makes systems boot and mount al
On Wed, Aug 27, 2008 at 5:39 PM, Toby Thain <[EMAIL PROTECTED]> wrote:
>
>
> Let's not be too quick to assign blame, or to think that perfecting
> the behaviour is straightforward or even possible.
>
> Traditionally, systems bearing 'enterprisey' expectations were/are
> integrated hardware and software
On Wed, 27 Aug 2008, Miles Nordin wrote:
>
> In some sense the disk drivers and ZFS have different goals. The goal
> of drivers should be to keep marginal disk/cabling/... subsystems
> online as aggressively as possible, while the goal of ZFS should be to
> notice and work around slightly-failing
On Wed, Aug 27, 2008 at 5:33 PM, Miles Nordin <[EMAIL PROTECTED]> wrote:
> > "t" == Tim <[EMAIL PROTECTED]> writes:
>
> t> Solaris does not do this.
>
> yeah but the locators for local disks are still based on
> pci/controller/channel not devid, so the disk will move to a different
> device name if he changes BIOS from pci-ide to AHCI because it changes
> the driver attachment
On 27-Aug-08, at 7:21 PM, Ian Collins wrote:
> Miles Nordin writes:
>
>>
>> In addition, I'm repeating myself like crazy at this point, but ZFS
>> tools used for all pools like 'zpool status' need to not freeze
>> when a
>> single pool, or single device within a pool, is unavailable or slow,
>>
> "t" == Tim <[EMAIL PROTECTED]> writes:
t> Solaris does not do this.
yeah but the locators for local disks are still based on
pci/controller/channel not devid, so the disk will move to a different
device name if he changes BIOS from pci-ide to AHCI because it changes
the driver attachment
Miles Nordin writes:
>
> In addition, I'm repeating myself like crazy at this point, but ZFS
> tools used for all pools like 'zpool status' need to not freeze when a
> single pool, or single device within a pool, is unavailable or slow,
> and this expectation has nothing to do with failmode.
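Until the tools themselves are fixed, a caller can at least bound how long it
waits on them. A sketch of that workaround only: it does not unhang anything,
and if the child is stuck in uninterruptible I/O even the kill after the
timeout may fail to reap it.

import subprocess

def pool_health(timeout_s=10):
    # `zpool status -x` prints only pools with problems; bound how long we wait on it.
    try:
        out = subprocess.run(["zpool", "status", "-x"],
                             capture_output=True, text=True, timeout=timeout_s)
        return out.stdout
    except subprocess.TimeoutExpired:
        return f"zpool status gave no answer within {timeout_s}s (hung pool or device?)"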
> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>> If you really mean there are devices out there which never
>> return error codes, and always silently return bad data, please
>> tell us which one and the story of when you encountered it,
re> I blogged about one such case.
On Wed, Aug 27, 2008 at 1:31 PM, Ross <[EMAIL PROTECTED]> wrote:
> Forgive me for being a bit wooly with this explanation (I've only recently
> moved over from Windows), but changing disk mode from IDE to SATA may well
> not work without a re-install, or at the very least messing around with boot
Forgive me for being a bit wooly with this explanation (I've only recently
moved over from Windows), but changing disk mode from IDE to SATA may well not
work without a re-install, or at the very least messing around with boot
settings. I've seen many systems which list SATA disks in front of IDE
Miles Nordin wrote:
>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>
>
> re> not all devices return error codes which indicate
> re> unrecoverable reads.
>
> What you mean is, ``devices sometimes return bad data instead of an
> error code.''
>
> If you really mean there are devices out there which never
> return error codes, and always silently return bad data, please
> tell us which one and the story of when you encountered it,
On Aug 27, 2008, at 11:17 AM, Richard Elling wrote:
In my pile of broken parts, I have devices
> which fail to indicate an unrecoverable read, yet do indeed suffer
> from forgetful media.
A long time ago, in a hw company long since dead and buried, I spent
some months trying to find an i
> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
re> not all devices return error codes which indicate
re> unrecoverable reads.
What you mean is, ``devices sometimes return bad data instead of an
error code.''
If you really mean there are devices out there which never return
error codes, and always silently return bad data, please tell us which
one and the story of when you encountered it,
Mattias Pantzare wrote:
> 2008/8/27 Richard Elling <[EMAIL PROTECTED]>:
>
> Either the drives should be loaded with special firmware that
> returns errors earlier, or the software LVM should read redundant data
> and collect the statistic if the drive is well outside its usual
>
>
>
> By the way: Is there a way to pull up a text-only interface from the login
> screen (or during the boot process?) without having to log in (or just sit
> there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be
> nice if I could see a bit more information during boot, or
2008/8/27 Richard Elling <[EMAIL PROTECTED]>:
>
Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency.
>>>
>>> ZF
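The ``well outside its usual response latency'' idea can be sketched as a
per-drive statistic. The EWMA and the 10x threshold below are my assumptions,
not anything ZFS or an existing volume manager implements.

class LatencyTracker:
    """Track one drive's typical read latency and flag suspicious outliers."""

    def __init__(self, alpha=0.05, slow_factor=10.0):
        self.alpha = alpha              # EWMA smoothing weight (assumed value)
        self.slow_factor = slow_factor  # "well outside usual" multiplier (assumed)
        self.ewma = None                # smoothed latency, seconds

    def observe(self, latency_s):
        # Returns True when this I/O was far slower than the drive's usual
        # behaviour -- the point at which a volume manager could stop waiting
        # and read the redundant copy instead.
        if self.ewma is None:
            self.ewma = latency_s
            return False
        slow = latency_s > self.slow_factor * self.ewma
        self.ewma += self.alpha * (latency_s - self.ewma)
        return slow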
Howdy Ron,
Right, right - I know I dropped the ball on that one. Sorry, I haven't been
able to log into OpenSolaris lately, and thus haven't been able to actually do
anything useful... (lol, not to rag on OpenSolaris or anything, but it can also
freeze just by logging in... See:
http://defect.
> Pulling cables only simulates pulling cables. If you
> are having difficulty with cables falling out, then this problem cannot
> be solved with software. It *must* be solved with hardware.
I don't think anyone is asking for software to fix cables that fall out...
they're asking for the OS to no
Mattias Pantzare wrote:
> 2008/8/26 Richard Elling <[EMAIL PROTECTED]>:
>
>>> Doing a good job with this error is mostly about not freezing
>>> the whole filesystem for the 30sec it takes the drive to report the
>>> error.
>>>
>> That is not a ZFS problem. Please file bugs in the appropriate category.
2008/8/26 Richard Elling <[EMAIL PROTECTED]>:
>
>> Doing a good job with this error is mostly about not freezing
>> the whole filesystem for the 30sec it takes the drive to report the
>> error.
>
> That is not a ZFS problem. Please file bugs in the appropriate category.
Whose problem is it? It ca
Miles Nordin wrote:
>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>
>
> re> unrecoverable read as the dominant disk failure mode. [...]
> re> none of the traditional software logical volume managers nor
> re> the popular open source file systems (other than ZFS :-)
> re> address this problem.
Todd, 3 days ago you were asked what mode the BIOS was using, AHCI or IDE
compatibility. Which is it? Did you change it? What was the result? A few other
posters suggested the same thing but the thread went off into left field and I
believe the question / suggestions got lost in the noise.
--ro
Carson Gaspar wrote:
> Richard Elling wrote:
>
>> No snake oil. Pulling cables only simulates pulling cables. If you
>> are having difficulty with cables falling out, then this problem cannot
>> be solved with software. It *must* be solved with hardware.
>>
>> But the main problem with "simul
Richard Elling wrote:
>
> No snake oil. Pulling cables only simulates pulling cables. If you
> are having difficulty with cables falling out, then this problem cannot
> be solved with software. It *must* be solved with hardware.
>
> But the main problem with "simulating disk failures by pulling
> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
re> unrecoverable read as the dominant disk failure mode. [...]
re> none of the traditional software logical volume managers nor
re> the popular open source file systems (other than ZFS :-)
re> address this problem.
Other LV
Miles Nordin wrote:
>> "jcm" == James C McPherson <[EMAIL PROTECTED]> writes:
>> "thp" == Todd H Poole <[EMAIL PROTECTED]> writes:
>> "mh" == Matt Harrison <[EMAIL PROTECTED]> writes:
>> "js" == John Sonnenschein <[EMAIL PROTECTED]> writes:
>> "re" == Richard Elling <[EMAIL PROT