Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk

2008-08-29 Thread Todd H. Poole
> Let's not be too quick to assign blame, or to think that perfecting > the behaviour is straightforward or even possible. > > Start introducing random $20 components and you begin to dilute the > quality and predictability of the composite system's behaviour. > > But this NEVER happens on linux *g

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk

2008-08-28 Thread Anton B. Rang
Many mid-range/high-end RAID controllers work by having a small timeout on individual disk I/O operations. If the disk doesn't respond quickly, they'll issue an I/O to the redundant disk(s) to get the data back to the host in a reasonable time. Often they'll change parameters on the disk to limi
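
The fallback pattern Anton describes -- a short per-I/O deadline, with the redundant copy read as soon as the primary disk is slow to answer -- is roughly the following. This is only an illustrative Python sketch: the timeout value, device arguments, and read_block() placeholder are invented for the example, and real controllers do this in firmware, not in host software.

import concurrent.futures

FAST_TIMEOUT = 0.5  # seconds; deliberately far shorter than a drive's internal retry loop

def read_block(device, lba):
    """Placeholder for a real block read (e.g. os.pread against a raw device)."""
    raise NotImplementedError

def redundant_read(primary, mirror, lba):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    future = pool.submit(read_block, primary, lba)
    try:
        # Give the primary disk a short window to answer.
        return future.result(timeout=FAST_TIMEOUT)
    except concurrent.futures.TimeoutError:
        # Primary is slow: satisfy the host from the redundant copy now.
        return read_block(mirror, lba)
    finally:
        # Don't block on the slow primary; let its I/O finish (or fail) on its own.
        pool.shutdown(wait=False)

The point of the pattern is the bound on host-visible latency, not data recovery; the slow drive can still be retried, re-parameterized, or flagged afterwards.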

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Miles Nordin
> "jl" == Jonathan Loran <[EMAIL PROTECTED]> writes: jl> Fe = 46% failures/month * 12 months = 5.52 failures the original statistic wasn't of this kind. It was ``likelihood a single drive will experience one or more failures within 12 months''. so, you could say, ``If I have a thousan

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Jonathan Loran
Miles Nordin wrote: > What is a ``failure rate for a time interval''? > > Failure rate => Failures/unit time Failure rate for a time interval => (Failures/unit time) * time For example, if we have a failure rate: Fr = 46% failures/month Then the expectation value of a failure in one year

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Miles Nordin
> "rm" == Robert Milkowski <[EMAIL PROTECTED]> writes: rm> Please look for slides 23-27 at rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf yeah, ok, ONCE AGAIN, I never said that checksums are worthless. relling: some drives don't return errors on unrecoverable read even

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Miles Nordin
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> There is no error in my math. I presented a failure rate for re> a time interval, What is a ``failure rate for a time interval''? AIUI, the failure rate for a time interval is 0.46% / yr, no matter how many drives you have.

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Richard Elling
Robert Milkowski wrote: > Hello Miles, > > Wednesday, August 27, 2008, 10:51:49 PM, you wrote: > > MN> It's not really enough for me, but what's more the case doesn't match > MN> what we were looking for: a device which ``never returns error codes, > MN> always returns silently bad data.'' I asked

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Robert Milkowski
Hello Miles, Wednesday, August 27, 2008, 10:51:49 PM, you wrote: MN> It's not really enough for me, but what's more the case doesn't match MN> what we were looking for: a device which ``never returns error codes, MN> always returns silently bad data.'' I asked for this because you said MN> ``How

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Richard Elling
Miles Nordin wrote: > re> Indeed. Intuitively, the AFR and population is more easily > re> grokked by the masses. > > It's nothing to do with masses. There's an error in your math. It's > not right under any circumstance. > There is no error in my math. I presented a failure rate fo

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Miles Nordin
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> I really don't know how to please you. dd from the raw device instead of through ZFS would be better. If you could show that you can write data to a sector, and read back different data, without getting an error, over and over, I'd

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Ian Collins
Richard Elling writes: > Ian Collins wrote: >> Richard Elling writes: >>> >>> I think the proposed timeouts here are too short, but the idea has >>> merit. Note that such a preemptive read will have negative performance >>> impacts for high-workload systems, so it will not be a given that peopl

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Richard Elling
Ian Collins wrote: > Richard Elling writes: >> >> I think the proposed timeouts here are too short, but the idea has >> merit. Note that such a preemptive read will have negative performance >> impacts for high-workload systems, so it will not be a given that people >> will want this enabled by de

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Ian Collins
Richard Elling writes: > > I think the proposed timeouts here are too short, but the idea has > merit. Note that such a preemptive read will have negative performance > impacts for high-workload systems, so it will not be a given that people > will want this enabled by default. Designing such a

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Richard Elling
Miles Nordin wrote: >> "re" == Richard Elling <[EMAIL PROTECTED]> writes: >> > > >> If you really mean there are devices out there which never > >> return error codes, and always silently return bad data, please > >> tell us which one and the story of when you encou

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Ian Collins
Toby Thain writes: > > On 27-Aug-08, at 7:21 PM, Ian Collins wrote: > >> Miles Nordin writes: >> >>> >>> In addition, I'm repeating myself like crazy at this point, but ZFS >>> tools used for all pools like 'zpool status' need to not freeze when a >>> single pool, or single device within a

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Miles Nordin
> "t" == Tim <[EMAIL PROTECTED]> writes: t> Except he was, and is referring to a non-root disk. wait, what? his root disk isn't plugged into the pci-ide controller? t> LVM hardly changes the way devices move around in Linux, fine, be pedantic. It makes systems boot and mount al

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Tim
On Wed, Aug 27, 2008 at 5:39 PM, Toby Thain <[EMAIL PROTECTED]>wrote: > > > Let's not be too quick to assign blame, or to think that perfecting > the behaviour is straightforward or even possible. > > Traditionally, systems bearing 'enterprisey' expectations were/are > integrated hardware and soft

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Bob Friesenhahn
On Wed, 27 Aug 2008, Miles Nordin wrote: > > In some sense the disk drivers and ZFS have different goals. The goal > of drivers should be to keep marginal disk/cabling/... subsystems > online as aggressively as possible, while the goal of ZFS should be to > notice and work around slightly-failing

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Tim
On Wed, Aug 27, 2008 at 5:33 PM, Miles Nordin <[EMAIL PROTECTED]> wrote: > > "t" == Tim <[EMAIL PROTECTED]> writes: > > t> Solaris does not do this. > > yeah but the locators for local disks are still based on > pci/controller/channel not devid, so the disk will move to a different > devi

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Toby Thain
On 27-Aug-08, at 7:21 PM, Ian Collins wrote: > Miles Nordin writes: > >> >> In addition, I'm repeating myself like crazy at this point, but ZFS >> tools used for all pools like 'zpool status' need to not freeze >> when a >> single pool, or single device within a pool, is unavailable or slow, >>

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Miles Nordin
> "t" == Tim <[EMAIL PROTECTED]> writes: t> Solaris does not do this. yeah but the locators for local disks are still based on pci/controller/channel not devid, so the disk will move to a different device name if he changes BIOS from pci-ide to AHCI because it changes the driver attachm

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Ian Collins
Miles Nordin writes: > > In addition, I'm repeating myself like crazy at this point, but ZFS > tools used for all pools like 'zpool status' need to not freeze when a > single pool, or single device within a pool, is unavailable or slow, > and this expectation is having nothing to do with failmod

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Miles Nordin
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: >> If you really mean there are devices out there which never >> return error codes, and always silently return bad data, please >> tell us which one and the story of when you encountered it, re> I blogged about one such cas

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Tim
On Wed, Aug 27, 2008 at 1:31 PM, Ross <[EMAIL PROTECTED]> wrote: > Forgive me for being a bit wooly with this explanation (I've only recently > moved over from Windows), but changing disk mode from IDE to SATA may well > not work without a re-install, or at the very least messing around with boot

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Ross
Forgive me for being a bit wooly with this explanation (I've only recently moved over from Windows), but changing disk mode from IDE to SATA may well not work without a re-install, or at the very least messing around with boot settings. I've seen many systems which list SATA disks in front of I

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Richard Elling
Miles Nordin wrote: >> "re" == Richard Elling <[EMAIL PROTECTED]> writes: >> > > re> not all devices return error codes which indicate > re> unrecoverable reads. > > What you mean is, ``devices sometimes return bad data instead of an > error code.'' > > If you really me

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Keith Bierman
On Aug 27, 2008, at 11:17 AM, Richard Elling wrote: > In my pile of broken parts, I have devices > which fail to indicate an unrecoverable read, yet do indeed suffer > from forgetful media. A long time ago, in a hw company long since dead and buried, I spent some months trying to find an i

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Miles Nordin
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> not all devices return error codes which indicate re> unrecoverable reads. What you mean is, ``devices sometimes return bad data instead of an error code.'' If you really mean there are devices out there which never return erro

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Richard Elling
Mattias Pantzare wrote: > 2008/8/27 Richard Elling <[EMAIL PROTECTED]>: > > Either the drives should be loaded with special firmware that > returns errors earlier, or the software LVM should read redundant data > and collect the statistic if the drive is well outside its usual >
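
The second half of the quoted suggestion -- collect the statistic and act "if the drive is well outside its usual response latency" -- amounts to keeping a running latency baseline per drive and flagging outliers. A minimal Python sketch follows, with an invented class name and threshold; this is not how ZFS, FMA, or any shipping LVM actually implements it.

class LatencyTracker:
    """Track one drive's I/O latency and flag responses far outside its norm."""

    def __init__(self, outlier_factor=10.0, warmup=100):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                  # running sum of squared deviations (Welford's method)
        self.outlier_factor = outlier_factor
        self.warmup = warmup

    def record(self, latency):
        """Feed one completed I/O's latency in seconds; return True if it looks suspicious."""
        suspicious = self._is_outlier(latency)
        self.n += 1
        delta = latency - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (latency - self.mean)
        return suspicious

    def _is_outlier(self, latency):
        if self.n < self.warmup:       # need a baseline before passing judgement
            return False
        stddev = (self.m2 / (self.n - 1)) ** 0.5
        return latency > self.mean + self.outlier_factor * max(stddev, 1e-6)

A caller (the software LVM, in the quoted proposal) would then issue the read against redundant data whenever record() returns True, rather than waiting out the drive's own retry loop.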

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Tim
> > > By the way: Is there a way to pull up a text-only interface from the log in > screen (or during the boot process?) without having to log in (or just sit > there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be > nice if I could see a bit more information during boot, or

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Mattias Pantzare
2008/8/27 Richard Elling <[EMAIL PROTECTED]>: > Either the drives should be loaded with special firmware that returns errors earlier, or the software LVM should read redundant data and collect the statistic if the drive is well outside its usual response latency. >>> >>> ZF

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Todd H. Poole
Howdy Ron, Right, right - I know I dropped the ball on that one. Sorry, I haven't been able to log into OpenSolaris lately, and thus haven't been able to actually do anything useful... (lol, not to rag on OpenSolaris or anything, but it can also freeze just by logging in... See: http://defect.

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread MC
> Pulling cables only simulates pulling cables. If you > are having difficulty with cables falling out, then this problem cannot > be solved with software. It *must* be solved with hardware. I don't think anyone is asking for software to fix cables that fall out... they're asking for the OS to no

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Richard Elling
Mattias Pantzare wrote: > 2008/8/26 Richard Elling <[EMAIL PROTECTED]>: > >>> Doing a good job with this error is mostly about not freezing >>> the whole filesystem for the 30sec it takes the drive to report the >>> error. >>> >> That is not a ZFS problem. Please file bugs in the appropr

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Mattias Pantzare
2008/8/26 Richard Elling <[EMAIL PROTECTED]>: > >> Doing a good job with this error is mostly about not freezing >> the whole filesystem for the 30sec it takes the drive to report the >> error. > > That is not a ZFS problem. Please file bugs in the appropriate category. Whose problem is it? It ca

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Richard Elling
Miles Nordin wrote: >> "re" == Richard Elling <[EMAIL PROTECTED]> writes: >> > > re> unrecoverable read as the dominant disk failure mode. [...] > re> none of the traditional software logical volume managers nor > re> the popular open source file systems (other than

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Ron Halstead
Todd, 3 days ago you were asked what mode the BIOS was using, AHCI or IDE compatibility. Which is it? Did you change it? What was the result? A few other posters suggested the same thing but the thread went off into left field and I believe the question / suggestions got lost in the noise. --ro

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Richard Elling
Carson Gaspar wrote: > Richard Elling wrote: > >> No snake oil. Pulling cables only simulates pulling cables. If you >> are having difficulty with cables falling out, then this problem cannot >> be solved with software. It *must* be solved with hardware. >> >> But the main problem with "simul

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Carson Gaspar
Richard Elling wrote: > > No snake oil. Pulling cables only simulates pulling cables. If you > are having difficulty with cables falling out, then this problem cannot > be solved with software. It *must* be solved with hardware. > > But the main problem with "simulating disk failures by pulling

Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Miles Nordin
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> unrecoverable read as the dominant disk failure mode. [...] re> none of the traditional software logical volume managers nor re> the popular open source file systems (other than ZFS :-) re> address this problem. Other LV

[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Richard Elling
Miles Nordin wrote: >> "jcm" == James C McPherson <[EMAIL PROTECTED]> writes: >> "thp" == Todd H Poole <[EMAIL PROTECTED]> writes: >> "mh" == Matt Harrison <[EMAIL PROTECTED]> writes: >> "js" == John Sonnenschein <[EMAIL PROTECTED]> writes: >> "re" == Richard Elling <[EMAIL PROT