Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Miles Nordin
> "r" == Ross <[EMAIL PROTECTED]> writes:

rs> I don't think it likes it if the iscsi targets aren't
rs> available during boot.

from my cheatsheet:

-8<-
ok boot -m milestone=none
[boots. enter root password for maintenance.]
bash-3.00# /sbin/mount -o remount,rw / [<-- otherw

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross Smith
Yeah, thanks Maurice, I just saw that one this afternoon. I guess you can't reboot with iscsi full stop... o_0 And I've seen the iscsi bug before (I was just too lazy to look it up lol), I've been complaining about that since February. In fact it's been a bad week for iscsi here, I've managed to

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Maurice Volaski
>2. With iscsi, you can't reboot with sendtargets enabled, static >discovery still seems to be the order of the day. I'm seeing this problem with static discovery: http://bugs.opensolaris.org/view_bug.do?bug_id=6775008. >4. iSCSI still has a 3 minute timeout, during which time your pool >wil

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross
Ok, I've done some more testing today and I almost don't know where to start. I'll begin with the good news for Miles :) - Rebooting doesn't appear to cause ZFS to lose the resilver status (but see 1. below) - Resilvering appears to work fine, once complete I never saw any checksum errors when

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Toby Thain
On 2-Dec-08, at 3:35 PM, Miles Nordin wrote: >> "r" == Ross <[EMAIL PROTECTED]> writes: > > r> style before I got half way through your post :) [...status > r> problems...] could be a case of oversimplifying things. > ... > And yes, this is a religious argument. Just because it sp

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Miles Nordin
> "r" == Ross <[EMAIL PROTECTED]> writes: r> style before I got half way through your post :) [...status r> problems...] could be a case of oversimplifying things. yeah I was a bit inappropriate, but my frustration comes from the (partly paranoid) imagining of how the idea ``we nee

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross
Hi Miles, It's probably a bad sign that although that post came through as anonymous in my e-mail, I recognised your style before I got half way through your post :) I agree, the zpool status being out of date is weird, I'll dig out the bug number for that at some point as I'm sure I've mention

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Miles Nordin
> "rs" == Ross Smith <[EMAIL PROTECTED]> writes: rs> 4. zpool status still reports out of date information. I know people are going to skim this message and not hear this. They'll say ``well of course zpool status says ONLINE while the pool is hung. ZFS is patiently waiting. It doesn't

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hi Richard, Thanks, I'll give that a try. I think I just had a kernel dump while trying to boot this system back up though; I don't think it likes it if the iscsi targets aren't available during boot. Again, that rings a bell, so I'll go see if that's another known bug. Changing that setting on

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross
Incidentally, while I've reported this again as an RFE, I still haven't seen a CR number for it. Could somebody from Sun check whether it's been filed, please? thanks, Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discu

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hey folks, I've just followed up on this, testing iSCSI with a raided pool, and it still appears to be struggling when a device goes offline. >>> I don't see how this could work except for mirrored pools. Would that >>> carry enough market to be worthwhile? >>> -- richard >>> >> >> I have to adm

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-28 Thread Richard Elling
Ross Smith wrote: > On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <[EMAIL PROTECTED]> wrote: > >> Ross wrote: >> >>> Well, you're not alone in wanting to use ZFS and iSCSI like that, and in >>> fact my change request suggested that this is exactly one of the things that >>> could be addre

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross Smith
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <[EMAIL PROTECTED]> wrote: > Ross wrote: >> >> Well, you're not alone in wanting to use ZFS and iSCSI like that, and in >> fact my change request suggested that this is exactly one of the things that >> could be addressed: >> >> "The idea is really a

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Richard Elling
Ross wrote: > Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact > my change request suggested that this is exactly one of the things that could > be addressed: > > "The idea is really a two stage RFE, since just the first part would have > benefits. The key is to imp

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Bernard Dugas
> Well, you're not alone in wanting to use ZFS and > iSCSI like that, and in fact my change request > suggested that this is exactly one of the things that > could be addressed: Thank you ! Yes, this was also to tell you that you are not alone :-) I agree completely with you on your technical poi

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: "The idea is really a two stage RFE, since just the first part would have benefits. The key is to improve ZFS availability,

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Bernard Dugas
Hello, Thank you for this very interesting thread ! I want to confirm that Synchronous Distributed Storage is main goal when using ZFS ! The target architecture is 1 local drive, and 2 (or more) remote iSCSI targets, with ZFS being the iSCSI initiator. System is designed/cut so that local dis

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Thanks James, I've e-mailed Alan and submitted this one again.

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread James C. McPherson
On Thu, 27 Nov 2008 04:33:54 -0800 (PST) Ross <[EMAIL PROTECTED]> wrote: > Hmm... I logged this CR ages ago, but now I've come to find it in > the bug tracker I can't see it anywhere. > > I actually logged three CR's back to back, the first appears to have > been created ok, but two have just di

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Hmm... I logged this CR ages ago, but now I've come to find it in the bug tracker I can't see it anywhere. I actually logged three CR's back to back, the first appears to have been created ok, but two have just disappeared. The one I created ok is: http://bugs.opensolaris.org/view_bug.do?bug

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-06 Thread Richard Elling
Ross wrote: > Hey folks, > > Well, there haven't been any more comments knocking holes in this idea, so > I'm wondering now if I should log this as an RFE? > go for it! > Is this something others would find useful? > Yes. But remember that this has a very limited scope. Basically it

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-05 Thread Ross
Hey folks, Well, there haven't been any more comments knocking holes in this idea, so I'm wondering now if I should log this as an RFE? Is this something others would find useful? Ross

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-02 Thread Ross Smith
Thinking about it, we could make use of this too. The ability to add a remote iSCSI mirror to any pool without sacrificing local performance could be a huge benefit. > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org > Subject: Re: Availabilit

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Richard Elling
Ross Smith wrote: > Triple mirroring you say? That'd be me then :D > > The reason I really want to get ZFS timeouts sorted is that our long > term goal is to mirror that over two servers too, giving us a pool > mirrored across two servers, each of which is actually a zfs iscsi > volume hosted o

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Johan Hartzenberg
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins <[EMAIL PROTECTED]> wrote: > Miles Nordin writes: > > > suggested that unlike the SVM feature it should be automatic, because > > by so being it becomes useful as an availability tool rather than just > > performance optimisation. > > > So on a server

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ian Collins
Miles Nordin writes: >> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes: > > bf> You are saying that I can't split my mirrors between a local > bf> disk in Dallas and a remote disk in New York accessed via > bf> iSCSI? > > nope, you've misread. I'm saying reads should go to

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ian Collins
Eric Schrock writes: > > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to both a) prefer one drive over
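The latency-tracking idea quoted above (prefer one mirror child over another based on observed read latency) can be sketched as follows. This is a toy model, not ZFS code: the names `MirrorChild` and `pick_child`, the EWMA smoothing factor, and the sample values are all illustrative assumptions.

```python
class MirrorChild:
    """Tracks a smoothed (EWMA) read latency for one side of a mirror."""
    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha        # smoothing factor for the moving average
        self.ewma_ms = None       # no samples recorded yet

    def record(self, latency_ms):
        # Exponentially weighted moving average of observed read latency.
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

def pick_child(children):
    """Prefer the child with the lowest smoothed latency; children with
    no samples yet are tried first so every side gets measured."""
    unsampled = [c for c in children if c.ewma_ms is None]
    if unsampled:
        return unsampled[0]
    return min(children, key=lambda c: c.ewma_ms)

local = MirrorChild("local-disk")
remote = MirrorChild("iscsi-remote")
for ms in (1.0, 1.2, 0.9):
    local.record(ms)
for ms in (40.0, 55.0, 48.0):
    remote.record(ms)
print(pick_child([local, remote]).name)   # the faster, local side wins reads
```

Note this only steers reads; as the thread discusses, it says nothing about when a slow side should be declared faulted, which is a separate (FMA) decision.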

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross Smith
lt, with the optional setting being to allow the pool to continue accepting writes while the pool is in a non redundant state. Ross > Date: Sat, 30 Aug 2008 10:59:19 -0500 > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > CC: zfs-discuss@opensolaris.org > Subject: Re: [zfs-disc

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Bob Friesenhahn
On Sat, 30 Aug 2008, Ross wrote: > while the problem is diagnosed. - With that said, could the write > timeout default to on when you have a slog device? After all, the > data is safely committed to the slog, and should remain there until > it's written to all devices. Bob, you seemed the most

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross
Wow, some great comments on here now, even a few people agreeing with me which is nice :D I'll happily admit I don't have the in depth understanding of storage many of you guys have, but since the idea doesn't seem pie-in-the-sky crazy, I'm going to try to write up all my current thoughts on ho

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Richard Elling
Miles Nordin wrote: >> "re" == Richard Elling <[EMAIL PROTECTED]> writes: >> > > re> if you use Ethernet switches in the interconnect, you need to > re> disable STP on the ports used for interconnects or risk > re> unnecessary cluster reconfigurations. > > RSTP/802.

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Miles Nordin
> "re" == Richard Elling <[EMAIL PROTECTED]> writes: re> if you use Ethernet switches in the interconnect, you need to re> disable STP on the ports used for interconnects or risk re> unnecessary cluster reconfigurations. RSTP/802.1w plus setting the ports connected to Solaris as `

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Bob Friesenhahn
On Fri, 29 Aug 2008, Miles Nordin wrote: > > I guess I'm changing my story slightly. I *would* want ZFS to collect > drive performance statistics and report them to FMA, but I wouldn't Your email *totally* blew my limited buffer size, but this little bit remained for me to look at. It left me w

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Miles Nordin
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> The main problem with exposing tunables like this is that they es> have a direct correlation to service actions, and es> mis-diagnosing failures costs everybody (admin, companies, es> Sun, etc) lots of time and money. Once

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Richard Elling
Nicolas Williams wrote: > On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote: > >> Which of these do you prefer? >> >>o System waits substantial time for devices to (possibly) recover in >> order to ensure that subsequently written data has the least >> chance of being

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Nicolas Williams
On Thu, Aug 28, 2008 at 01:05:54PM -0700, Eric Schrock wrote: > As others have mentioned, things get more difficult with writes. If I > issue a write to both halves of a mirror, should I return when the first > one completes, or when both complete? One possibility is to expose this > as a tunable

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Nicolas Williams
On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote: > Which of these do you prefer? > >o System waits substantial time for devices to (possibly) recover in > order to ensure that subsequently written data has the least > chance of being lost. > >o System immediately
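The two options Bob poses (wait for devices to recover, or continue writing without full redundancy) amount to a per-pool write-acknowledgement policy. The sketch below is purely hypothetical; the policy names and the `ack_write` function are invented to illustrate the trade-off, not proposed ZFS syntax.

```python
def ack_write(side_ok, policy="wait"):
    """side_ok: list of booleans, one per mirror side, True if the write
    reached that side before the timeout.
    "wait": don't acknowledge until every side has the data.
    "continue": acknowledge as soon as one side does, accepting a window
    of reduced redundancy."""
    if not any(side_ok):
        return "error"                     # no side accepted the write at all
    if policy == "wait":
        return "ok" if all(side_ok) else "blocked"
    if policy == "continue":
        return "ok" if all(side_ok) else "ok-degraded"
    raise ValueError(policy)

print(ack_write([True, False], policy="wait"))      # blocked
print(ack_write([True, False], policy="continue"))  # ok-degraded
```

The "continue" branch is essentially what the thread asks for: applications keep running while the pool is non-redundant, and the admin accepts the extra risk explicitly.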

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Richard Elling
Bill Sommerfeld wrote: > On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > >> A better option would be to not use this to perform FMA diagnosis, but >> instead work into the mirror child selection code. This has already >> been alluded to before, but it would be cool to keep track of lat

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bill Sommerfeld
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote: > None of the decisions I described its making based on performance > statistics are ``haywire''---I said it should funnel reads to the > faster side of the mirror, and do this really quickly and > unconservatively. What's your issue with that? From what

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote: > > Personally, if a SATA disk wasn't responding to any requests after 2 > seconds I really don't care if an error has been detected, as far as > I'm concerned that disk is faulty. Unless you have power management enabled, or there's a b

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes: bf> If the system or device is simply overwelmed with work, then bf> you would not want the system to go haywire and make the bf> problems much worse. None of the decisions I described its making based on performance statistics

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> I don't think you understand how this works. Imagine two es> I/Os, just with different sd timeouts and retry logic - that's es> B_FAILFAST. It's quite simple, and independent of any es> hardware implementation. AIUI the
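The B_FAILFAST picture Eric describes, as I read it, is the same logical I/O issued two ways: first with a short timeout and no retries, then, only if that fails, through the normal patient timeout-and-retry path. A toy model, with invented timeout values and a fake device interface:

```python
def issue(device, timeout_s, retries):
    """Issue one I/O: device(timeout_s) -> True on success, False on timeout."""
    for _ in range(retries + 1):
        if device(timeout_s):
            return True
    return False

def read_block(device):
    # Failfast attempt: short timeout, no retries.
    if issue(device, timeout_s=1, retries=0):
        return "fast"
    # Fall back to the normal sd path: long timeout, several retries.
    if issue(device, timeout_s=60, retries=3):
        return "slow"
    return "failed"

healthy = lambda t: True          # answers any request promptly
sluggish = lambda t: t >= 60      # only answers the patient I/O
dead = lambda t: False            # never answers

print(read_block(healthy), read_block(sluggish), read_block(dead))
```

The point of the two-pass shape is that a healthy device never pays the long timeout, while a sluggish one still gets its data read rather than being declared dead outright.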

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross Smith
but feel it should have that same approach to management of its drives. However, that said, I'll be more than willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out. Ross > Date: Thu, 28 Aug 2008 11:29:21 -0500 > From: [EMAIL

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote: > > you're right in terms of fixed timeouts, but there's no reason it > can't compare the performance of redundant data sources, and if one > vdev performs an order of magnitude slower than another set of vdevs > with sufficient redundancy, stop issuing read
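Miles's "order of magnitude slower" heuristic could be sketched as a read-rotation filter: keep measuring every redundant source, but stop sending reads to a vdev once its latency exceeds some multiple of the best peer. The factor, device names, and function are assumptions for illustration only.

```python
def readable_vdevs(latencies_ms, factor=10.0):
    """latencies_ms: dict of vdev name -> recent average read latency (ms).
    Returns, sorted, the vdevs still worth sending reads to: anything more
    than `factor` times slower than the best peer is quarantined."""
    best = min(latencies_ms.values())
    return sorted(name for name, ms in latencies_ms.items()
                  if ms <= factor * best)

pool = {"c0t0d0": 2.0, "c0t1d0": 2.4, "iscsi0": 180.0}
print(readable_vdevs(pool))   # iscsi0 is quarantined from reads
```

Crucially, and this matches Eric's objection, this only affects read scheduling; it never marks the slow vdev FAULTED, so a mis-measurement costs performance, not a service action.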

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote: > > you're right in terms of fixed timeouts, but there's no reason it > can't compare the performance of redundant data sources, and if one > vdev performs an order of magnitude slower than another set of vdevs > with sufficient redunda

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
> "es" == Eric Schrock <[EMAIL PROTECTED]> writes: es> Finally, imposing additional timeouts in ZFS is a bad idea. es> [...] As such, it doesn't have the necessary context to know es> what constitutes a reasonable timeout. you're right in terms of fixed timeouts, but there's no re

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
Ross, thanks for the feedback. A couple points here - A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older nevada builds that didn't have these fixes. Anything from

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Ross wrote: > > I believe ZFS should apply the same tough standards to pool > availability as it does to data integrity. A bad checksum makes ZFS > read the data from elsewhere, why shouldn't a timeout do the same > thing? A problem is that for some devices, a five minute

[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion. I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity, what it can't do is guarantee data availability. The problem is, t
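The analogy in this opening post (a bad checksum already sends ZFS to a redundant copy, so a timeout could trigger the same fallback) can be sketched like this. The replica model, the `responsive` flag standing in for a timed-out device, and the function name are all invented for illustration:

```python
import hashlib

def read_with_fallback(replicas, expected_sha):
    """replicas: list of (data, responsive) pairs. A replica is skipped
    if it timed out (responsive is False) or if its checksum mismatches;
    either way we fall through to the next redundant copy."""
    for data, responsive in replicas:
        if not responsive:
            continue                  # timeout: treat like bad data, move on
        if hashlib.sha256(data).hexdigest() == expected_sha:
            return data               # checksum verified, done
    raise IOError("no good replica")

good = b"payload"
digest = hashlib.sha256(good).hexdigest()
# First copy hung, second is corrupt, third is good:
replicas = [(good, False), (b"corrupt", True), (good, True)]
print(read_with_fallback(replicas, digest))
```

The open question the thread then wrestles with is not this fallback itself but who decides the timeout: ZFS has redundancy information, while the driver stack has the hardware context.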