Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-16 Thread Edward Ned Harvey
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> > raidzN takes a really long time to resilver (code written inefficiently,
> > it's a known problem.)  If you had a huge raidz3, it would literally never
> > finish, because it couldn't resilver as fast as new data appears.  A week
> 
> In what way is the code written inefficiently?

Here is a link to one message from the middle of a really long thread.  The
thread touched on a lot of things, so it's difficult to read it now and see
what it all boils down to or which parts are relevant to the present
discussion.  Relevant comments below...
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html

The conclusion of the referenced thread:

The raidzN resilver code is inefficient, especially when there are many
disks in the vdev, because...

1. It processes one slab at a time.  That's very important: each disk spends
a lot of time idle, waiting for the other disks to fetch their pieces, so
there is an obvious opportunity to prefetch data on the idle disks, and that
is not happening.

2. Each slab is spread across many disks, so assembling it means waiting for
the slowest of those seeks.  The average wait therefore approaches the
maximum seek time of a single disk, which is roughly 2x the average seek
time.  (A quick simulation below illustrates the trend.)

2a. The more disks in the vdev, the smaller the piece of data that gets
written to each individual disk.  So you are waiting close to the maximum
seek time just to fetch a slab fragment which is tiny ...
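
For intuition, here is a minimal sketch of that effect.  It assumes seek
times are independent and uniformly distributed between 0 and 10 ms, which
real drives only approximate, so treat the numbers as illustrative:

# Rough illustration (not ZFS code): how long you wait for the slowest of
# N disks to seek, versus the average seek of a single disk.
import random

MAX_SEEK_MS = 10.0
TRIALS = 100_000

for disks in (2, 5, 11, 23):
    # A slab read completes only when the slowest disk has finished seeking.
    waits = [max(random.uniform(0, MAX_SEEK_MS) for _ in range(disks))
             for _ in range(TRIALS)]
    print(f"{disks:2d} disks: mean wait {sum(waits)/TRIALS:.2f} ms "
          f"(single-disk average seek is {MAX_SEEK_MS/2:.1f} ms)")

With more disks per vdev, the mean wait creeps toward the 10 ms worst case
instead of the 5 ms single-disk average.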

3. The order of slab fetching is determined by creation time, not by disk
layout.  This is a huge setback.  It makes each fetch essentially a random
seek, which costs close to the maximum seek time, instead of a sequential
read, which costs close to zero seek time.  Seek-bound IOPS scale as one
over the seek time, so if you could drive the seek time toward zero, seeking
would stop being the limit and you'd start paying attention to some other
limiting factor.
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42017.html
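
To put rough numbers on the random-versus-sequential gap (the drive figures
below are assumptions typical of a 7200-rpm disk, not measurements from this
thread):

# Back-of-envelope comparison of resilver read rates: random, seek-bound
# reads of small slab fragments vs. sequential streaming.  All inputs are
# assumed, illustrative figures.
AVG_SEEK_MS = 8.5        # assumed average seek + rotational latency
SEQ_MB_PER_S = 100.0     # assumed sustained sequential throughput
FRAGMENT_KB = 16         # assumed size of one slab fragment on one disk

random_iops = 1000.0 / AVG_SEEK_MS
random_mb_per_s = random_iops * FRAGMENT_KB / 1024.0

print(f"random reads:     {random_iops:.0f} IOPS -> {random_mb_per_s:.1f} MB/s")
print(f"sequential reads: {SEQ_MB_PER_S:.0f} MB/s "
      f"(~{SEQ_MB_PER_S / random_mb_per_s:.0f}x faster for the same data)")

With those inputs the sequential case moves the same data roughly 50x
faster, which is why the fetch order matters so much.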

4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
they're trying to resilver at the same time.  Does the system ignore
subsequently failed disks and concentrate on restoring a single disk
quickly?  Or does the system try to resilver them all simultaneously and
therefore double or triple the time before any one disk is fully resilvered?

5. If all your files reside in one big raidz3, a little piece of *every*
slab in the pool must be on each disk.  We concluded above that each seek
approaches the maximum seek time, and now we're also concluding you must do
the maximum possible number of seeks.  If instead you break your big raidz3
vdev into 3 raidz1 vdevs, each raidz1 vdev holds approx 33% as many slab
pieces.  If you need to resilver a disk, even though you're resilvering
approximately the same number of bytes per disk as you would have in the
raidz3, in the raidz1 you've cut the number of seeks down to 33%, and you've
reduced the time each of those seeks takes.  Better still, compare a 23-disk
raidz3 (capacity of 20 disks) against 20 mirrors, and resilver one disk.
The mirror requires only 5% as many seeks, and each seek goes about twice as
fast, so the mirror resilvers roughly 40x faster (the arithmetic is sketched
below).  Also, if anybody is actually using the pool during that time, only
5% of the user operations will cause a seek on the resilvering mirror disk,
while 100% of the user operations will hurt the raidz3 resilver.
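
Here is that arithmetic as a minimal sketch.  The slab count and seek times
are assumed values chosen only to expose the ratio, not figures from any
real pool:

# Back-of-envelope resilver comparison: 23-disk raidz3 vs. 20 two-way
# mirrors, both with the usable capacity of 20 disks.  Assumed inputs.
SLABS_IN_POOL = 1_000_000    # assumed number of slabs (blocks) in the pool
RAIDZ3_SEEK_MS = 10.0        # near worst case: wait for the slowest of 22 disks
MIRROR_SEEK_MS = 5.0         # roughly the average seek of a single disk

# raidz3: every surviving disk holds a piece of every slab -> one seek per slab.
raidz3_hours = SLABS_IN_POOL * RAIDZ3_SEEK_MS / 1000 / 3600

# mirror: the resilvering disk holds only ~1/20 of the pool's slabs.
mirror_hours = (SLABS_IN_POOL / 20) * MIRROR_SEEK_MS / 1000 / 3600

print(f"raidz3 resilver: {raidz3_hours:.1f} h")
print(f"mirror resilver: {mirror_hours:.2f} h "
      f"(~{raidz3_hours / mirror_hours:.0f}x faster)")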

6. Please see the following calculation of the probability of failure for 20
mirrors vs. a 23-disk raidz3.  According to my calculations, the probability
of a 4-disk failure in the raidz3 is approx 4.4E-4 and the probability of
both disks in the same mirror failing is approx 5E-5.  So the chance of
either pool failing is very small, but the raidz3 is approx 10x more likely
to suffer pool failure than the mirror setup.  Granted, there is some linear
estimation which is not entirely accurate, but I think the calculation comes
within an order of magnitude of being correct.  The mirror setup is roughly
75% more hardware (40 disks vs. 23), about 10x more reliable, and much
faster than the raidz3 setup, for the same usable capacity.
http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf

...

Compare a 21-disk raidz3 against 3 vdevs of 7-disk raidz1.  You get more
than 3x faster resilver time with the smaller vdevs, while the raidz3 only
gives you 3x the redundancy.  That means the probability of 4 simultaneously
failed disks in the raidz3 ends up higher than the probability of 2 failed
disks in a single raidz1 vdev.



Re: [zfs-discuss] Supermicro AOC-USAS2-L8i

2010-10-16 Thread Alexander Lesle
Hello all,

now I have ordered this controller card from LSI:
http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html

It has the same controller chip onboard as the Supermicro card.
The card plugs into a PCI Express 2.0 x8 slot and the bracket fits normal
cases.
But it's _not_ MegaRAID.  MegaRAID is not necessary when using ZFS, and I
won't be using a hardware RAID system here. ;-)

-- 
Mit freundlichem Gruss
Regards
Alexander


Re: [zfs-discuss] Finding corrupted files

2010-10-16 Thread Richard Elling
On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:

> So, what would you suggest, if I wanted to create really big pools? Say in 
> the 100 TB range? That would be quite a number of single drives then, 
> especially when you want to go with zpool raid-1.

For 100 TB, the methods change dramatically.  You can't just reload 100 TB
from CD or tape.  When you get to this scale you need to be thinking about
raidz2+ *and* mirroring.

I will be exploring these issues of scale at the "Techniques for Managing Huge
Amounts of Data" tutorial at the USENIX LISA '10 Conference.
http://www.usenix.org/events/lisa10/training/
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16
ZFS and performance consulting
http://www.RichardElling.com


[zfs-discuss] resilver question

2010-10-16 Thread Roy Sigurd Karlsbakk
Hi all

I'm seeing some rather bad resilver times for a pool of WD Green drives (I 
know, bad drives, but leave that). Does resilver go through the whole pool or 
just the VDEV in question?

-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly.  It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin.  In most cases, adequate and
relevant synonyms exist in Norwegian.


Re: [zfs-discuss] Supermicro AOC-USAS2-L8i

2010-10-16 Thread Giovanni Tirloni
On Fri, Oct 15, 2010 at 5:18 PM, Maurice Volaski
<maurice.vola...@einstein.yu.edu> wrote:

>> The mpt_sas driver supports it. We've had LSI 2004 and 2008 controllers
>> hang for quite some time when used with SuperMicro chassis and Intel X25-E
>> SSDs (OSOL b134 and b147). It seems to be a firmware issue that isn't fixed
>> with the last update.
>
> Do you mean to include all the PCIe cards, not just the AOC-USAS2-L8i, and
> when it's directly connected and not through the backplane? Prior reports
> here seem to be implicating the card only when it was connected to the
> backplane.
I only tested the LSI 2004/2008 HBAs connected to the backplane (both 3Gb/s
and 6Gb/s).

The MegaRAID ELP, when connected to the same backplane, doesn't exhibit
that behavior.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com


Re: [zfs-discuss] Finding corrupted files

2010-10-16 Thread Pasi Kärkkäinen
On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
> 
> > So, what would you suggest, if I wanted to create really big pools? Say
> > in the 100 TB range? That would be quite a number of single drives then,
> > especially when you want to go with zpool raid-1.
> 
> For 100 TB, the methods change dramatically.  You can't just reload 100 TB
> from CD or tape.  When you get to this scale you need to be thinking about
> raidz2+ *and* mirroring.
> I will be exploring these issues of scale at the "Techniques for Managing
> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
> http://www.usenix.org/events/lisa10/training/

Hopefully your presentation will be available online after the event!

-- Pasi



Re: [zfs-discuss] resilver question

2010-10-16 Thread Roy Sigurd Karlsbakk
- Original Message -
> On 10/17/10 04:54 AM, Roy Sigurd Karlsbakk wrote:
> > Hi all
> >
> > I'm seeing some rather bad resilver times for a pool of WD Green
> > drives (I know, bad drives, but leave that). Does resilver go
> > through the whole pool or just the VDEV in question?
> >
> >
> The vdev only. All the data required to reconstruct a device in a vdev
> is stored on the other devices.

That's what I thought, but then

r...@urd:~# zpool status
  pool: dpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 8h46m, 2.31% done, 370h47m to go
config:

        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c8t2d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     0
            c8t6d0    ONLINE       0     0     0
            c8t7d0    ONLINE       0     0     0
          raidz2-2    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
            c9t1d0    ONLINE       0     0     0
            c9t2d0    ONLINE       0     0     0
            c9t3d0    ONLINE       0     0     0
            spare-4   ONLINE       0     0     0
              c9t4d0  ONLINE       0     0     0
              c9t7d0  ONLINE       0     0     0  43.5G resilvered
            c9t5d0    ONLINE       0     0     0
            c9t6d0    ONLINE       0     0     0
          raidz2-4    ONLINE       0     0     0
            c14t9d0   ONLINE       0     0     0
            c14t10d0  ONLINE       0     0     0
            c14t11d0  ONLINE       0     0     0
            c14t12d0  ONLINE       0     0     0
            c14t13d0  ONLINE       0     0     0
            c14t14d0  ONLINE       0     0     0
            c14t15d0  ONLINE       0     0     0
            c14t16d0  ONLINE       0     0     0
            c14t17d0  ONLINE       0     0     0
            c14t18d0  ONLINE       0     0     0
            c14t19d0  ONLINE       0     0     0
            c14t20d0  ONLINE       0     0     0
        logs
          mirror-3    ONLINE       0     0     0
            c10d1s0   ONLINE       0     0     0
            c11d0s0   ONLINE       0     0     0
        cache
          c10d1s1     ONLINE       0     0     0
          c11d0s1     ONLINE       0     0     0
        spares
          c9t7d0      INUSE     currently in use

errors: No known data errors


-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly.  It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin.  In most cases, adequate and
relevant synonyms exist in Norwegian.


Re: [zfs-discuss] resilver question

2010-10-16 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
> 
> > The vdev only. 

Right on.
Furthermore, as shown in the "zpool status," a 7-disk raidz2 is certainly a 
reasonable vdev configuration.


>  scrub: resilver in progress for 8h46m, 2.31% done, 370h47m to go

Ouch.
I'll just say this much:

During the resilver, be sure to disable autosnapshots and scrubs and "zfs 
sends."  Do everything you can to reduce workload on the system.

Would it help to delete old snapshots?  I'm not sure, but I think it probably 
would.

The time to resilver is determined by how many slabs (stripes, blocks -- I'm
not sure of the correct terminology) exist inside that vdev.  For each slab,
all 6 good disks seek and read their piece, the missing piece is
reconstructed from data and parity, and the result is written to the
resilvering disk.  Repeat for every slab in the vdev.
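
As a rough sanity check on why that gets so slow, here is a seek-bound,
back-of-envelope estimate.  The slab count and seek time are assumptions,
not figures taken from Roy's pool:

# Seek-bound resilver estimate for one raidz2 vdev (assumed inputs).
SLABS_IN_VDEV = 20_000_000   # assumed number of slabs to reconstruct
SEEK_MS = 15.0               # assumed effective per-slab latency on slow
                             # 5400-rpm drives (wait for the slowest disk)

hours = SLABS_IN_VDEV * SEEK_MS / 1000 / 3600
print(f"seek-bound estimate: ~{hours:.0f} hours")   # ~83 h with these inputs

# Cross-check of the projection in the zpool status above:
elapsed_min = 8 * 60 + 46                # "8h46m" elapsed
total_hours = elapsed_min / 0.0231 / 60  # at "2.31% done"
print(f"projected total: ~{total_hours:.0f} hours (about 16 days)")

Fewer slabs (for example, after destroying old snapshots) or faster seeks
shrink that first number, which is the point of the suggestions in this
message.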

I think if you destroy snaps, it will reduce the number of slabs that need to 
be processed.  In the future, consider using either (a) mirrors instead of 
raidzN, or (b) disks with higher spindle speeds and lower seek times.

If your HBA supports WriteBack, you might improve resilver speed by enabling
WB on the disk that is resilvering.  But you should consider that temporary,
and go back to WriteThrough after the resilver is complete.



Re: [zfs-discuss] Finding corrupted files

2010-10-16 Thread Richard Elling
On Oct 16, 2010, at 4:13 PM, Pasi Kärkkäinen wrote:
> On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
>> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
>> 
>> > So, what would you suggest, if I wanted to create really big pools? Say
>> > in the 100 TB range? That would be quite a number of single drives then,
>> > especially when you want to go with zpool raid-1.
>> 
>> For 100 TB, the methods change dramatically.  You can't just reload 100 TB
>> from CD or tape.  When you get to this scale you need to be thinking about
>> raidz2+ *and* mirroring.
>> I will be exploring these issues of scale at the "Techniques for Managing
>> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
>> http://www.usenix.org/events/lisa10/training/
> 
> Hopefully your presentation will be available online after the event!

Sure, though I would encourage everyone to attend :-)
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16, 2010
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] resilver question

2010-10-16 Thread Richard Elling
On Oct 16, 2010, at 8:54 AM, Roy Sigurd Karlsbakk wrote:
> Hi all
> 
> I'm seeing some rather bad resilver times for a pool of WD Green drives (I 
> know, bad drives, but leave that). Does resilver go through the whole pool or 
> just the VDEV in question?

Resilvers are done in time order.  The metadata is traversed starting with
the first txg and moving forward to the current txg.  The good news is that
only allocated data is resilvered.  The bad news for HDD fans is that HDDs
do not like random workloads.
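
To illustrate why txg order turns into a random workload on disk (this is a
toy sketch, not ZFS code):

# Toy illustration: blocks are visited in birth-txg order, but their on-disk
# locations (LBAs) are scattered, so the disk sees essentially random seeks.
import random

random.seed(1)
# Pretend each block has (birth_txg, lba); allocation over time scatters LBAs.
blocks = [(txg, random.randrange(1_000_000)) for txg in range(10)]

visit = sorted(blocks)              # resilver visits blocks by birth txg
lbas = [lba for _, lba in visit]
jumps = [abs(b - a) for a, b in zip(lbas, lbas[1:])]
print("LBA visit order:", lbas)
print("average jump between consecutive reads:", sum(jumps) // len(jumps))
# Visiting in LBA order instead would make every jump small, i.e. sequential.
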
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] adding new disks and setting up a raidz2

2010-10-16 Thread Derek G Nokes
I tried using format to format the drive and got the following:

Ready to format.  Formatting cannot be interrupted
and takes 5724 minutes (estimated). Continue? y
Beginning format. The current time is Sat Oct 16 23:58:17 2010

Formatting...
Format failed

Retry of formatting operation without any of the standard
mode selects and ignoring disk's Grown Defects list.  The
disk may be able to be reformatted this way if an earlier
formatting operation was interrupted by a power failure or
SCSI bus reset.  The Grown Defects list will be recreated
by format verification and surface analysis.

Retry format without mode selects and Grown Defects list? y
Formatting...
Illegal request during format
ASC: 0x24   ASCQ: 0x0
Illegal request during format
ASC: 0x24   ASCQ: 0x0
failed

Is there any way for me to determine from OpenSolaris whether the disk is
defective?  I tried rotating the disk to a different bay, and the problem
moved to the new bay (c0t5000C500268D0821d0p0).  When I try using format's
fdisk option I get an error as well (fdisk: Error in ioctl DKIOCSMBOOT on
/dev/rdsk/c0t5000C500268D0821d0p0).

I also noticed that the format command takes much, much longer with this
particular disk than with the other 7.


Re: [zfs-discuss] resilver question

2010-10-16 Thread Ian Collins

On 10/17/10 12:37 PM, Roy Sigurd Karlsbakk wrote:

> - Original Message -
>> On 10/17/10 04:54 AM, Roy Sigurd Karlsbakk wrote:
>>> Hi all
>>>
>>> I'm seeing some rather bad resilver times for a pool of WD Green
>>> drives (I know, bad drives, but leave that). Does resilver go
>>> through the whole pool or just the VDEV in question?
>>
>> The vdev only. All the data required to reconstruct a device in a vdev
>> is stored on the other devices.
>
> That's what I thought, but then
>
> r...@urd:~# zpool status
>   pool: dpool
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 8h46m, 2.31% done, 370h47m to go

I'm not sure what that's supposed to prove.  Run zpool iostat -v to see 
where the activity is.


--
Ian.
