Re: [zfs-discuss] Optimal raidz3 configuration
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
>
>> raidzN takes a really long time to resilver (code written inefficiently,
>> it's a known problem.) If you had a huge raidz3, it would literally never
>> finish, because it couldn't resilver as fast as new data appears. A week
>
> In what way is the code written inefficiently?

Here is a link to one message in the middle of a really long thread. The thread touched on a lot of things, so it's difficult to read it now and see what it all boils down to and which parts are relevant to the present discussion. Relevant comments below.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html

In conclusion of the referenced thread: the raidzN resilver code is inefficient, especially when there are a lot of disks in the vdev, because:

1. It processes one slab at a time. That's very important. Each disk spends a lot of idle time waiting for the next disk to fetch something, so there is an opportunity to start prefetching data on the idle disks, and that is not happening.

2. Each slab is spread across many disks, so fetching a slab means waiting for the slowest of those seeks, which approaches the worst-case seek time of a single disk -- roughly twice the average seek time.

2a. The more disks in the vdev, the smaller the piece of data that gets written to each individual disk. So you are waiting for close to the worst-case seek time in order to fetch a slab fragment that is tiny.

3. The order of slab fetching is determined by creation time, not by disk layout. This is a huge setback. It means each seek is essentially random, which yields near-worst-case positioning time, instead of sequential access, which approaches zero seek time. If you could cut seek time toward zero, seek time would stop being the bottleneck and you'd start paying attention to some other limiting factor.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42017.html

4. Guess what happens if you have 2 or 3 failed disks in your raidz3 and they're trying to resilver at the same time. Does the system ignore the subsequently failed disks and concentrate on restoring a single disk quickly? Or does it try to resilver them all simultaneously and therefore double or triple the time before any one disk is fully resilvered?

5. If all your files reside in one big raidz3, a little piece of *every* slab in the pool must be on each disk. We've concluded above that you are approaching the worst-case seek time, and now we're also concluding that you must do the maximum possible number of seeks. If instead you break your big raidz3 vdev into 3 raidz1 vdevs, each raidz1 vdev holds roughly 33% as many slab pieces. If you need to resilver a disk, even though you're resilvering approximately the same number of bytes per disk as you would have in the raidz3, in the raidz1 you've cut the number of seeks down to 33%, and you've reduced the time needed for each of those seeks.

Still better: compare a 23-disk raidz3 (capacity of 20 disks) against 20 mirrors, and resilver one disk. You only require 5% as many seeks, and each seek goes roughly twice as fast, so the mirror will resilver on the order of 40x faster. Also, if anybody is actually using the pool during that time, only 5% of the user operations will result in a seek on the resilvering mirror disk, while 100% of the user operations will slow down the raidz3 resilver.
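The "roughly twice the average seek" figure from points 2 and 2a can be illustrated with a minimal Monte Carlo sketch. This is a toy model, not ZFS code, and the uniform seek-time distribution is an assumption chosen only for illustration:

```python
import random

def expected_slowest_seek(n_disks, max_seek_ms=10.0, trials=100_000):
    """Toy model: each disk's seek time is uniform on [0, max_seek_ms].

    A wide raidzN read of one slab has to wait for the slowest of the
    n_disks seeks; a mirror read waits on a single disk.  Returns the
    average of that slowest seek over many trials.
    """
    total = 0.0
    for _ in range(trials):
        total += max(random.uniform(0.0, max_seek_ms) for _ in range(n_disks))
    return total / trials

if __name__ == "__main__":
    for n in (1, 2, 7, 23):
        print(f"{n:2d} disks: average wait {expected_slowest_seek(n):.2f} ms")
    # For uniform seeks the expected maximum of N draws is max_seek * N/(N+1),
    # so a wide vdev waits close to the full 10 ms -- roughly twice the 5 ms
    # single-disk average, as argued in points 2 and 2a above.
```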
6. Please see the following calculation of the probability of failure of 20 mirrors vs. a 23-disk raidz3. According to my calculations, the probability of a 4-disk failure in the raidz3 is approx 4.4E-4, and the probability of 2 disks in the same mirror failing is approx 5E-5. So the chance of either pool failing is very small, but the raidz3 is approx 10x more likely to suffer pool failure than the mirror setup. Granted, there is some linear estimation which is not entirely accurate, but I think the calculation comes within an order of magnitude of being correct. The mirror setup is roughly 74% more hardware (40 disks vs. 23), roughly 10x more reliable, and much faster than the raidz3 setup, with the same usable capacity.

http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf

Also compare the 21-disk raidz3 against 3 vdevs of 7-disk raidz1. You get more than 3x faster resilver time with the smaller vdevs, and you only get 3x the redundancy in the raidz3. That means the probability of 4 simultaneously failed disks in the raidz3 is higher than the probability of 2 failed disks in a single raidz1 vdev.
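For the shape of that reliability estimate, here is a minimal sketch of a sequential-failure model. The inputs (5% annual failure rate per disk, a 12-hour mirror resilver, a 1500-hour wide-raidz3 resilver) are placeholders, not the figures behind the 4.4E-4 / 5E-5 numbers in the linked PDF, and the result is very sensitive to them; the point is only how the two probabilities are constructed:

```python
from math import comb

def mirror_pool_loss_per_year(n_mirrors, annual_fail_prob, resilver_hours):
    """Rough annual probability that some 2-way mirror loses both disks.

    Model: each disk fails with probability annual_fail_prob per year; the
    pool is lost if a failed disk's partner also dies before the (short)
    mirror resilver finishes.
    """
    per_hour = annual_fail_prob / 8760.0
    first_failures_per_year = 2 * n_mirrors * annual_fail_prob
    return first_failures_per_year * per_hour * resilver_hours

def raidz3_pool_loss_per_year(n_disks, annual_fail_prob, resilver_hours):
    """Rough annual probability that a raidz3 vdev loses 4 disks at once.

    Model: after a first failure, the pool is lost if any 3 of the remaining
    n_disks - 1 disks fail before the (long) resilver finishes.
    """
    per_hour = annual_fail_prob / 8760.0
    first_failures_per_year = n_disks * annual_fail_prob
    p_during_resilver = per_hour * resilver_hours
    return first_failures_per_year * comb(n_disks - 1, 3) * p_during_resilver ** 3

if __name__ == "__main__":
    # Placeholder inputs -- NOT the assumptions used in the PDF linked above.
    afr = 0.05
    mirrors = mirror_pool_loss_per_year(20, afr, resilver_hours=12)
    raidz3 = raidz3_pool_loss_per_year(23, afr, resilver_hours=1500)
    print(f"20 x 2-way mirrors: ~{mirrors:.1e} per year")
    print(f"23-disk raidz3:     ~{raidz3:.1e} per year")
    print(f"raidz3 / mirrors:   ~{raidz3 / mirrors:.0f}x")
```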
Re: [zfs-discuss] Supermicro AOC-USAS2-L8i
Hello all,

I have now ordered this controller card from LSI:

http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html

It has the same controller onboard as the Supermicro card. It plugs into a PCI Express 2.0 x8 slot, and the bracket is for normal cases. But it's _not_ MegaRAID. MegaRAID isn't necessary when using ZFS, and I won't be using a hardware RAID system here. ;-)

-- 
With kind regards,
Alexander
Re: [zfs-discuss] Finding corrupted files
On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:

> So, what would you suggest, if I wanted to create really big pools? Say in
> the 100 TB range? That would be quite a number of single drives then,
> especially when you want to go with zpool raid-1.

For 100 TB, the methods change dramatically. You can't just reload 100 TB from CD or tape. When you get to this scale you need to be thinking about raidz2+ *and* mirroring.

I will be exploring these issues of scale at the "Techniques for Managing Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
http://www.usenix.org/events/lisa10/training/

 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 8-16
ZFS and performance consulting
http://www.RichardElling.com
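To give a feel for the drive counts at that scale, a rough sizing sketch follows. The 2 TB drive size, 80% fill target, and 7-wide raidz2 layout are illustrative assumptions, not recommendations from the tutorial:

```python
import math

def drives_needed(usable_tb, drive_tb=2.0, fill_target=0.8,
                  layout="mirror", raidz_width=7, parity=2):
    """Rough drive count for a given usable capacity.

    layout is "mirror" (2-way) or "raidz" with the given width and parity.
    Metadata overhead, spares, and slop space are ignored.
    """
    raw_needed = usable_tb / fill_target
    if layout == "mirror":
        pairs = math.ceil(raw_needed / drive_tb)
        return pairs * 2
    per_vdev = (raidz_width - parity) * drive_tb
    return math.ceil(raw_needed / per_vdev) * raidz_width

if __name__ == "__main__":
    print("100 TB usable, 2-way mirrors: ", drives_needed(100), "drives")
    print("100 TB usable, 7-wide raidz2: ",
          drives_needed(100, layout="raidz"), "drives")
```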
[zfs-discuss] resilver question
Hi all

I'm seeing some rather bad resilver times for a pool of WD Green drives (I know, bad drives, but leave that). Does resilver go through the whole pool or just the VDEV in question?

-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every educator to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Supermicro AOC-USAS2-L8i
On Fri, Oct 15, 2010 at 5:18 PM, Maurice Volaski <maurice.vola...@einstein.yu.edu> wrote:

>> The mpt_sas driver supports it. We've had LSI 2004 and 2008 controllers
>> hang for quite some time when used with SuperMicro chassis and Intel
>> X25-E SSDs (OSOL b134 and b147). It seems to be a firmware issue that
>> isn't fixed with the last update.
>
> Do you mean to include all the PCIe cards, not just the AOC-USAS2-L8i, and
> when it's directly connected and not through the backplane? Prior reports
> here seem to be implicating the card only when it was connected to the
> backplane.

I only tested the LSI 2004/2008 HBAs connected to the backplane (both 3Gb/s and 6Gb/s). The MegaRAID ELP, when connected to the same backplane, doesn't exhibit that behavior.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
Re: [zfs-discuss] Finding corrupted files
On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
>> So, what would you suggest, if I wanted to create really big pools? Say
>> in the 100 TB range? That would be quite a number of single drives then,
>> especially when you want to go with zpool raid-1.
>
> For 100 TB, the methods change dramatically. You can't just reload 100 TB
> from CD or tape. When you get to this scale you need to be thinking about
> raidz2+ *and* mirroring.
>
> I will be exploring these issues of scale at the "Techniques for Managing
> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
> http://www.usenix.org/events/lisa10/training/

Hopefully your presentation will be available online after the event!

-- Pasi
Re: [zfs-discuss] resilver question
----- Original Message -----
> On 10/17/10 04:54 AM, Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I'm seeing some rather bad resilver times for a pool of WD Green
>> drives (I know, bad drives, but leave that). Does resilver go
>> through the whole pool or just the VDEV in question?
>
> The vdev only. All the data required to reconstruct a device in a vdev
> is stored on the other devices.

That's what I thought, but then

r...@urd:~# zpool status
  pool: dpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 8h46m, 2.31% done, 370h47m to go
config:

        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c8t2d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     0
            c8t6d0    ONLINE       0     0     0
            c8t7d0    ONLINE       0     0     0
          raidz2-2    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
            c9t1d0    ONLINE       0     0     0
            c9t2d0    ONLINE       0     0     0
            c9t3d0    ONLINE       0     0     0
            spare-4   ONLINE       0     0     0
              c9t4d0  ONLINE       0     0     0
              c9t7d0  ONLINE       0     0     0  43.5G resilvered
            c9t5d0    ONLINE       0     0     0
            c9t6d0    ONLINE       0     0     0
          raidz2-4    ONLINE       0     0     0
            c14t9d0   ONLINE       0     0     0
            c14t10d0  ONLINE       0     0     0
            c14t11d0  ONLINE       0     0     0
            c14t12d0  ONLINE       0     0     0
            c14t13d0  ONLINE       0     0     0
            c14t14d0  ONLINE       0     0     0
            c14t15d0  ONLINE       0     0     0
            c14t16d0  ONLINE       0     0     0
            c14t17d0  ONLINE       0     0     0
            c14t18d0  ONLINE       0     0     0
            c14t19d0  ONLINE       0     0     0
            c14t20d0  ONLINE       0     0     0
        logs
          mirror-3    ONLINE       0     0     0
            c10d1s0   ONLINE       0     0     0
            c11d0s0   ONLINE       0     0     0
        cache
          c10d1s1     ONLINE       0     0     0
          c11d0s1     ONLINE       0     0     0
        spares
          c9t7d0      INUSE     currently in use

errors: No known data errors

-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every educator to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
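As a sanity check, the reported time-to-go is consistent with a straight-line extrapolation from the percent-done figure. Resilver progress is rarely linear in practice, so treat this only as a rough estimate:

```python
def estimated_total_hours(elapsed_hours, percent_done):
    """Linear extrapolation of total resilver time from current progress."""
    return elapsed_hours / (percent_done / 100.0)

elapsed = 8 + 46 / 60.0                    # "resilver in progress for 8h46m"
total = estimated_total_hours(elapsed, 2.31)
print(f"estimated total:     {total:.0f} h")
print(f"estimated remaining: {total - elapsed:.0f} h "
      "(zpool reports 370h47m to go)")
```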
Re: [zfs-discuss] resilver question
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
>> The vdev only.

Right on. Furthermore, as shown in the "zpool status," a 7-disk raidz2 is certainly a reasonable vdev configuration.

> scrub: resilver in progress for 8h46m, 2.31% done, 370h47m to go

Ouch. I'll just say this much: during the resilver, be sure to disable auto-snapshots, scrubs, and "zfs send." Do everything you can to reduce the workload on the system.

Would it help to delete old snapshots? I'm not sure, but I think it probably would. The time to resilver is determined by how many slabs (stripes, blocks -- I'm not sure there's a single correct term here) exist inside that vdev. All 6 good disks seek and read their piece of each slab, the missing piece is reconstructed from the surviving data and parity, and the result is written to the resilvering disk. Repeat for every slab in the vdev. I think destroying snapshots would reduce the number of slabs that need to be processed.

In the future, consider using either (a) mirrors instead of raidzN, or (b) disks with higher spindle speeds and lower seek times.

If your HBA supports WriteBack, you might improve resilver speed by enabling WB on the disk that is resilvering. But consider that temporary, and go back to WriteThrough after the resilver is completed.
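To make the reconstruct-and-write step above concrete, here is a toy single-parity (XOR) rebuild. It is not the actual ZFS code path, and raidz2's second parity column uses a different code; this only illustrates the "read the surviving pieces, recompute the missing one" idea:

```python
def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A toy 4-column stripe: three data columns plus one XOR parity column.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the disk holding column 1 died: rebuild it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
print("rebuilt column:", rebuilt)
```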
Re: [zfs-discuss] Finding corrupted files
On Oct 16, 2010, at 4:13 PM, Pasi Kärkkäinen wrote:

> On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
>> I will be exploring these issues of scale at the "Techniques for Managing
>> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
>> http://www.usenix.org/events/lisa10/training/
>
> Hopefully your presentation will be available online after the event!

Sure, though I would encourage everyone to attend :-)
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 8-16, 2010
ZFS and performance consulting
http://www.RichardElling.com
Re: [zfs-discuss] resilver question
On Oct 16, 2010, at 8:54 AM, Roy Sigurd Karlsbakk wrote:

> Hi all
>
> I'm seeing some rather bad resilver times for a pool of WD Green drives (I
> know, bad drives, but leave that). Does resilver go through the whole pool
> or just the VDEV in question?

Resilvers are done in time order. The metadata is traversed starting with the first txg and moving forward to the current txg. The good news is that only data is resilvered. The bad news for HDD fans is that HDDs do not like random workloads.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 8-16
ZFS and performance consulting
http://www.RichardElling.com
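A synthetic illustration of why a time-ordered traversal turns into random I/O on an aged pool. Block placement is simply shuffled here to stand in for fragmentation; this is not how ZFS lays out data:

```python
import random

random.seed(1)

# Synthetic blocks: each has a birth txg and an on-disk offset.  On an aged,
# fragmented pool the two orderings have little to do with each other.
offsets = random.sample(range(1000), 1000)
blocks = [{"txg": txg, "offset": off} for txg, off in enumerate(offsets)]

def total_seek_distance(ordering):
    """Sum of head movement if blocks are visited in the given order."""
    position, distance = 0, 0
    for block in ordering:
        distance += abs(block["offset"] - position)
        position = block["offset"]
    return distance

by_txg = sorted(blocks, key=lambda b: b["txg"])     # how resilver walks them
by_lba = sorted(blocks, key=lambda b: b["offset"])  # a purely sequential pass

print("head travel, txg order:", total_seek_distance(by_txg))
print("head travel, LBA order:", total_seek_distance(by_lba))
```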
Re: [zfs-discuss] adding new disks and setting up a raidz2
I tried using format to format the drive and got the following:

  Ready to format. Formatting cannot be interrupted
  and takes 5724 minutes (estimated). Continue? y
  Beginning format. The current time is Sat Oct 16 23:58:17 2010

  Formatting...
  Format failed

  Retry of formatting operation without any of the standard
  mode selects and ignoring disk's Grown Defects list.  The
  disk may be able to be reformatted this way if an earlier
  formatting operation was interrupted by a power failure or
  SCSI bus reset.  The Grown Defects list will be recreated
  by format verification and surface analysis.

  Retry format without mode selects and Grown Defects list? y
  Formatting...
  Illegal request during format
  ASC: 0x24   ASCQ: 0x0
  Illegal request during format
  ASC: 0x24   ASCQ: 0x0
  failed

Is there any way for me to determine from OpenSolaris whether the disk is defective? I tried rotating the disk to a different bay, and the problem moved to the new bay (c0t5000C500268D0821d0p0). When I try using format's fdisk I get an error as well (fdisk: Error in ioctl DKIOCSMBOOT on /dev/rdsk/c0t5000C500268D0821d0p0). I also noticed that the format command takes much, much longer with this particular disk compared to the other 7.
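For what it's worth, the ASC/ASCQ pair in that output maps to a standard SCSI sense code. A tiny lookup sketch with the relevant entries from the usual ASC/ASCQ tables:

```python
# Tiny subset of the standard SCSI additional sense code (ASC/ASCQ) table;
# only entries relevant to format failures are listed.
SENSE_CODES = {
    (0x24, 0x00): "INVALID FIELD IN CDB",
    (0x31, 0x00): "MEDIUM FORMAT CORRUPTED",
    (0x31, 0x01): "FORMAT COMMAND FAILED",
}

def decode(asc, ascq):
    return SENSE_CODES.get((asc, ascq),
                           f"unknown (ASC 0x{asc:02x}, ASCQ 0x{ascq:02x})")

print(decode(0x24, 0x00))   # the code reported twice by format above
```

INVALID FIELD IN CDB suggests the drive rejected the format request itself rather than reporting a media failure, which, together with the problem following the disk from bay to bay, points at the drive (or its firmware) rather than the slot or backplane.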
Re: [zfs-discuss] resilver question
On 10/17/10 12:37 PM, Roy Sigurd Karlsbakk wrote:
> ----- Original Message -----
>> On 10/17/10 04:54 AM, Roy Sigurd Karlsbakk wrote:
>>> Hi all
>>>
>>> I'm seeing some rather bad resilver times for a pool of WD Green
>>> drives (I know, bad drives, but leave that). Does resilver go
>>> through the whole pool or just the VDEV in question?
>>
>> The vdev only. All the data required to reconstruct a device in a vdev
>> is stored on the other devices.
>
> That's what I thought, but then
>
> r...@urd:~# zpool status
>   pool: dpool
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 8h46m, 2.31% done, 370h47m to go

I'm not sure what that's supposed to prove. Run zpool iostat -v to see where the activity is.

-- 
Ian.