Re: [zfs-discuss] ZFS & ZPOOL => trash
Well, if this is not a root disk and the server boots at least to single-user, as you wrote above, you can try to disable auto-import of this pool. Easiest of all is to disable auto-imports of all pools by removing or renaming the file /etc/zfs/zpool.cache - it is a list of known pools for automatic import. Without it, only your root pool will be imported, and all other pools (those without problems) must be re-imported and re-cached by ZFS into this file. Then your server will work (except the pool and local zone in it), and you can go on about fixing it. Did you already try the "zpool import -F" command? Good luck, //Jim -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
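For illustration, the sequence Jim describes might look roughly like this (pool names here are hypothetical, and the -F recovery option needs a reasonably recent zpool version):

# Stop auto-import of all data pools by setting the cache file aside
mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
reboot
# After boot only the root pool is up; re-import the healthy pools so they
# get re-cached into a fresh zpool.cache, leaving the damaged one alone
zpool import healthypool
# Dry-run the recovery import on the damaged pool, then the real thing
zpool import -Fn damagedpool
zpool import -F damagedpool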
Re: [zfs-discuss] 350TB+ storage solution
On 05/15/2011 09:58 PM, Richard Elling wrote: In one of my systems, I have 1TB mirrors, 70% full, which can be sequentially completely read/written in 2 hrs. But the resilver took 12 hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, using 10 disks + 3 parity, and a usage pattern similar to mine, your resilver time would have been minimum 10 days, bollix likely approaching 20 or 30 days. (Because you wouldn't get 2-3 weeks of consecutive idle time, and the random access time for a raidz approaches 2x the random access time of a mirror.) totally untrue BTW, the reason I chose 10+3 disks above was just because it makes calculation easy. It's easy to multiply by 10. I'm not suggesting using that configuration. You may notice that I don't recommend raidz for most situations. I endorse mirrors because they minimize resilver time (and maximize performance in general). Resilver time is a problem for ZFS, which they may fix someday. Resilver time is not a significant problem with ZFS. Resilver time is a much bigger problem with traditional RAID systems. In any case, it is bad systems engineering to optimize a system for best resilver time. -- richard Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2, whereas I almost never see a hardware RAID controller take more than a day or two. In one case I thrashed the disks absolutely as hard as I could (hardware controller) and finally was able to get the rebuild to take almost 1 week. Here is an example of one right now: pool: raid3060 state: ONLINE status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go config: ZFS resilver can take a very long time depending on your usage pattern. I do disagree with some things he said, though... like a 1TB drive being able to be read/written in 2 hours? I seriously doubt this. Just reading 1 TB in 2 hours means an average speed of over 130 megabytes/sec. Only really new 1TB drives will even hit that type of speed at the beginning of the drive, and the average would be much closer to 100 MB/sec by the end of the drive. Also, that is a best-case scenario. I know 1TB drives (when they first came out) took around 4-5 hours to do a complete read of all data on the disk at full speed. There is definitely no way to be that fast with reading *and* writing 1TB of data to the drive. I guess if you count reading from one drive and writing to the other, 3 hours is a much more likely figure, and that's the best case. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote: > > Actually I have seen resilvers take a very long time (weeks) on > solaris/raidz2 when I almost never see a hardware raid controller take more > than a day or two. In one case i thrashed the disks absolutely as hard as I > could (hardware controller) and finally was able to get the rebuild to take > almost 1 week.. Here is an example of one right now: > > pool: raid3060 > state: ONLINE > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. > scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go > config: > > Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks are maxed out on IOPS while still reading 1-2MB/s and there are rarely any writes. I've written about it before here (and provided data). My only guess is that fragmentation is a real problem in a scrub/resilver situation, but whenever the conversation changes to pointing out weaknesses in ZFS we start seeing "that is not a problem" comments. With the 7000s appliance I've heard that the 900hr estimated resilver time was "normal" and "everything is working as expected". Can't help but think there is some walled garden syndrome floating around. -- Giovanni Tirloni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
I have to agree. ZFS needs a more intelligent scrub/resilver algorithm, which can 'sequentialise' the process. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. Giovanni Tirloni wrote: On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote: Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2 when I almost never see a hardware raid controller take more than a day or two. In one case i thrashed the disks absolutely as hard as I could (hardware controller) and finally was able to get the rebuild to take almost 1 week.. Here is an example of one right now: pool: raid3060 state: ONLINE status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go config: Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks are maxed out on IOPS while still reading 1-2MB/s and there rarely is any writes. I've written about it before here (and provided data). My only guess is that fragmentation is a real problem in a scrub/resilver situation but whenever the conversation changes to point weaknesses in ZFS we start seeing "that is not a problem" comments. With the 7000s appliance I've heard that the 900hr estimated resilver time was "normal" and "everything is working as expected". Can't help but think there is some walled garden syndrome floating around. -- Giovanni Tirloni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
> From: Richard Elling [mailto:richard.ell...@gmail.com] > > > In one of my systems, I have 1TB mirrors, 70% full, which can be > > sequentially completely read/written in 2 hrs. But the resilver took 12 > > hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, > > using 10 disks + 3 parity, and a usage pattern similar to mine, your > > resilver time would have been minimum 10 days, > > bollix > > Resilver time is not a significant problem with ZFS. Resilver time is a much > bigger problem with traditional RAID systems. In any case, it is bad systems > engineering to optimize a system for best resilver time. Because RE seems to be emotionally involved with ZFS resilver times, I don't believe it's going to be productive for me to try addressing his off-hand comments. Instead, I'm only going to say this much: In my system mentioned above, a complete disk can be copied to another complete disk, sequentially, in 131 minutes. But during idle time it took 12 hours because ZFS resilver only does the used parts of disk, in essentially random order. So ZFS resilver often takes many times longer than a complete hardware-based complete disk resilver. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
> From: Sandon Van Ness [mailto:san...@van-ness.com] > > ZFS resilver can take a very long time depending on your usage pattern. > I do disagree with some things he said though... like a 1TB drive being > able to be read/written in 2 hours? I seriously doubt this. Just reading > 1 TB in 2 hours means an average speed of over 130 megabytes/sec. 1Gbit/sec sustainable sequential disk speed is not uncommon these days, and it is in fact the performance of the disks in the system in question. SATA 7.2krpm disks... Not even special disks. Just typical boring normal disks. > Definitely no way to be that fast with reading *and* writing 1TB of data > to the drive. I guess if you count reading from one and writing to the > other. 3 hours is a much more likely figure and best case. No need to read & write from the same drive. You can read from one drive and write to the other simultaneously at full speed. If there is any performance difference between read & write on these drives, it's not measurable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
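For reference, 1 TB in 2 hours works out to roughly 10^12 bytes / 7200 s, i.e. about 139 MB/s sustained, a little over 1 Gbit/s. A crude way to check what a particular drive pair can really sustain is a raw sequential dd; the device names below are made up, on Solaris you would use the /dev/rdsk character devices, and note that the second command overwrites the target disk:

# Sequential read rate over the first ~10 GB of one disk
dd if=/dev/rdsk/c5t5d0p0 of=/dev/null bs=1024k count=10000
# Disk-to-disk copy, reading one drive while writing the other
# (destructive to the target disk!)
dd if=/dev/rdsk/c5t5d0p0 of=/dev/rdsk/c5t6d0p0 bs=1024k count=10000

If your dd does not print a rate, wrap it in time(1); also remember this only measures the fast outer tracks unless you let it run across the whole disk.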
Re: [zfs-discuss] Extremely slow zpool scrub performance
> Can you share your 'zpool status' output for both pools? Faster, smaller server: ~# zpool status pool0 pool: pool0 state: ONLINE scan: scrub repaired 0 in 2h18m with 0 errors on Sat May 14 13:28:58 2011 Much larger, more capable server: ~# zpool status pool0 | head pool: pool0 state: ONLINE scan: scrub in progress since Fri May 13 14:04:46 2011 173G scanned out of 14.2T at 737K/s, (scan is slow, no estimated time) 43K repaired, 1.19% done The only other relevant line is: c5t9d0 ONLINE 0 0 0 (repairing) (That's new as of this morning- though it was still very slow before that) > Also you may want to run the following a few times in a loop and > provide the output: > > # echo "::walk spa | ::print spa_t spa_name spa_last_io > spa_scrub_inflight" | mdb -k ~# echo "::walk spa | ::print spa_t spa_name spa_last_io > spa_scrub_inflight" | mdb -k spa_name = [ "pool0" ] spa_last_io = 0x159b275a spa_name = [ "rpool" ] spa_last_io = 0x159b210a mdb: failed to dereference symbol: unknown symbol name I'm pretty sure that's not the output you were looking for :) On the same theme- is there a good reference for all of the various ZFS debugging commands and mdb options? I'd love to spend a lot of time just looking at the data available to me but every time I turn around someone suggests a new and interesting mdb query I've never seen before. Thanks, -Don ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On May 16, 2011, at 5:02 AM, Sandon Van Ness wrote: > On 05/15/2011 09:58 PM, Richard Elling wrote: >>> In one of my systems, I have 1TB mirrors, 70% full, which can be >>> sequentially completely read/written in 2 hrs. But the resilver took 12 >>> hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, >>> using 10 disks + 3 parity, and a usage pattern similar to mine, your >>> resilver time would have been minimum 10 days, >> bollix >> >>> likely approaching 20 or 30 >>> days. (Because you wouldn't get 2-3 weeks of consecutive idle time, and the >>> random access time for a raidz approaches 2x the random access time of a >>> mirror.) >> totally untrue >> >>> BTW, the reason I chose 10+3 disks above was just because it makes >>> calculation easy. It's easy to multiply by 10. I'm not suggesting using >>> that configuration. You may notice that I don't recommend raidz for most >>> situations. I endorse mirrors because they minimize resilver time (and >>> maximize performance in general). Resilver time is a problem for ZFS, which >>> they may fix someday. >> Resilver time is not a significant problem with ZFS. Resilver time is a much >> bigger problem with traditional RAID systems. In any case, it is bad systems >> engineering to optimize a system for best resilver time. >> -- richard > > Actually I have seen resilvers take a very long time (weeks) on > solaris/raidz2 when I almost never see a hardware raid controller take more > than a day or two. In one case i thrashed the disks absolutely as hard as I > could (hardware controller) and finally was able to get the rebuild to take > almost 1 week.. Here is an example of one right now: > > pool: raid3060 > state: ONLINE > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. > scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go > config: I have seen worse cases, but the root cause was hardware failures that are not reported by zpool status. Have you checked the health of the disk transports? Hint: fmdump -e Also, what zpool version is this? There were improvements made in the prefetch and the introduction of throttles last year. One makes it faster, the other intentionally slows it down. As a rule of thumb, the resilvering disk is expected to max out at around 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect the throttles or broken data path. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
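A quick sketch of the checks Richard suggests (the pool name is taken from the paste above; adjust as needed):

# FMA error telemetry - transport and device errors that zpool status won't show
fmdump -e
# Per-device soft/hard/transport error counters
iostat -En
# Pool version, and the versions this build supports
zpool get version raid3060
zpool upgrade -v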
Re: [zfs-discuss] 350TB+ storage solution
following are some thoughts if it's not too late: > 1 SuperMicro 847E1-R1400LPB I guess you meant the 847E16-R1400LPB, the SAS1 version makes no sense > 1 SuperMicro H8DG6-F not the best choice, see below why > 171 Hitachi 7K3000 3TB I'd go for the more environmentally friendly Ultrastar 5K3000 version - with that many drives you won't mind the slower rotation but WILL notice a difference in power and cooling cost > 1 LSI SAS 9202-16e This is really only a very expensive gadget to be honest, there's really no point to it - especially true when you start looking for the necessary cables that use a connector that's still in "draft" specification... Stick to the excellent LSI SAS9200-8e, of which you will need at least 3 in your setup, one to connect each of the 3 JBODs - with them filled with fast drives like you chose, you will need two links (one for the front and one for the back backplane) as daisy-chaining the backplanes together would oversaturate a single link. If you want to take advantage of the dual expanders on your JBOD backplanes for additional redundancy in case of expander or controller failure, you will need 6 of those LSI SAS9200-8e - this is where your board isn't ideal as it has a 3/1/2 PCIe x16/x8/x4 configuration while you'd need 6 PCIe x8 - something the X8DTH-6F will provide, as well as the onboard LSI SAS2008 based HBA for the two backplanes in the server case. > 1 LSI SAS 9211-4i > 2 OCZ 64GB SSD Vertex 3 > 2 OCZ 256GB SSD Vertex 3 If these are meant to be connected together and used as ZIL+L2ARC, then I'd STRONGLY urge you to get the following instead: 1x LSI MegaRAID SAS 9265-8i 1x LSI FastPath licence 4-8x 120GB or 240GB Vertex 3 Max IOPS Edition, whatever suits the budget. This solution allows you to push around 400k IOPS to the cache, more than likely way more than the stated application of the system will need > 1 NeterionX3120SR0001 I don't know this card personally but since it's not listed as supported (http://www.sun.com/io_technologies/nic/NIC1.html) I'd be careful > My question is what is the optimum way of dividing > these drives across vdevs? I would do 14 x 12-drive raidz2 + 3 spares = 140*3TB = ~382TiB usable. This would allow for a logical mapping of drives to vdevs, giving you in each case 2 vdevs in the front and 1 in the back, with the 9-drive blocks in the back of the JBODs used as 3 x 4/4/1, giving the remaining 2 x 12-drive vdevs plus one spare per case > I could also go with 2TB drives and add an extra 45 > JBOD chassis. This would significantly decrease cost, > but I'm running a gauntlet by getting very close to > minimum useable space. > > 12 x 18 drive raidz2 I would never do vdevs that large, it's just an accident waiting to happen! Hopefully these recommendations help you with your project. In any case, it's huge - the biggest system I worked on (which I actually have at home, go figure) only has a bit over 100TB in the following configuration: 6 x 12 drive raidz2 of Hitachi 5K3000 2TB 3 Norco 4224 with a HP SAS Expander in each Supermicro X8DTi-LN4F with 3x LSI SAS9200-8e So yeah, I based my thoughts on my own system, but considering that it's been running smoothly for a while now (and that I had a very similar setup with smaller drives and older controllers before), I'm confident in my suggestions Regards from Switzerland, voyman -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
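To make the proposed layout concrete: 14 x 12-disk raidz2 leaves 14 x 10 = 140 data disks, and 140 x 3 TB = 420 TB, roughly 382 TiB before ZFS overhead. The create command follows the usual pattern below; the pool name and device names are invented, and only the first two of the 14 vdevs plus the spares are shown:

zpool create tank \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
  raidz2 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0 \
  spare c4t0d0 c4t1d0 c4t2d0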
Re: [zfs-discuss] ZFS & ZPOOL => trash
All these zpool corrupted are the root of local zones -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Sat, May 14, 2011 at 11:20 PM, John Doe wrote: >> 171 Hitachi 7K3000 3TB > I'd go for the more environmentally friendly Ultrastar 5K3000 version - with > that many drives you wont mind the slower rotation but WILL notice a > difference in power and cooling cost A word of caution - The Hitachi Deskstar 5K3000 drives in 1TB and 2TB are different than the 3TB. The 1TB and 2TB are manufactured in China, and have a very high failure and DOA rate according to Newegg. The 3TB drives come off the same production line as the Ultrastar 5K3000 in Thailand and may be more reliable. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
Don, Can you send the entire 'zpool status' output? I wanted to see your pool configuration. Also run the mdb command in a loop (at least 5 times) so we can see if spa_last_io is changing. I'm surprised you're not finding the symbol for 'spa_scrub_inflight' too. Can you check that you didn't mistype this? Thanks, George On Mon, May 16, 2011 at 7:41 AM, Donald Stahl wrote: >> Can you share your 'zpool status' output for both pools? > Faster, smaller server: > ~# zpool status pool0 > pool: pool0 > state: ONLINE > scan: scrub repaired 0 in 2h18m with 0 errors on Sat May 14 13:28:58 2011 > > Much larger, more capable server: > ~# zpool status pool0 | head > pool: pool0 > state: ONLINE > scan: scrub in progress since Fri May 13 14:04:46 2011 > 173G scanned out of 14.2T at 737K/s, (scan is slow, no estimated time) > 43K repaired, 1.19% done > > The only other relevant line is: > c5t9d0 ONLINE 0 0 0 (repairing) > > (That's new as of this morning- though it was still very slow before that) > >> Also you may want to run the following a few times in a loop and >> provide the output: >> >> # echo "::walk spa | ::print spa_t spa_name spa_last_io >> spa_scrub_inflight" | mdb -k > ~# echo "::walk spa | ::print spa_t spa_name spa_last_io >> spa_scrub_inflight" | mdb -k > spa_name = [ "pool0" ] > spa_last_io = 0x159b275a > spa_name = [ "rpool" ] > spa_last_io = 0x159b210a > mdb: failed to dereference symbol: unknown symbol name > > I'm pretty sure that's not the output you were looking for :) > > On the same theme- is there a good reference for all of the various > ZFS debugging commands and mdb options? > > I'd love to spend a lot of time just looking at the data available to > me but every time I turn around someone suggests a new and interesting > mdb query I've never seen before. > > Thanks, > -Don > -- George Wilson M: +1.770.853.8523 F: +1.650.494.1676 275 Middlefield Road, Suite 50 Menlo Park, CA 94025 http://www.delphix.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
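A sketch of the loop George asks for; the interval and count are arbitrary, and the whole mdb expression has to stay on a single line (the stray line break is what produces the "unknown symbol name" error seen above):

for i in 1 2 3 4 5; do
  echo "::walk spa | ::print spa_t spa_name spa_last_io spa_scrub_inflight" | mdb -k
  sleep 5
done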
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 8:33 AM, Richard Elling wrote: > As a rule of thumb, the resilvering disk is expected to max out at around > 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect > the throttles or broken data path. My system was doing far less than 80 IOPS during resilver when I recently upgraded the drives. The older and newer drives were both 5k RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to be super fast. The worst resilver was 50 hours, the best was about 20 hours. This was just my home server, which is lightly used. The clients (2-3 CIFS clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS clients) are mostly idle and don't do a lot of writes. Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things up a bit, which suggests that the default values may be too conservative for some environments. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
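For reference, the tunables Brandon mentions are live kernel variables and can be inspected and changed with mdb; the values below are only examples, and the variable defaults can vary between builds, so treat this as a sketch:

# Current values (32-bit decimal)
echo "zfs_resilver_delay/D" | mdb -k
echo "zfs_resilver_min_time_ms/D" | mdb -k
# Example: drop the per-I/O delay and give resilver more time per txg sync
echo "zfs_resilver_delay/W0t0" | mdb -kw
echo "zfs_resilver_min_time_ms/W0t5000" | mdb -kw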
Re: [zfs-discuss] Extremely slow zpool scrub performance
> Can you send the entire 'zpool status' output? I wanted to see your > pool configuration. Also run the mdb command in a loop (at least 5 > tiimes) so we can see if spa_last_io is changing. I'm surprised you're > not finding the symbol for 'spa_scrub_inflight' too. Can you check > that you didn't mistype this? I copy and pasted to make sure that wasn't the issue :) I will run it in a loop this time. I didn't do it last time because of the error. This box was running only raidz sets originally. After running into performance problems we added a bunch of mirrors to try to improve the iops. The logs are not mirrored right now as we were testing adding the other two as cache disks to see if that helped. We've also tested using a ramdisk ZIL to see if that made any difference- it did not. The performance on this box was excellent until it started to fill up (somewhere around 70%)- then performance degraded significantly. We added more disks, and copied the data around to rebalance things. It seems to have helped somewhat- but it is nothing like when we first created the array. config: NAMESTATE READ WRITE CKSUM pool0 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c5t8d0 ONLINE 0 0 0 raidz1-1 ONLINE 0 0 0 c5t9d0 ONLINE 0 0 0 (repairing) c5t10d0 ONLINE 0 0 0 c5t11d0 ONLINE 0 0 0 c5t12d0 ONLINE 0 0 0 raidz1-2 ONLINE 0 0 0 c5t13d0 ONLINE 0 0 0 c5t14d0 ONLINE 0 0 0 c5t15d0 ONLINE 0 0 0 c5t16d0 ONLINE 0 0 0 raidz1-3 ONLINE 0 0 0 c5t21d0 ONLINE 0 0 0 c5t22d0 ONLINE 0 0 0 c5t23d0 ONLINE 0 0 0 c5t24d0 ONLINE 0 0 0 raidz1-4 ONLINE 0 0 0 c5t25d0 ONLINE 0 0 0 c5t26d0 ONLINE 0 0 0 c5t27d0 ONLINE 0 0 0 c5t28d0 ONLINE 0 0 0 raidz1-5 ONLINE 0 0 0 c5t29d0 ONLINE 0 0 0 c5t30d0 ONLINE 0 0 0 c5t31d0 ONLINE 0 0 0 c5t32d0 ONLINE 0 0 0 raidz1-6 ONLINE 0 0 0 c5t33d0 ONLINE 0 0 0 c5t34d0 ONLINE 0 0 0 c5t35d0 ONLINE 0 0 0 c5t36d0 ONLINE 0 0 0 raidz1-7 ONLINE 0 0 0 c5t37d0 ONLINE 0 0 0 c5t38d0 ONLINE 0 0 0 c5t39d0 ONLINE 0 0 0 c5t40d0 ONLINE 0 0 0 raidz1-8 ONLINE 0 0 0 c5t41d0 ONLINE 0 0 0 c5t42d0 ONLINE 0 0 0 c5t43d0 ONLINE 0 0 0 c5t44d0 ONLINE 0 0 0 raidz1-10 ONLINE 0 0 0 c5t45d0 ONLINE 0 0 0 c5t46d0 ONLINE 0 0 0 c5t47d0 ONLINE 0 0 0 c5t48d0 ONLINE 0 0 0 raidz1-11 ONLINE 0 0 0 c5t49d0 ONLINE 0 0 0 c5t50d0 ONLINE 0 0 0 c5t51d0 ONLINE 0 0 0 c5t52d0 ONLINE 0 0 0 raidz1-12 ONLINE 0 0 0 c5t53d0 ONLINE 0 0 0 c5t54d0 ONLINE 0 0 0 c5t55d0 ONLINE 0 0 0 c5t56d0 ONLINE 0 0 0 raidz1-13 ONLINE 0 0 0 c5t57d0 ONLINE 0 0 0 c5t58d0 ONLINE 0 0 0 c5t59d0 ONLINE 0 0 0 c5t60d0 ONLINE 0 0 0 raidz1-14 ONLINE 0 0 0 c5t61d0 ONLINE 0 0 0 c5t62d0 ONLINE 0 0 0 c5t63d0 ONLI
Re: [zfs-discuss] Extremely slow zpool scrub performance
> I copy and pasted to make sure that wasn't the issue :) Which, ironically, turned out to be the problem- there was an extra carriage return in there that mdb did not like: Here is the output: spa_name = [ "pool0" ] spa_last_io = 0x82721a4 spa_scrub_inflight = 0x1 spa_name = [ "pool0" ] spa_last_io = 0x8272240 spa_scrub_inflight = 0x1 spa_name = [ "pool0" ] spa_last_io = 0x82722f0 spa_scrub_inflight = 0x1 spa_name = [ "pool0" ] spa_last_io = 0x827239e spa_scrub_inflight = 0 spa_name = [ "pool0" ] spa_last_io = 0x8272441 spa_scrub_inflight = 0x1 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
Here is another example of the performance problems I am seeing: ~# dd if=/dev/zero of=/pool0/ds.test bs=1024k count=2000 2000+0 records in 2000+0 records out 2097152000 bytes (2.1 GB) copied, 56.2184 s, 37.3 MB/s 37MB/s seems like some sort of bad joke for all these disks. I can write the same amount of data to a set of 6 SAS disks on a Dell PERC6/i at a rate of 160MB/s, and those disks are hosting 25 VMs and a lot more IOPS than this box. zpool iostat during the same time shows: pool0 14.2T 25.3T 124 1.30K 981K 4.02M pool0 14.2T 25.3T 277 914 2.16M 23.2M pool0 14.2T 25.3T 65 4.03K 526K 90.2M pool0 14.2T 25.3T 18 1.76K 136K 6.81M pool0 14.2T 25.3T 460 5.55K 3.60M 111M pool0 14.2T 25.3T 160 0 1.24M 0 pool0 14.2T 25.3T 182 2.34K 1.41M 33.3M The zeros and other low numbers don't make any sense. And as I mentioned- the busy percent and service times of these disks are never abnormally high- especially when compared to the much smaller, better-performing pool I have. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On May 16, 2011, at 10:31 AM, Brandon High wrote: > On Mon, May 16, 2011 at 8:33 AM, Richard Elling > wrote: >> As a rule of thumb, the resilvering disk is expected to max out at around >> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect >> the throttles or broken data path. > > My system was doing far less than 80 IOPS during resilver when I > recently upgraded the drives. The older and newer drives were both 5k > RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to > be super fast. > > The worst resilver was 50 hours, the best was about 20 hours. This was > just my home server, which is lightly used. The clients (2-3 CIFS > clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS > clients) are mostly idle and don't do a lot of writes. > > Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things > up a bit, which suggests that the default values may be too > conservative for some environments. I am more inclined to change the hires_tick value. The "delays" are in units of clock ticks. For Solaris, the default clock tick is 10ms, that I will argue is too large for modern disk systems. What this means is that when the resilver, scrub, or memory throttle causes delays, the effective IOPS is driven to 10 or less. Unfortunately, these values are guesses and are probably suboptimal for various use cases. OTOH, the prior behaviour of no resilver or scrub throttle was also considered a bad thing. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
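For completeness, hires_tick is a global /etc/system tunable: it raises the clock rate from 100 Hz to 1000 Hz, so a one-tick throttle delay becomes about 1 ms instead of about 10 ms. It takes a reboot and affects the whole system, so this is only a sketch of the change Richard is hinting at:

* /etc/system: run the system clock at 1000 Hz instead of 100 Hz
set hires_tick=1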
Re: [zfs-discuss] Extremely slow zpool scrub performance
You mentioned that the pool was somewhat full, can you send the output of 'zpool iostat -v pool0'? You can also try doing the following to reduce 'metaslab_min_alloc_size' to 4K: echo "metaslab_min_alloc_size/Z 1000" | mdb -kw NOTE: This will change the running system so you may want to make this change during off-peak hours. Then check your performance and see if it makes a difference. - George On Mon, May 16, 2011 at 10:58 AM, Donald Stahl wrote: > Here is another example of the performance problems I am seeing: > > ~# dd if=/dev/zero of=/pool0/ds.test bs=1024k count=2000 2000+0 records in > 2000+0 records out > 2097152000 bytes (2.1 GB) copied, 56.2184 s, 37.3 MB/s > > 37MB/s seems like some sort of bad joke for all these disks. I can > write the same amount of data to a set of 6 SAS disks on a Dell > PERC6/i at a rate of 160MB/s and those disks are hosting 25 vm's and a > lot more IOPS than this box. > > zpool iostat during the same time shows: > pool0 14.2T 25.3T 124 1.30K 981K 4.02M > pool0 14.2T 25.3T 277 914 2.16M 23.2M > pool0 14.2T 25.3T 65 4.03K 526K 90.2M > pool0 14.2T 25.3T 18 1.76K 136K 6.81M > pool0 14.2T 25.3T 460 5.55K 3.60M 111M > pool0 14.2T 25.3T 160 0 1.24M 0 > pool0 14.2T 25.3T 182 2.34K 1.41M 33.3M > > The zero's and other low numbers don't make any sense. And as I > mentioned- the busy percent and service times of these disks are never > abnormally high- especially when compared to the much smaller, better > performing pool I have. > -- George Wilson M: +1.770.853.8523 F: +1.650.494.1676 275 Middlefield Road, Suite 50 Menlo Park, CA 94025 http://www.delphix.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
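Note the /Z argument is hexadecimal, so 1000 here means 0x1000 = 4 KB (against a default of 10 MB at the time). To see the value before and after the change:

# Print the current value as a 64-bit integer (output is in hex)
echo "metaslab_min_alloc_size/J" | mdb -k
# Apply the 4 KB setting on the running kernel
echo "metaslab_min_alloc_size/Z 1000" | mdb -kw

If the change helps and needs to survive a reboot, the equivalent /etc/system entry would presumably be "set zfs:metaslab_min_alloc_size=0x1000", but verify that against the build in use before relying on it.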
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 1:20 PM, Brandon High wrote: > The 1TB and 2TB are manufactured in China, and have a very high > failure and DOA rate according to Newegg. > > The 3TB drives come off the same production line as the Ultrastar > 5K3000 in Thailand and may be more reliable. Thanks for the heads up, I was thinking about 5K3000s to finish out my build (currently have Barracuda LPs). I do wonder how much of that DOA is due to newegg HDD packaging/shipping, however. --khd ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
> You mentioned that the pool was somewhat full, can you send the output > of 'zpool iostat -v pool0'? ~# zpool iostat -v pool0 capacity operationsbandwidth poolalloc free read write read write -- - - - - - - pool0 14.1T 25.4T926 2.35K 7.20M 15.7M raidz1 673G 439G 42117 335K 790K c5t5d0 - - 20 20 167K 273K c5t6d0 - - 20 20 167K 272K c5t7d0 - - 20 20 167K 273K c5t8d0 - - 20 20 167K 272K raidz1 710G 402G 38 84 309K 546K c5t9d0 - - 18 16 158K 189K c5t10d0 - - 18 16 157K 187K c5t11d0 - - 18 16 158K 189K c5t12d0 - - 18 16 157K 187K raidz1 719G 393G 43 95 348K 648K c5t13d0 - - 20 17 172K 224K c5t14d0 - - 20 17 171K 223K c5t15d0 - - 20 17 172K 224K c5t16d0 - - 20 17 172K 223K raidz1 721G 391G 42 96 341K 653K c5t21d0 - - 20 16 170K 226K c5t22d0 - - 20 16 169K 224K c5t23d0 - - 20 16 170K 226K c5t24d0 - - 20 16 170K 224K raidz1 721G 391G 43100 342K 667K c5t25d0 - - 20 17 172K 231K c5t26d0 - - 20 17 172K 229K c5t27d0 - - 20 17 172K 231K c5t28d0 - - 20 17 172K 229K raidz1 721G 391G 43101 341K 672K c5t29d0 - - 20 18 173K 233K c5t30d0 - - 20 18 173K 231K c5t31d0 - - 20 18 173K 233K c5t32d0 - - 20 18 173K 231K raidz1 722G 390G 42100 339K 667K c5t33d0 - - 20 19 171K 231K c5t34d0 - - 20 19 172K 229K c5t35d0 - - 20 19 171K 231K c5t36d0 - - 20 19 171K 229K raidz1 709G 403G 42107 341K 714K c5t37d0 - - 20 20 171K 247K c5t38d0 - - 20 19 170K 245K c5t39d0 - - 20 20 171K 247K c5t40d0 - - 20 19 170K 245K raidz1 744G 368G 39 79 316K 530K c5t41d0 - - 18 16 163K 183K c5t42d0 - - 18 15 163K 182K c5t43d0 - - 18 16 163K 183K c5t44d0 - - 18 15 163K 182K raidz1 737G 375G 44 98 355K 668K c5t45d0 - - 21 18 178K 231K c5t46d0 - - 21 18 178K 229K c5t47d0 - - 21 18 178K 231K c5t48d0 - - 21 18 178K 229K raidz1 733G 379G 43103 344K 683K c5t49d0 - - 20 19 175K 237K c5t50d0 - - 20 19 175K 235K c5t51d0 - - 20 19 175K 237K c5t52d0 - - 20 19 175K 235K raidz1 732G 380G 43104 344K 685K c5t53d0 - - 20 19 176K 237K c5t54d0 - - 20 19 175K 235K c5t55d0 - - 20 19 175K 237K c5t56d0 - - 20 19 175K 235K raidz1 733G 379G 43101 344K 672K c5t57d0 - - 20 17 175K 233K c5t58d0 - - 20 17 174K 231K c5t59d0 - - 20 17 175K 233K c5t60d0 - - 20 17 174K 231K raidz1 806G 1.38T 50123 401K 817K c5t61d0 - - 24 22 201K 283K c5t62d0 - - 24 22 201K 281K c5t63d0 - - 24 22 201K 283K c5t64d0 - - 24 22 201K 281K raidz1 794G 1.40T 47120 377K 786K c5t65d0 - - 22 23 194K 272K c5t66d0 - - 22 23 194K 270K c5t67d0 - - 22 23 194K 272K c5t68d0 - - 22 23 194K 270K raidz1 788G 1.40T 47115 376
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 1:20 PM, Brandon High wrote: > The 1TB and 2TB are manufactured in China, and have a very high > failure and DOA rate according to Newegg. All drives have a very high DOA rate according to Newegg. The way they package drives for shipping is exactly how Seagate specifically says NOT to pack them here http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=what-to-pack&vgnextoid=5c3a8bc90bf03210VgnVCM101a48090aRCRD I have stopped buying drives (and everything else) from Newegg as they cannot be bothered to properly pack items. It is worth the extra $5 per drive to buy them from CDW (who uses factory approved packaging). Note that I made this change 5 or so years ago and Newegg may have changed their packaging since then. What Newegg was doing is buying drives in the 20-pack from the manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and then stuffed in a box. No clamshell. I realized *something* was up when _every_ drive I looked at had a much higher report of DOA (or early failure) at the Newegg reviews than made any sense (and compared to other site's reviews). This is NOT to say that the drives in question really don't have a QC issue, just that the reports via Newegg are biased by Newegg's packing / shipping practices. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 2:29 PM, Paul Kraus wrote: > What Newegg was doing is buying drives in the 20-pack from the > manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and > then stuffed in a box. No clamshell. I realized *something* was up > when _every_ drive I looked at had a much higher report of DOA (or > early failure) at the Newegg reviews than made any sense (and compared > to other site's reviews). I picked up a single 5K3000 last week; I have not powered it on yet, but it came in a pseudo-OEM box with clamshells. I remember getting bubble-wrapped single drives from Newegg, and more than a fair share of those drives suffered early deaths or never powered on in the first place. No complaints about Amazon: Seagate drives came in Seagate OEM boxes with free shipping via Prime. (Probably not practical for you enterprise/professional guys, but nice for home users.) An order of 6 of the 5K3000 drives for work-related purposes shipped in a Styrofoam holder of sorts that was cut in half for my small number of drives (is this what 20-packs come in?). No idea what other packaging was around them (shipping and receiving opened the packages). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
Actually it is 100 or less, i.e. a 10 msec delay. -- Garrett D'Amore On May 16, 2011, at 11:13 AM, "Richard Elling" wrote: > On May 16, 2011, at 10:31 AM, Brandon High wrote: >> On Mon, May 16, 2011 at 8:33 AM, Richard Elling >> wrote: >>> As a rule of thumb, the resilvering disk is expected to max out at around >>> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect >>> the throttles or broken data path. >> >> My system was doing far less than 80 IOPS during resilver when I >> recently upgraded the drives. The older and newer drives were both 5k >> RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to >> be super fast. >> >> The worst resilver was 50 hours, the best was about 20 hours. This was >> just my home server, which is lightly used. The clients (2-3 CIFS >> clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS >> clients) are mostly idle and don't do a lot of writes. >> >> Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things >> up a bit, which suggests that the default values may be too >> conservative for some environments. > > I am more inclined to change the hires_tick value. The "delays" are in > units of clock ticks. For Solaris, the default clock tick is 10ms, that I will > argue is too large for modern disk systems. What this means is that when > the resilver, scrub, or memory throttle causes delays, the effective IOPS is > driven to 10 or less. Unfortunately, these values are guesses and are > probably suboptimal for various use cases. OTOH, the prior behaviour of > no resilver or scrub throttle was also considered a bad thing. > -- richard > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 2:35 PM, Krunal Desai wrote: > An order of 6 the 5K3000 drives for work-related purposes shipped in a > Styrofoam holder of sorts that was cut in half for my small number of > drives (is this what 20 pks come in?). No idea what other packaging > was around them (shipping and receiving opened the packages). Yes, the 20 packs I have seen are a big box with a foam insert with 2 columns of 10 'slots' that hold a drive in anti-static plastic. P.S. I buy from CDW (and previously from Newegg) for home not work. Work tends to buy from Sun/Oracle via a reseller. I can't afford new Sun/Oracle for home use. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16 at 14:29, Paul Kraus wrote: I have stopped buying drives (and everything else) from Newegg as they cannot be bothered to properly pack items. It is worth the extra $5 per drive to buy them from CDW (who uses factory approved packaging). Note that I made this change 5 or so years ago and Newegg may have changed their packaging since then. NewEgg packaging is exactly what you describe, unchanged in the last few years. Most recent newegg drive purchase was last week for me. --eric -- Eric D. Mudama edmud...@bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Still no way to recover a "corrupted" pool
On Fri, Apr 29, 2011 at 5:17 PM, Brandon High wrote: > On Fri, Apr 29, 2011 at 1:23 PM, Freddie Cash wrote: >> Running ZFSv28 on 64-bit FreeBSD 8-STABLE. > > I'd suggest trying to import the pool into snv_151a (Solaris 11 > Express), which is the reference and development platform for ZFS. Would not import in Solaris 11 Express. :( Could not even find any pools to import. Even when using "zpool import -d /dev/dsk" or any other import commands. Most likely due to using a FreeBSD-specific method of labelling the disks. I've since rebuilt the pool (a third time), using GPT partitions, labels on the partitions, and using the labels in the pool configuration. That should make it importable across OSes (FreeBSD, Solaris, Linux, etc). It's just frustrating that it's still possible to corrupt a pool in such a way that "nuke and pave" is the only solution. Especially when this same assertion was discussed in 2007 ... with no workaround or fix or whatnot implemented, four years later. What's most frustrating is that this is the third time I've built this pool due to corruption like this, within three months. :( -- Freddie Cash fjwc...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Still no way to recover a "corrupted" pool
On Mon, May 16, 2011 at 1:55 PM, Freddie Cash wrote: > Would not import in Solaris 11 Express. :( Could not even find any > pools to import. Even when using "zpool import -d /dev/dsk" or any > other import commands. Most likely due to using a FreeBSD-specific > method of labelling the disks. I think someone solved this before by creating a directory and making symlinks to the correct partition/slices on each disk. Then you can use 'zpool import -d /tmp/foo' to do the import. eg: # mkdir /tmp/fbsd # create a temp directory to point to the p0 partitions of the relevant disks # ln -s /dev/dsk/c8t1d0p0 /tmp/fbsd/ # ln -s /dev/dsk/c8t2d0p0 /tmp/fbsd/ # ln -s /dev/dsk/c8t3d0p0 /tmp/fbsd/ # ln -s /dev/dsk/c8t4d0p0 /tmp/fbsd/ # zpool import -d /tmp/fbsd/ $POOLNAME I've never used FreeBSD so I can't offer any advice about which device name is correct or if this will work. Posts from February 2010 "Import zpool from FreeBSD in OpenSolaris" indicate that you want p0. > It's just frustrating that it's still possible to corrupt a pool in > such a way that "nuke and pave" is the only solution. Especially when I'm not sure it was the only solution, it's just the one you followed. > What's most frustrating is that this is the third time I've built this > pool due to corruption like this, within three months. :( You may have an underlying hardware problem, or there could be a bug in the FreeBSD implementation that you're tripping over. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
> You mentioned that the pool was somewhat full, can you send the output > of 'zpool iostat -v pool0'? You can also try doing the following to > reduce 'metaslab_min_alloc_size' to 4K: > > echo "metaslab_min_alloc_size/Z 1000" | mdb -kw So just changing that setting moved my write rate from 40MB/s to 175MB/s. That's a huge improvement. It's still not as high as I used to see on this box- but at least now the array is useable again. Thanks for the suggestion! Any other tunables I should be taking a look at? -Don ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
> Running a zpool scrub on our production pool is showing a scrub rate > of about 400K/s. (When this pool was first set up we saw rates in the > MB/s range during a scrub). Usually, something like this is caused by a bad drive. Can you post iostat -en output? Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
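For anyone unfamiliar with the flags Roy asks about, -e prints cumulative soft/hard/transport error counters per device and -n uses the cXtYdZ device names, so a failing drive usually stands out at a glance:

# Summary error counters per device
iostat -en
# More detail per device, including vendor, serial number and error breakdown
iostat -En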
Re: [zfs-discuss] 350TB+ storage solution
2011-05-16 9:14, Richard Elling wrote: On May 15, 2011, at 10:18 AM, Jim Klimov wrote: Hi, Very interesting suggestions as I'm contemplating a Supermicro-based server for my work as well, but probably on a lower budget as a backup store for an aging Thumper (not as its superior replacement). Still, I have a couple of questions regarding your raidz layout recommendation. On one hand, I've read that as current drives get larger (while their random IOPS/MBPS don't grow nearly as fast with new generations), it is becoming more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for vdevs made of many disks - a dozen or so. When a drive fails, you still have two redundant parities, and with a resilver window expected to be in the hours-if-not-days range, I would want that airbag, to say the least. You know, failures rarely come one by one ;) Not to worry. If you add another level of redundancy, the data protection is improved by orders of magnitude. If the resilver time increases, the effect on data protection is reduced by a relatively small divisor. To get some sense of this, the MTBF is often 1,000,000,000 hours and there are only 24 hours in a day. If MTBFs were real, we'd never see disks failing within a year ;) Problem is, these values seem to be determined in an ivory-tower lab. An expensive-vendor edition of a drive running in a cooled data center with shock absorbers and other nice features does often live a lot longer than a similar OEM enterprise or consumer drive running in an apartment with varying weather around and often overheating and randomly vibrating with a dozen other disks rotating in the same box. The ramble about expensive-vendor drive editions comes from my memory of some forum or blog discussion which I can't point to now either, which suggested that vendors like Sun do not charge 5x-10x the price of the same label of OEM drive just for a nice corporate logo stamped onto the disk. Vendors were said to burn in the drives in their labs for like half a year or a year before putting the survivors on the market. This implies that some of the drives did not survive the burn-in period, and indeed the MTBF for the remaining ones is higher because "infancy death" due to manufacturing problems soon after arrival at the end customer is unlikely for these particular tested devices. The long burn-in times were also said to be part of the reason why vendors never sell the biggest disks available on the market (does any vendor sell 3TB with their own brand already? Sun-Oracle? IBM? HP?) This may be obscured as a "certification process" which occasionally takes about as long - to see if the newest and greatest disks die within a year or so. Another implied idea in that discussion was that the vendors can influence OEMs in the choice of components, an example in the thread being about different grades of steel for the ball bearings. Such choices can drive the price up with a reason - disks like that are more expensive to produce - but also increase their reliability. In fact, I've had very few Sun disks fail in the boxes I've managed over 10 years; all I can remember now were two or three 2.5" 72GB Fujitsus with a Sun brand. Still, we have another dozen of those running so far for several years. So yes, I can believe that Big Vendor Brand disks can boast huge MTBFs and prove that with a track record, and such drives are often replaced not because of a breakdown, but rather as a precaution, and because of "moral aging", such as low speed and small volume.
But for the rest of us (like Home-ZFS users) such MTBF numbers are as fantastic as the Big Vendor prices, and unachievable for any number of reasons, starting with use of cheaper and potentially worse hardware from the beginning, and non-"orchard" conditions of running the machines... I do have some 5-year-old disks running in computers daily and still alive, but I have about as many which died young, sometimes even within the warranty period ;) On the other hand, I've recently seen many recommendations that in a RAIDZ* drive set, the number of data disks should be a power of two - so that ZFS blocks/stripes and those of its users (like databases) which are inclined to use 2^N-sized blocks can often be accessed in a single IO burst across all drives, and not in "one and one-quarter IO" on average, which might delay IOs to other stripes while some of the disks in a vdev are busy processing leftovers of a previous request, and others are waiting for their peers. I've never heard of this and it doesn't pass the sniff test. Can you cite a source? I was trying to find an "authoritative" link today but failed. I know I've read this many times over the past couple of months, but this may still be an "urban legend" or even FUD, retold many times... In fact, today I came across old posts from Jeff Bonwick, where he explains the disk usage and "ZFS striping" which is not like usual RAID striping.
Re: [zfs-discuss] Extremely slow zpool scrub performance
2011-05-16 22:21, George Wilson wrote: echo "metaslab_min_alloc_size/Z 1000" | mdb -kw Thanks, this also boosted my home box from hundreds of KB/s into the several MB/s range, which is much better (I'm evacuating data from a pool hosted in a volume inside my main pool, and the bottleneck is quite substantial) - now I'll get rid of this experiment much faster ;) -- Jim Klimov, CTO, JSC COS&HT | +7-903-7705859 (cellular) mailto:jimkli...@cos.ru | CC: ad...@cos.ru, jimkli...@mail.ru | () ascii ribbon campaign - against html mail | /\ - against microsoft attachments ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Paul Kraus > > All drives have a very high DOA rate according to Newegg. The > way they package drives for shipping is exactly how Seagate > specifically says NOT to pack them here 8 months ago, newegg says they've changed this practice. http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
As a follow-up: I ran the same dd test as earlier, but this time I stopped the scrub: pool0 14.1T 25.4T 88 4.81K 709K 262M pool0 14.1T 25.4T 104 3.99K 836K 248M pool0 14.1T 25.4T 360 5.01K 2.81M 230M pool0 14.1T 25.4T 305 5.69K 2.38M 231M pool0 14.1T 25.4T 389 5.85K 3.05M 293M pool0 14.1T 25.4T 376 5.38K 2.94M 328M pool0 14.1T 25.4T 295 3.29K 2.31M 286M ~# dd if=/dev/zero of=/pool0/ds.test bs=1024k count=2000 2000+0 records in 2000+0 records out 2097152000 bytes (2.1 GB) copied, 6.50394 s, 322 MB/s Stopping the scrub seemed to increase my performance by another 60% over the highest numbers I saw just from the metaslab change earlier (that peak was 201 MB/s). This is the performance I was seeing out of this array when newly built. I have two follow-up questions: 1. We changed the metaslab size from 10M to 4K - that's a pretty drastic change. Is there some median value that should be used instead and/or is there a downside to using such a small metaslab size? 2. I'm still confused by the poor scrub performance and its impact on the write performance. I'm not seeing a lot of IOs or processor load, so I'm wondering what else I might be missing. -Don ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
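One way to attack the second question is to compare per-disk utilization and service times with and without the scrub running (pool name and intervals are arbitrary):

# Per-device %busy and service times, 5-second samples
iostat -xn 5
# Pool-level throughput over the same interval
zpool iostat -v pool0 5
# Stop the scrub for a comparison run; note -s cancels it outright,
# a later "zpool scrub pool0" starts over from the beginning
zpool scrub -s pool0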
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16 at 21:55, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Paul Kraus All drives have a very high DOA rate according to Newegg. The way they package drives for shipping is exactly how Seagate specifically says NOT to pack them here 8 months ago, newegg says they've changed this practice. http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167 The drives I just bought were half packed in white foam then wrapped in bubble wrap. Not all edges were protected with more than bubble wrap. --eric -- Eric D. Mudama edmud...@bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss