Re: [zfs-discuss] ZFS & ZPOOL => trash

2011-05-16 Thread Jim Klimov
Well, if this is not a root disk and the server boots at least to single-user, 
as you wrote above, you can try to disable auto-import of this pool.

The easiest approach is to disable auto-import of all pools by removing or renaming 
the file /etc/zfs/zpool.cache - it is the list of known pools for automatic 
import. Without it, only your root pool will be imported at boot; all other pools 
(the healthy ones) will then have to be re-imported manually, and ZFS will re-cache 
them in this file.

Then your server will come up (minus the broken pool and the local zone living in it), 
and you can go about fixing it. Have you already tried the "zpool import -F" command?
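
Roughly, the sequence would be something like this (untested here, and "mypool" is 
just a placeholder for your damaged pool's name):

  # mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad   # nothing auto-imports at next boot
  # reboot
  # zpool import             # list the pools visible on the attached devices
  # zpool import -F mypool   # recovery import; discards the last few transactions if needed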

Good luck,
//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Sandon Van Ness

On 05/15/2011 09:58 PM, Richard Elling wrote:

>> In one of my systems, I have 1TB mirrors, 70% full, which can be
>> sequentially completely read/written in 2 hrs.  But the resilver took 12
>> hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
>> using 10 disks + 3 parity, and a usage pattern similar to mine, your
>> resilver time would have been minimum 10 days,
>
> bollix
>
>> likely approaching 20 or 30
>> days.  (Because you wouldn't get 2-3 weeks of consecutive idle time, and the
>> random access time for a raidz approaches 2x the random access time of a
>> mirror.)
>
> totally untrue
>
>> BTW, the reason I chose 10+3 disks above was just because it makes
>> calculation easy.  It's easy to multiply by 10.  I'm not suggesting using
>> that configuration.  You may notice that I don't recommend raidz for most
>> situations.  I endorse mirrors because they minimize resilver time (and
>> maximize performance in general).  Resilver time is a problem for ZFS, which
>> they may fix someday.
>
> Resilver time is not a significant problem with ZFS. Resilver time is a much
> bigger problem with traditional RAID systems. In any case, it is bad systems
> engineering to optimize a system for best resilver time.
>  -- richard


Actually I have seen resilvers take a very long time (weeks) on 
solaris/raidz2, whereas I almost never see a hardware RAID controller take 
more than a day or two. In one case I thrashed the disks absolutely as 
hard as I could (hardware controller) and finally managed to get the 
rebuild to take almost 1 week. Here is an example of one right now:


   pool: raid3060
   state: ONLINE
   status: One or more devices is currently being resilvered. The pool will
   continue to function, possibly in a degraded state.
   action: Wait for the resilver to complete.
   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
   config:

ZFS resilver can take a very long time depending on your usage pattern. 
I do disagree with some things he said though... like a 1TB drive being 
able to be read/written in 2 hours? I seriously doubt this. Just reading 
1 TB in 2 hours means an average speed of over 130 megabytes/sec.
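
To spell out the arithmetic (taking 1 TB as 10^12 bytes):

  10^12 bytes / 7200 s  ~=  139 MB/s  ~=  1.1 Gbit/s

and that has to be sustained over the entire surface, not just the fast outer tracks.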


Only really new 1TB drives will even hit that kind of speed at the 
beginning of the platter, and the rate drops to closer to 100 
MB/sec at the end of the drive. And that is the best-case scenario. I know 
1TB drives (when they first came out) took around 4-5 hours to do a 
complete read of all data on the disk at full speed.


There is definitely no way to be that fast reading *and* writing 1TB of data 
on the same drive - unless you count reading from one drive and writing to the 
other. 3 hours is a much more likely figure, and that is best case.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Giovanni Tirloni
On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote:

>
> Actually I have seen resilvers take a very long time (weeks) on
> solaris/raidz2 when I almost never see a hardware raid controller take more
> than a day or two. In one case i thrashed the disks absolutely as hard as I
> could (hardware controller) and finally was able to get the rebuild to take
> almost 1 week.. Here is an example of one right now:
>
>   pool: raid3060
>   state: ONLINE
>   status: One or more devices is currently being resilvered. The pool will
>   continue to function, possibly in a degraded state.
>   action: Wait for the resilver to complete.
>   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
>   config:
>
>
Resilver has been a problem with RAIDZ volumes for a while. I've routinely
seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All
disks are maxed out on IOPS while still reading only 1-2MB/s, and there are
rarely any writes. I've written about it before here (and provided data).

My only guess is that fragmentation is a real problem in a scrub/resilver
situation, but whenever the conversation turns to pointing out weaknesses in ZFS
we start seeing "that is not a problem" comments. With the 7000-series appliance
I've heard that a 900hr estimated resilver time was "normal" and
"everything is working as expected". I can't help but think there is some
walled-garden syndrome floating around.

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Karl Wagner
I have to agree. ZFS needs a more intelligent scrub/resilver algorithm, which 
can 'sequentialise' the process. 
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Giovanni Tirloni  wrote:

On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness  wrote:


Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2 
when I almost never see a hardware raid controller take more than a day or two. 
In one case i thrashed the disks absolutely as hard as I could (hardware 
controller) and finally was able to get the rebuild to take almost 1 week.. 
Here is an example of one right now:

  pool: raid3060
  state: ONLINE
  status: One or more devices is currently being resilvered. The pool will
  continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
  scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
  config:


Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen 
it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks 
are maxed out on IOPS while still reading 1-2MB/s and there rarely is any 
writes. I've written about it before here (and provided data). 

My only guess is that fragmentation is a real problem in a scrub/resilver 
situation but whenever the conversation changes to point weaknesses in ZFS we 
start seeing "that is not a problem" comments. With the 7000s appliance I've 
heard that the 900hr estimated resilver time was "normal" and "everything is 
working as expected". Can't help but think there is some walled garden syndrome 
floating around.

-- 
Giovanni Tirloni

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Edward Ned Harvey
> From: Richard Elling [mailto:richard.ell...@gmail.com]
> 
> >  In one of my systems, I have 1TB mirrors, 70% full, which can be
> > sequentially completely read/written in 2 hrs.  But the resilver took 12
> > hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
> > using 10 disks + 3 parity, and a usage pattern similar to mine, your
> > resilver time would have been minimum 10 days,
> 
> bollix
> 
> Resilver time is not a significant problem with ZFS. Resilver time is a much
> bigger problem with traditional RAID systems. In any case, it is bad systems
> engineering to optimize a system for best resilver time.

Because RE seems to be emotionally involved with ZFS resilver times, I don't
believe it's going to be productive for me to try addressing his off-hand
comments.  Instead, I'm only going to say this much:

In my system mentioned above, a complete disk can be copied to another
complete disk, sequentially, in 131 minutes.  But during idle time the resilver
took 12 hours, because ZFS resilver only touches the used parts of the disk, in
essentially random order.  So a ZFS resilver often takes many times longer
than a complete hardware-based whole-disk rebuild.
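
If anyone wants to reproduce the sequential figure, a whole-disk copy is just a dd 
between the raw devices - a sketch only, the device names below are placeholders, and 
the target disk gets overwritten:

  # time dd if=/dev/rdsk/c1t0d0p0 of=/dev/rdsk/c1t1d0p0 bs=1024k

Compare that wall-clock time against the resilver time reported by "zpool status".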

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Edward Ned Harvey
> From: Sandon Van Ness [mailto:san...@van-ness.com]
> 
> ZFS resilver can take a very long time depending on your usage pattern.
> I do disagree with some things he said though... like a 1TB drive being
> able to be read/written in 2 hours? I seriously doubt this. Just reading
> 1 TB in 2 hours means an average speed of over 130 megabytes/sec.

1Gbit/sec sustainable sequential disk speed is not uncommon these days, and
it is in fact the performance of the disks in the system in question.  SATA
7.2krpm disks...  Not even special disks.  Just typical boring normal disks.


> Definitely no way to be that fast with reading *and* writing 1TB of data
> to the drive. I guess if you count reading from one and writing to the
> other. 3 hours is a much more likely figure and best case.

No need to read & write from the same drive.  You can read from one drive
and write to the other simultaneously at full speed.  If there is any
performance difference between read & write on these drives, it's not
measurable.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
> Can you share your 'zpool status' output for both pools?
Faster, smaller server:
~# zpool status pool0
 pool: pool0
 state: ONLINE
 scan: scrub repaired 0 in 2h18m with 0 errors on Sat May 14 13:28:58 2011

Much larger, more capable server:
~# zpool status pool0 | head
 pool: pool0
 state: ONLINE
 scan: scrub in progress since Fri May 13 14:04:46 2011
173G scanned out of 14.2T at 737K/s, (scan is slow, no estimated time)
43K repaired, 1.19% done

The only other relevant line is:
c5t9d0  ONLINE   0 0 0  (repairing)

(That's new as of this morning- though it was still very slow before that)

> Also you may want to run the following a few times in a loop and
> provide the output:
>
> # echo "::walk spa | ::print spa_t spa_name spa_last_io
> spa_scrub_inflight" | mdb -k
~# echo "::walk spa | ::print spa_t spa_name spa_last_io
> spa_scrub_inflight" | mdb -k
spa_name = [ "pool0" ]
spa_last_io = 0x159b275a
spa_name = [ "rpool" ]
spa_last_io = 0x159b210a
mdb: failed to dereference symbol: unknown symbol name

I'm pretty sure that's not the output you were looking for :)

On the same theme- is there a good reference for all of the various
ZFS debugging commands and mdb options?

I'd love to spend a lot of time just looking at the data available to
me but every time I turn around someone suggests a new and interesting
mdb query I've never seen before.

Thanks,
-Don
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Richard Elling
On May 16, 2011, at 5:02 AM, Sandon Van Ness  wrote:

> On 05/15/2011 09:58 PM, Richard Elling wrote:
>>>  In one of my systems, I have 1TB mirrors, 70% full, which can be
>>> sequentially completely read/written in 2 hrs.  But the resilver took 12
>>> hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
>>> using 10 disks + 3 parity, and a usage pattern similar to mine, your
>>> resilver time would have been minimum 10 days,
>> bollix
>> 
>>> likely approaching 20 or 30
>>> days.  (Because you wouldn't get 2-3 weeks of consecutive idle time, and the
>>> random access time for a raidz approaches 2x the random access time of a
>>> mirror.)
>> totally untrue
>> 
>>> BTW, the reason I chose 10+3 disks above was just because it makes
>>> calculation easy.  It's easy to multiply by 10.  I'm not suggesting using
>>> that configuration.  You may notice that I don't recommend raidz for most
>>> situations.  I endorse mirrors because they minimize resilver time (and
>>> maximize performance in general).  Resilver time is a problem for ZFS, which
>>> they may fix someday.
>> Resilver time is not a significant problem with ZFS. Resilver time is a much
>> bigger problem with traditional RAID systems. In any case, it is bad systems
>> engineering to optimize a system for best resilver time.
>>  -- richard
> 
> Actually I have seen resilvers take a very long time (weeks) on 
> solaris/raidz2 when I almost never see a hardware raid controller take more 
> than a day or two. In one case i thrashed the disks absolutely as hard as I 
> could (hardware controller) and finally was able to get the rebuild to take 
> almost 1 week.. Here is an example of one right now:
> 
>   pool: raid3060
>   state: ONLINE
>   status: One or more devices is currently being resilvered. The pool will
>   continue to function, possibly in a degraded state.
>   action: Wait for the resilver to complete.
>   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
>   config:

I have seen worse cases, but the root cause was hardware failures
that are not reported by zpool status. Have you checked the health
of the disk transports? Hint: fmdump -e

Also, what zpool version is this? There were improvements made in the
prefetch and the introduction of throttles last year. One makes it faster,
the other intentionally slows it down.

As a rule of thumb, the resilvering disk is expected to max out at around
80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
the throttles or broken data path.
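
Something along these lines is usually enough to check both (exact output varies by 
release; this is a sketch, not a recipe):

  # fmdump -e | tail -20    # recent error reports, e.g. transport or device ereports
  # iostat -xn 10 6         # watch r/s + w/s for the resilvering disk for a minute

If the resilvering disk stays well under ~80 IOPS with near-idle wait/actv columns, 
suspect the throttle; if ereports show up, suspect the data path.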
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread John Doe
following are some thoughts if it's not too late:

> 1 SuperMicro  847E1-R1400LPB
I guess you meant the 847E16-R1400LPB; the SAS1 version makes no sense

> 1 SuperMicro  H8DG6-F
not the best choice, see below why

> 171   Hitachi 7K3000 3TB
I'd go for the more environmentally friendly Ultrastar 5K3000 version - with 
that many drives you won't mind the slower rotation but WILL notice a difference 
in power and cooling cost

> 1 LSI SAS 9202-16e
this is really only a very expensive gadget to be honest; there's really no 
point to it - especially once you start looking for the necessary cables, which 
use a connector that is still in "draft" specification...

stick to the excellent LSI SAS9200-8e, of which you will need at least 3 in 
your setup, one to connect each of the 3 JBODs - with them filled with fast 
drives like the ones you chose, you will need two links per JBOD (one for the front 
and one for the back backplane), as daisy-chaining the backplanes together would 
oversaturate a single link.

if you want to take advantage of the dual expanders on your JBOD backplanes 
for additional redundancy in case of expander or controller failure, you will 
need 6 of those LSI SAS9200-8e - this is where your board isn't ideal, as it 
has a 3/1/2 PCIe x16/x8/x4 configuration while you'd need 6 PCIe x8 slots - something 
the X8DTH-6F will provide, along with an onboard LSI SAS2008-based HBA for the 
two backplanes in the server case.

> 1 LSI SAS 9211-4i
> 2 OCZ 64GB SSD Vertex 3
> 2 OCZ 256GB SSD Vertex 3
if these are meant to be connected together and used as ZIL+L2ARC, then I'd 
STRONGLY urge you to get the following instead:
1x LSI MegaRAID SAS 9265-8i 
1x LSI FastPath licence
4-8x 120GB or 240GB Vertex 3 Max IOPS Edition, whatever suits the budget

this solution allows you to push around 400k IOPS to the cache, more than 
likely way more than the stated application of the system will need

> 1 NeterionX3120SR0001
I don't know this card personally but since it's not listed as supported 
(http://www.sun.com/io_technologies/nic/NIC1.html) I'd be careful

> My question is what is the optimum way of dividing
> these drives across vdevs?
I would do 14 x 12 drive raidz2 + 3 spare = 140*3TB = ~382TiB usable
this would allow for a logical mapping of drives to vdevs, giving you in each 
case 2 vdevs in the front and 1 in the back with the 9 drive blocks in the back 
of the JBODs used as 3 x 4/4/1, giving the remaining 2 x 12 drive vdevs plus 
one spare per case
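
just to show where that number comes from:

  14 vdevs x 12 drives = 168 drives, plus 3 hot spares = 171 drives total
  each 12-drive raidz2 = 10 data drives
  14 x 10 x 3TB = 420TB ~= 382TiB (before ZFS metadata overhead)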

> I could also go with 2TB drives and add an extra 45
> JBOD chassis. This would significantly decrease cost,
> but I'm running a gauntlet by getting very close to
> minimum useable space.
> 
> 12 x 18 drive raidz2
I would never do vdevs that large, it's just an accident waiting to happen!


hopefully these recommendations help you with your project. in any case, it's 
huge - the biggest system I worked on (which I actually have at home, go 
figure) only has a bit over 100TB in the following configuration:
6 x 12 drive raidz2 of Hitachi 5K3000 2TB
3 Norco 4224 with a HP SAS Expander in each
Supermicro X8DTi-LN4F with 3x LSI SAS9200-8e

so yeah, I based my thoughts on my own system but considering that it's been 
running smoothly for a while now (and that I had a very similar setup with 
smaller drives and older controllers before), I'm confident in my suggestions

Regards from Switzerland,
voyman
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS & ZPOOL => trash

2011-05-16 Thread Andrea Ciuffoli
All of these corrupted zpools are the roots of local zones
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Sat, May 14, 2011 at 11:20 PM, John Doe  wrote:
>> 171   Hitachi 7K3000 3TB
> I'd go for the more environmentally friendly Ultrastar 5K3000 version - with 
> that many drives you wont mind the slower rotation but WILL notice a 
> difference in power and cooling cost

A word of caution - the Hitachi Deskstar 5K3000 drives in 1TB and 2TB
capacities are different from the 3TB.

The 1TB and 2TB are manufactured in China, and have a very high
failure and DOA rate according to Newegg.

The 3TB drives come off the same production line as the Ultrastar
5K3000 in Thailand and may be more reliable.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread George Wilson
Don,

Can you send the entire 'zpool status' output? I wanted to see your
pool configuration. Also run the mdb command in a loop (at least 5
times) so we can see if spa_last_io is changing. I'm surprised you're
not finding the symbol for 'spa_scrub_inflight' too.  Can you check
that you didn't mistype this?
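
Something like this should do it - kept on one line so that no stray line break sneaks 
into the dcmd when pasting:

# for i in 1 2 3 4 5; do echo "::walk spa | ::print spa_t spa_name spa_last_io spa_scrub_inflight" | mdb -k; sleep 1; done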

Thanks,
George

On Mon, May 16, 2011 at 7:41 AM, Donald Stahl  wrote:
>> Can you share your 'zpool status' output for both pools?
> Faster, smaller server:
> ~# zpool status pool0
>  pool: pool0
>  state: ONLINE
>  scan: scrub repaired 0 in 2h18m with 0 errors on Sat May 14 13:28:58 2011
>
> Much larger, more capable server:
> ~# zpool status pool0 | head
>  pool: pool0
>  state: ONLINE
>  scan: scrub in progress since Fri May 13 14:04:46 2011
>    173G scanned out of 14.2T at 737K/s, (scan is slow, no estimated time)
>    43K repaired, 1.19% done
>
> The only other relevant line is:
>            c5t9d0          ONLINE       0     0     0  (repairing)
>
> (That's new as of this morning- though it was still very slow before that)
>
>> Also you may want to run the following a few times in a loop and
>> provide the output:
>>
>> # echo "::walk spa | ::print spa_t spa_name spa_last_io
>> spa_scrub_inflight" | mdb -k
> ~# echo "::walk spa | ::print spa_t spa_name spa_last_io
>> spa_scrub_inflight" | mdb -k
> spa_name = [ "pool0" ]
> spa_last_io = 0x159b275a
> spa_name = [ "rpool" ]
> spa_last_io = 0x159b210a
> mdb: failed to dereference symbol: unknown symbol name
>
> I'm pretty sure that's not the output you were looking for :)
>
> On the same theme- is there a good reference for all of the various
> ZFS debugging commands and mdb options?
>
> I'd love to spend a lot of time just looking at the data available to
> me but every time I turn around someone suggests a new and interesting
> mdb query I've never seen before.
>
> Thanks,
> -Don
>



-- 
George Wilson



M: +1.770.853.8523
F: +1.650.494.1676
275 Middlefield Road, Suite 50
Menlo Park, CA 94025
http://www.delphix.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Mon, May 16, 2011 at 8:33 AM, Richard Elling
 wrote:
> As a rule of thumb, the resilvering disk is expected to max out at around
> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
> the throttles or broken data path.

My system was doing far less than 80 IOPS during resilver when I
recently upgraded the drives. The older and newer drives were both 5k
RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
be super fast.

The worst resilver was 50 hours, the best was about 20 hours. This was
just my home server, which is lightly used. The clients (2-3 CIFS
clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
clients) are mostly idle and don't do a lot of writes.

Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
up a bit, which suggests that the default values may be too
conservative for some environments.
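
For anyone curious, these are plain kernel variables, so the changes look something 
like the lines below (a sketch only - the values are examples rather than 
recommendations, the change is live, and it reverts at reboot):

# echo "zfs_resilver_delay/D" | mdb -k                # print the current delay (clock ticks)
# echo "zfs_resilver_min_time_ms/D" | mdb -k          # print the per-txg resilver time budget
# echo "zfs_resilver_delay/W 0" | mdb -kw             # drop the per-I/O delay
# echo "zfs_resilver_min_time_ms/W 0t5000" | mdb -kw  # allow 5000ms of resilver work per txg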

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
> Can you send the entire 'zpool status' output? I wanted to see your
> pool configuration. Also run the mdb command in a loop (at least 5
> tiimes) so we can see if spa_last_io is changing. I'm surprised you're
> not finding the symbol for 'spa_scrub_inflight' too.  Can you check
> that you didn't mistype this?
I copied and pasted to make sure that wasn't the issue :)

I will run it in a loop this time. I didn't do it last time because of
the error.

This box was running only raidz sets originally. After running into
performance problems we added a bunch of mirrors to try to improve the
iops. The logs are not mirrored right now as we were testing adding
the other two as cache disks to see if that helped. We've also tested
using a ramdisk ZIL to see if that made any difference- it did not.

The performance on this box was excellent until it started to fill up
(somewhere around 70%)- then performance degraded significantly. We
added more disks, and copied the data around to rebalance things. It
seems to have helped somewhat- but it is nothing like when we first
created the array.

config:

NAME        STATE     READ WRITE CKSUM
pool0   ONLINE   0 0 0
  raidz1-0  ONLINE   0 0 0
c5t5d0  ONLINE   0 0 0
c5t6d0  ONLINE   0 0 0
c5t7d0  ONLINE   0 0 0
c5t8d0  ONLINE   0 0 0
  raidz1-1  ONLINE   0 0 0
c5t9d0  ONLINE   0 0 0  (repairing)
c5t10d0 ONLINE   0 0 0
c5t11d0 ONLINE   0 0 0
c5t12d0 ONLINE   0 0 0
  raidz1-2  ONLINE   0 0 0
c5t13d0 ONLINE   0 0 0
c5t14d0 ONLINE   0 0 0
c5t15d0 ONLINE   0 0 0
c5t16d0 ONLINE   0 0 0
  raidz1-3  ONLINE   0 0 0
c5t21d0 ONLINE   0 0 0
c5t22d0 ONLINE   0 0 0
c5t23d0 ONLINE   0 0 0
c5t24d0 ONLINE   0 0 0
  raidz1-4  ONLINE   0 0 0
c5t25d0 ONLINE   0 0 0
c5t26d0 ONLINE   0 0 0
c5t27d0 ONLINE   0 0 0
c5t28d0 ONLINE   0 0 0
  raidz1-5  ONLINE   0 0 0
c5t29d0 ONLINE   0 0 0
c5t30d0 ONLINE   0 0 0
c5t31d0 ONLINE   0 0 0
c5t32d0 ONLINE   0 0 0
  raidz1-6  ONLINE   0 0 0
c5t33d0 ONLINE   0 0 0
c5t34d0 ONLINE   0 0 0
c5t35d0 ONLINE   0 0 0
c5t36d0 ONLINE   0 0 0
  raidz1-7  ONLINE   0 0 0
c5t37d0 ONLINE   0 0 0
c5t38d0 ONLINE   0 0 0
c5t39d0 ONLINE   0 0 0
c5t40d0 ONLINE   0 0 0
  raidz1-8  ONLINE   0 0 0
c5t41d0 ONLINE   0 0 0
c5t42d0 ONLINE   0 0 0
c5t43d0 ONLINE   0 0 0
c5t44d0 ONLINE   0 0 0
  raidz1-10 ONLINE   0 0 0
c5t45d0 ONLINE   0 0 0
c5t46d0 ONLINE   0 0 0
c5t47d0 ONLINE   0 0 0
c5t48d0 ONLINE   0 0 0
  raidz1-11 ONLINE   0 0 0
c5t49d0 ONLINE   0 0 0
c5t50d0 ONLINE   0 0 0
c5t51d0 ONLINE   0 0 0
c5t52d0 ONLINE   0 0 0
  raidz1-12 ONLINE   0 0 0
c5t53d0 ONLINE   0 0 0
c5t54d0 ONLINE   0 0 0
c5t55d0 ONLINE   0 0 0
c5t56d0 ONLINE   0 0 0
  raidz1-13 ONLINE   0 0 0
c5t57d0 ONLINE   0 0 0
c5t58d0 ONLINE   0 0 0
c5t59d0 ONLINE   0 0 0
c5t60d0 ONLINE   0 0 0
  raidz1-14 ONLINE   0 0 0
c5t61d0 ONLINE   0 0 0
c5t62d0 ONLINE   0 0 0
c5t63d0 ONLI

Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
> I copy and pasted to make sure that wasn't the issue :)
Which, ironically, turned out to be the problem- there was an extra
carriage return in there that mdb did not like:

Here is the output:

spa_name = [ "pool0" ]
spa_last_io = 0x82721a4
spa_scrub_inflight = 0x1

spa_name = [ "pool0" ]
spa_last_io = 0x8272240
spa_scrub_inflight = 0x1

spa_name = [ "pool0" ]
spa_last_io = 0x82722f0
spa_scrub_inflight = 0x1

spa_name = [ "pool0" ]
spa_last_io = 0x827239e
spa_scrub_inflight = 0

spa_name = [ "pool0" ]
spa_last_io = 0x8272441
spa_scrub_inflight = 0x1
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
Here is another example of the performance problems I am seeing:

~# dd if=/dev/zero of=/pool0/ds.test bs=1024k count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 56.2184 s, 37.3 MB/s

37MB/s seems like some sort of bad joke for all these disks. I can
write the same amount of data to a set of 6 SAS disks on a Dell
PERC6/i at a rate of 160MB/s, and those disks are hosting 25 VMs and a
lot more IOPS than this box.

zpool iostat during the same time shows:
pool0       14.2T  25.3T    124  1.30K   981K  4.02M
pool0       14.2T  25.3T    277    914  2.16M  23.2M
pool0       14.2T  25.3T     65  4.03K   526K  90.2M
pool0       14.2T  25.3T     18  1.76K   136K  6.81M
pool0       14.2T  25.3T    460  5.55K  3.60M   111M
pool0       14.2T  25.3T    160      0  1.24M      0
pool0       14.2T  25.3T    182  2.34K  1.41M  33.3M

The zeros and other low numbers don't make any sense. And as I
mentioned, the busy percent and service times of these disks are never
abnormally high - especially when compared to the much smaller, better
performing pool I have.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Richard Elling
On May 16, 2011, at 10:31 AM, Brandon High wrote:
> On Mon, May 16, 2011 at 8:33 AM, Richard Elling
>  wrote:
>> As a rule of thumb, the resilvering disk is expected to max out at around
>> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
>> the throttles or broken data path.
> 
> My system was doing far less than 80 IOPS during resilver when I
> recently upgraded the drives. The older and newer drives were both 5k
> RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
> be super fast.
> 
> The worst resilver was 50 hours, the best was about 20 hours. This was
> just my home server, which is lightly used. The clients (2-3 CIFS
> clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
> clients) are mostly idle and don't do a lot of writes.
> 
> Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
> up a bit, which suggests that the default values may be too
> conservative for some environments.

I am more inclined to change the hires_tick value. The "delays" are in 
units of clock ticks. For Solaris, the default clock tick is 10ms, which I will
argue is too large for modern disk systems. What this means is that when 
the resilver, scrub, or memory throttle causes delays, the effective IOPS is
driven to 10 or less. Unfortunately, these values are guesses and are 
probably suboptimal for various use cases. OTOH, the prior behaviour of
no resilver or scrub throttle was also considered a bad thing.
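
For reference, hires_tick itself is just an /etc/system tunable, so the experiment is 
a one-liner plus a reboot - a sketch only, and note that it raises the clock rate for 
the whole system:

set hires_tick=1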
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread George Wilson
You mentioned that the pool was somewhat full, can you send the output
of 'zpool iostat -v pool0'? You can also try doing the following to
reduce 'metaslab_min_alloc_size' to 4K (the argument to /Z is hex, so 1000 = 0x1000 = 4096 bytes):

echo "metaslab_min_alloc_size/Z 1000" | mdb -kw

NOTE: This will change the running system so you may want to make this
change during off-peak hours.

Then check your performance and see if it makes a difference.
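
If it helps and you want the setting to survive a reboot, the /etc/system equivalent 
should be something like this (untested sketch - verify against your release first):

set zfs:metaslab_min_alloc_size=0x1000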

- George


On Mon, May 16, 2011 at 10:58 AM, Donald Stahl  wrote:
> Here is another example of the performance problems I am seeing:
>
> ~# dd if=/dev/zero of=/pool0/ds.test bs=1024k count=2000
> 2000+0 records in
> 2000+0 records out
> 2097152000 bytes (2.1 GB) copied, 56.2184 s, 37.3 MB/s
>
> 37MB/s seems like some sort of bad joke for all these disks. I can
> write the same amount of data to a set of 6 SAS disks on a Dell
> PERC6/i at a rate of 160MB/s and those disks are hosting 25 vm's and a
> lot more IOPS than this box.
>
> zpool iostat during the same time shows:
> pool0       14.2T  25.3T    124  1.30K   981K  4.02M
> pool0       14.2T  25.3T    277    914  2.16M  23.2M
> pool0       14.2T  25.3T     65  4.03K   526K  90.2M
> pool0       14.2T  25.3T     18  1.76K   136K  6.81M
> pool0       14.2T  25.3T    460  5.55K  3.60M   111M
> pool0       14.2T  25.3T    160      0  1.24M      0
> pool0       14.2T  25.3T    182  2.34K  1.41M  33.3M
>
> The zero's and other low numbers don't make any sense. And as I
> mentioned- the busy percent and service times of these disks are never
> abnormally high- especially when compared to the much smaller, better
> performing pool I have.
>



-- 
George Wilson



M: +1.770.853.8523
F: +1.650.494.1676
275 Middlefield Road, Suite 50
Menlo Park, CA 94025
http://www.delphix.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Krunal Desai
On Mon, May 16, 2011 at 1:20 PM, Brandon High  wrote:
> The 1TB and 2TB are manufactured in China, and have a very high
> failure and DOA rate according to Newegg.
>
> The 3TB drives come off the same production line as the Ultrastar
> 5K3000 in Thailand and may be more reliable.

Thanks for the heads up, I was thinking about 5K3000s to finish out my
build (currently have Barracuda LPs). I do wonder how much of that DOA
is due to newegg HDD packaging/shipping, however.

--khd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
> You mentioned that the pool was somewhat full, can you send the output
> of 'zpool iostat -v pool0'?

~# zpool iostat -v pool0
   capacity     operations    bandwidth
pool    alloc   free   read  write   read  write
--  -  -  -  -  -  -
pool0   14.1T  25.4T   926  2.35K  7.20M  15.7M
  raidz1 673G   439G 42   117   335K   790K
c5t5d0  -  - 20 20   167K   273K
c5t6d0  -  - 20 20   167K   272K
c5t7d0  -  - 20 20   167K   273K
c5t8d0  -  - 20 20   167K   272K
  raidz1 710G   402G 38 84   309K   546K
c5t9d0  -  - 18 16   158K   189K
c5t10d0 -  - 18 16   157K   187K
c5t11d0 -  - 18 16   158K   189K
c5t12d0 -  - 18 16   157K   187K
  raidz1 719G   393G 43 95   348K   648K
c5t13d0 -  - 20 17   172K   224K
c5t14d0 -  - 20 17   171K   223K
c5t15d0 -  - 20 17   172K   224K
c5t16d0 -  - 20 17   172K   223K
  raidz1 721G   391G 42 96   341K   653K
c5t21d0 -  - 20 16   170K   226K
c5t22d0 -  - 20 16   169K   224K
c5t23d0 -  - 20 16   170K   226K
c5t24d0 -  - 20 16   170K   224K
  raidz1 721G   391G 43   100   342K   667K
c5t25d0 -  - 20 17   172K   231K
c5t26d0 -  - 20 17   172K   229K
c5t27d0 -  - 20 17   172K   231K
c5t28d0 -  - 20 17   172K   229K
  raidz1 721G   391G 43   101   341K   672K
c5t29d0 -  - 20 18   173K   233K
c5t30d0 -  - 20 18   173K   231K
c5t31d0 -  - 20 18   173K   233K
c5t32d0 -  - 20 18   173K   231K
  raidz1 722G   390G 42   100   339K   667K
c5t33d0 -  - 20 19   171K   231K
c5t34d0 -  - 20 19   172K   229K
c5t35d0 -  - 20 19   171K   231K
c5t36d0 -  - 20 19   171K   229K
  raidz1 709G   403G 42   107   341K   714K
c5t37d0 -  - 20 20   171K   247K
c5t38d0 -  - 20 19   170K   245K
c5t39d0 -  - 20 20   171K   247K
c5t40d0 -  - 20 19   170K   245K
  raidz1 744G   368G 39 79   316K   530K
c5t41d0 -  - 18 16   163K   183K
c5t42d0 -  - 18 15   163K   182K
c5t43d0 -  - 18 16   163K   183K
c5t44d0 -  - 18 15   163K   182K
  raidz1 737G   375G 44 98   355K   668K
c5t45d0 -  - 21 18   178K   231K
c5t46d0 -  - 21 18   178K   229K
c5t47d0 -  - 21 18   178K   231K
c5t48d0 -  - 21 18   178K   229K
  raidz1 733G   379G 43   103   344K   683K
c5t49d0 -  - 20 19   175K   237K
c5t50d0 -  - 20 19   175K   235K
c5t51d0 -  - 20 19   175K   237K
c5t52d0 -  - 20 19   175K   235K
  raidz1 732G   380G 43   104   344K   685K
c5t53d0 -  - 20 19   176K   237K
c5t54d0 -  - 20 19   175K   235K
c5t55d0 -  - 20 19   175K   237K
c5t56d0 -  - 20 19   175K   235K
  raidz1 733G   379G 43   101   344K   672K
c5t57d0 -  - 20 17   175K   233K
c5t58d0 -  - 20 17   174K   231K
c5t59d0 -  - 20 17   175K   233K
c5t60d0 -  - 20 17   174K   231K
  raidz1 806G  1.38T 50   123   401K   817K
c5t61d0 -  - 24 22   201K   283K
c5t62d0 -  - 24 22   201K   281K
c5t63d0 -  - 24 22   201K   283K
c5t64d0 -  - 24 22   201K   281K
  raidz1 794G  1.40T 47   120   377K   786K
c5t65d0 -  - 22 23   194K   272K
c5t66d0 -  - 22 23   194K   270K
c5t67d0 -  - 22 23   194K   272K
c5t68d0 -  - 22 23   194K   270K
  raidz1 788G  1.40T 47   115   376

Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Paul Kraus
On Mon, May 16, 2011 at 1:20 PM, Brandon High  wrote:

> The 1TB and 2TB are manufactured in China, and have a very high
> failure and DOA rate according to Newegg.

All drives have a very high DOA rate according to Newegg. The
way they package drives for shipping is exactly how Seagate
specifically says NOT to pack them here
http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=what-to-pack&vgnextoid=5c3a8bc90bf03210VgnVCM101a48090aRCRD

I have stopped buying drives (and everything else) from Newegg
as they cannot be bothered to properly pack items. It is worth the
extra $5 per drive to buy them from CDW (who uses factory approved
packaging). Note that I made this change 5 or so years ago and Newegg
may have changed their packaging since then.

 What Newegg was doing was buying drives in 20-packs from the
manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and
then stuffed in a box. No clamshell. I realized *something* was up
when _every_ drive I looked at had a much higher report of DOA (or
early failure) in the Newegg reviews than made any sense (and compared
to other sites' reviews).

 This is NOT to say that the drives in question really don't have
a QC issue, just that the reports via Newegg are biased by Newegg's
packing / shipping practices.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Krunal Desai
On Mon, May 16, 2011 at 2:29 PM, Paul Kraus  wrote:
> What Newegg was doing is buying drives in the 20-pack from the
> manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and
> then stuffed in a box. No clamshell. I realized *something* was up
> when _every_ drive I looked at had a much higher report of DOA (or
> early failure) at the Newegg reviews than made any sense (and compared
> to other site's reviews).

I picked up a single 5K3000 last week, have not powered it on yet, but
it came in a pseudo-OEM box with clamshells. I remember getting
bubble-wrapped single drives from Newegg, and more than a fair share
of those drives suffered early deaths or never powered on in the first
place. No complaints about Amazon: Seagate drives came in Seagate OEM
boxes with free shipping via Prime. (probably not practical for you
enterprise/professional guys, but nice for home users).

An order of 6 of the 5K3000 drives for work-related purposes shipped in a
Styrofoam holder of sorts that was cut in half for my small number of
drives (is this what 20-packs come in?). No idea what other packaging
was around them (shipping and receiving opened the packages).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Garrett D'Amore
Actually it is 100 or less, i.e. a 10 msec delay.

  -- Garrett D'Amore

On May 16, 2011, at 11:13 AM, "Richard Elling"  wrote:

> On May 16, 2011, at 10:31 AM, Brandon High wrote:
>> On Mon, May 16, 2011 at 8:33 AM, Richard Elling
>>  wrote:
>>> As a rule of thumb, the resilvering disk is expected to max out at around
>>> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
>>> the throttles or broken data path.
>> 
>> My system was doing far less than 80 IOPS during resilver when I
>> recently upgraded the drives. The older and newer drives were both 5k
>> RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
>> be super fast.
>> 
>> The worst resilver was 50 hours, the best was about 20 hours. This was
>> just my home server, which is lightly used. The clients (2-3 CIFS
>> clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
>> clients) are mostly idle and don't do a lot of writes.
>> 
>> Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
>> up a bit, which suggests that the default values may be too
>> conservative for some environments.
> 
> I am more inclined to change the hires_tick value. The "delays" are in 
> units of clock ticks. For Solaris, the default clock tick is 10ms, that I will
> argue is too large for modern disk systems. What this means is that when 
> the resilver, scrub, or memory throttle causes delays, the effective IOPS is
> driven to 10 or less. Unfortunately, these values are guesses and are 
> probably suboptimal for various use cases. OTOH, the prior behaviour of
> no resilver or scrub throttle was also considered a bad thing.
> -- richard
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Paul Kraus
On Mon, May 16, 2011 at 2:35 PM, Krunal Desai  wrote:

> An order of 6 the 5K3000 drives for work-related purposes shipped in a
> Styrofoam holder of sorts that was cut in half for my small number of
> drives (is this what 20 pks come in?). No idea what other packaging
> was around them (shipping and receiving opened the packages).

Yes, the 20 packs I have seen are a big box with a foam insert with 2
columns of 10 'slots' that hold a drive in anti-static plastic.

P.S. I buy from CDW (and previously from Newegg) for home not work.
Work tends to buy from Sun/Oracle via a reseller. I can't afford new
Sun/Oracle for home use.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Eric D. Mudama

On Mon, May 16 at 14:29, Paul Kraus wrote:

   I have stopped buying drives (and everything else) from Newegg
as they cannot be bothered to properly pack items. It is worth the
extra $5 per drive to buy them from CDW (who uses factory approved
packaging). Note that I made this change 5 or so years ago and Newegg
may have changed their packaging since then.


NewEgg packaging is exactly what you describe, unchanged in the last
few years.  Most recent newegg drive purchase was last week for me.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Still no way to recover a "corrupted" pool

2011-05-16 Thread Freddie Cash
On Fri, Apr 29, 2011 at 5:17 PM, Brandon High  wrote:
> On Fri, Apr 29, 2011 at 1:23 PM, Freddie Cash  wrote:
>> Running ZFSv28 on 64-bit FreeBSD 8-STABLE.
>
> I'd suggest trying to import the pool into snv_151a (Solaris 11
> Express), which is the reference and development platform for ZFS.

Would not import in Solaris 11 Express.  :(  Could not even find any
pools to import.  Even when using "zpool import -d /dev/dsk" or any
other import commands.  Most likely due to using a FreeBSD-specific
method of labelling the disks.

I've since rebuilt the pool (a third time), using GPT partitions,
labels on the partitions, and using the labels in the pool
configuration.  That should make it importable across OSes (FreeBSD,
Solaris, Linux, etc).
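
For anyone else doing the same, the per-disk setup on FreeBSD looks roughly like this 
(sketch only - "da0", "tank" and the "diskNN" labels are placeholders; repeat for every 
drive and then build the pool on the /dev/gpt/* names):

# gpart create -s gpt da0
# gpart add -t freebsd-zfs -l disk00 da0
# zpool create tank raidz2 gpt/disk00 gpt/disk01 gpt/disk02 ...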

It's just frustrating that it's still possible to corrupt a pool in
such a way that "nuke and pave" is the only solution.  Especially when
this same assertion was discussed in 2007 ... with no workaround or
fix or whatnot implemented, four years later.

What's most frustrating is that this is the third time I've built this
pool due to corruption like this, within three months.  :(

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Still no way to recover a "corrupted" pool

2011-05-16 Thread Brandon High
On Mon, May 16, 2011 at 1:55 PM, Freddie Cash  wrote:
> Would not import in Solaris 11 Express.  :(  Could not even find any
> pools to import.  Even when using "zpool import -d /dev/dsk" or any
> other import commands.  Most likely due to using a FreeBSD-specific
> method of labelling the disks.

I think someone solved this before by creating a directory and making
symlinks to the correct partition/slices on each disk. Then you can
use 'zpool import -d /tmp/foo' to do the import. eg:

# mkdir /tmp/fbsd   # create a temp directory pointing to the p0 partitions of the relevant disks
# ln -s /dev/dsk/c8t1d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t2d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t3d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t4d0p0 /tmp/fbsd/
# zpool import -d /tmp/fbsd/ $POOLNAME

I've never used FreeBSD so I can't offer any advice about which device
name is correct or if this will work. Posts from February 2010 "Import
zpool from FreeBSD in OpenSolaris" indicate that you want p0.

> It's just frustrating that it's still possible to corrupt a pool in
> such a way that "nuke and pave" is the only solution.  Especially when

I'm not sure it was the only solution, it's just the one you followed.

> What's most frustrating is that this is the third time I've built this
> pool due to corruption like this, within three months.  :(

You may have an underlying hardware problem, or there could be a bug
in the FreeBSD implementation that you're tripping over.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
> You mentioned that the pool was somewhat full, can you send the output
> of 'zpool iostat -v pool0'? You can also try doing the following to
> reduce 'metaslab_min_alloc_size' to 4K:
>
> echo "metaslab_min_alloc_size/Z 1000" | mdb -kw
So just changing that setting moved my write rate from 40MB/s to 175MB/s.

That's a huge improvement. It's still not as high as I used to see on
this box- but at least now the array is useable again. Thanks for the
suggestion!

Any other tunables I should be taking a look at?

-Don
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Roy Sigurd Karlsbakk
> Running a zpool scrub on our production pool is showing a scrub rate
> of about 400K/s. (When this pool was first set up we saw rates in the
> MB/s range during a scrub).

Usually, something like this is caused by a bad drive. Can you post iostat -en 
output?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is
an elementary imperative for all pedagogues to avoid excessive use of idioms of
foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Jim Klimov

2011-05-16 9:14, Richard Elling wrote:

On May 15, 2011, at 10:18 AM, Jim Klimov  wrote:


Hi, Very interesting suggestions as I'm contemplating a Supermicro-based server 
for my work as well, but probably in a lower budget as a backup store for an 
aging Thumper (not as its superior replacement).

Still, I have a couple of questions regarding your raidz layout recommendation.

On one hand, I've read that as current drives get larger (while their random 
IOPS/MBPS don't grow nearly as fast with new generations), it is becoming more 
and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for vdevs 
made of many disks - a dozen or so. When a drive fails, you still have two 
redundant parities, and with a resilver window expected to be in hours if not 
days range, I would want that airbag, to say the least. You know, failures 
rarely come one by one ;)

Not to worry. If you add another level of redundancy, the data protection
is improved by orders of magnitude. If the resilver time increases, the effect
on data protection is reduced by a relatively small divisor. To get some sense
of this, the MTBF is often 1,000,000,000 hours and there are only 24 hours in
a day.



If MTBFs were real, we'd never see disks failing within a year ;)

The problem is, these values seem to be determined in an ivory-tower
lab. An expensive-vendor edition of a drive running in a cooled
data center, with shock absorbers and other nice features, does
often live a lot longer than a similar OEM enterprise or consumer
drive running in an apartment with varying weather around it,
often overheating, and randomly vibrating along with a dozen other
disks spinning in the same box.

The remark about expensive-vendor drive editions comes from
my memory of some forum or blog discussion which I can't point
to now either, which suggested that vendors like Sun do not
charge 5x-10x the price of the same label of OEM drive just
for a nice corporate logo stamped onto the disk. Vendors were
said to burn in the drives in their labs for something like half a
year or a year before putting the survivors on the market. This implies
that some of the drives did not survive the burn-in period, and
indeed the MTBF for the remaining ones is higher, because
"infant mortality" due to manufacturing problems soon after
arrival at the end customer is unlikely for these particular
tested devices. The long burn-in times were also said to
be part of the reason why vendors never sell the biggest
disks available on the market (does any vendor sell 3TB
under their own brand yet? Sun-Oracle? IBM? HP?)
This may be presented as a "certification process", which
occasionally takes about as long - to see whether the newest
and greatest disks die within a year or so.

Another idea implied in that discussion was that the vendors
can influence OEMs in their choice of components, an example
in the thread being different grades of steel for the
ball bearings. Such choices can drive the price up for
a reason - disks like that are more expensive to produce -
but they also increase reliability.

In fact, I've had very few Sun disks break in the boxes
I've managed over 10 years; all I can remember now were
two or three 2.5" 72GB Fujitsus with a Sun brand. And we still
have another dozen of those that have been running for several years.

So yes, I can believe that Big Vendor Brand disks can boast
huge MTBFs and prove that with a track record, and such
drives are often replaced not because of a breakdown,
but rather as a precaution, or simply because they have
become obsolete - too slow and too small by current standards.

But for the rest of us (like home ZFS users) such MTBF numbers
are as fantastic as the Big Vendor prices, and
unachievable for any number of reasons, starting with the use
of cheaper and potentially worse hardware from the
beginning, and the non-"greenhouse" conditions the
machines run in...

I do have some 5-year-old disks running in computers
daily and still alive, but I have about as many which died
young, sometimes even within the warranty period ;)



On another hand, I've recently seen many recommendations that in a RAIDZ* drive set, the 
number of data disks should be a power of two - so that ZFS blocks/stripes and those 
of its users (like databases) which are inclined to use 2^N-sized blocks can be often 
accessed in a single IO burst across all drives, and not in "one and one-quarter 
IO" on the average, which might delay IOs to other stripes while some of the disks 
in a vdev are busy processing leftovers of a previous request, and others are waiting for 
their peers.

I've never heard of this and it doesn't pass the sniff test. Can you cite a 
source?

I was trying to find an "authoritative" link today but failed.
I know I've read this many times over the past couple
of months, but this may still be an "urban legend" or even
FUD, retold many times...

In fact, today I came across old posts from Jeff Bonwick,
where he explains the disk usage and "ZFS striping" which
is not like usual RAID striping. If th

Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Jim Klimov

2011-05-16 22:21, George Wilson wrote:

echo "metaslab_min_alloc_size/Z 1000" | mdb -kw


Thanks, this also boosted my home box from hundreds of KB/s into
the several-MB/s range, which is much better (I'm evacuating data from
a pool hosted in a volume inside my main pool, and the bottleneck
is quite substantial) - now I'll get rid of this experiment much faster ;)


--


++
||
| Климов Евгений, Jim Klimov |
| технический директор   CTO |
| ЗАО "ЦОС и ВТ"  JSC COS&HT |
||
| +7-903-7705859 (cellular)  mailto:jimkli...@cos.ru |
|  CC:ad...@cos.ru,jimkli...@mail.ru |
++
| ()  ascii ribbon campaign - against html mail  |
| /\- against microsoft attachments  |
++



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Paul Kraus
> 
> All drives have a very high DOA rate according to Newegg. The
> way they package drives for shipping is exactly how Seagate
> specifically says NOT to pack them here

8 months ago, Newegg said they'd changed this practice.
http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-16 Thread Donald Stahl
As a followup:

I ran the same DD test as earlier- but this time I stopped the scrub:

pool0       14.1T  25.4T     88  4.81K   709K   262M
pool0       14.1T  25.4T    104  3.99K   836K   248M
pool0       14.1T  25.4T    360  5.01K  2.81M   230M
pool0       14.1T  25.4T    305  5.69K  2.38M   231M
pool0       14.1T  25.4T    389  5.85K  3.05M   293M
pool0       14.1T  25.4T    376  5.38K  2.94M   328M
pool0       14.1T  25.4T    295  3.29K  2.31M   286M

~# dd if=/dev/zero of=/pool0/ds.test bs=1024k count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 6.50394 s, 322 MB/s

Stopping the scrub seemed to increase my performance by another 60%
over the highest numbers I saw just from the metaslab change earlier
(That peak was 201 MB/s).

This is the performance I was seeing out of this array when newly built.

I have two follow up questions:

1. We changed metaslab_min_alloc_size from 10M down to 4K - that's a pretty
drastic change. Is there some intermediate value that should be used instead,
and/or is there a downside to using such a small value?

2. I'm still confused by the poor scrub performance and its impact on
the write performance. I'm not seeing a lot of IOs or processor load,
so I'm wondering what else I might be missing.

-Don
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Eric D. Mudama

On Mon, May 16 at 21:55, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Paul Kraus

All drives have a very high DOA rate according to Newegg. The
way they package drives for shipping is exactly how Seagate
specifically says NOT to pack them here


8 months ago, newegg says they've changed this practice.
http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167


The drives I just bought were half packed in white foam then wrapped
in bubble wrap.  Not all edges were protected with more than bubble
wrap.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss