Re: [zfs-discuss] resilver = defrag?

2010-09-10 Thread Darren J Moffat

On 10/09/2010 04:24, Bill Sommerfeld wrote:

C) Does zfs send | zfs receive mean it will defrag?


Scores so far:
1 No
2 Yes


"maybe". If there is sufficient contiguous freespace in the destination
pool, files may be less fragmented.

But if you do incremental sends of multiple snapshots, you may well
replicate some or all the fragmentation on the origin (because snapshots
only copy the blocks that change, and receiving an incremental send does
the same).

And if the destination pool is short on space you may end up more
fragmented than the source.
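
As a rough sketch (the pool, filesystem, and snapshot names here are made up),
a single full send gives the receiving pool the chance to lay everything out
fresh, while an incremental chain writes only the changed blocks and so tends
to carry the origin's layout along with it:

  zfs snapshot tank/fs@move
  zfs send tank/fs@move | zfs receive otherpool/fs                          # full stream

  zfs send -i tank/fs@yesterday tank/fs@today | zfs receive otherpool/fs    # incremental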


There is yet more "it depends".

It depends on what you mean by fragmentation.

ZFS has "gang blocks", which are used when we need to store a block of 
size N but can't find a block that size but can make up that amount of 
storage from M smaller blocks that are available.


Because ZFS send|recv work at the DMU layer they know nothing about gang 
blocks, which are a ZIO layer concept.  As such if your filesystem is 
heavily "fragmented" on the source because it uses gang blocks, that 
doesn't necessarily mean it will be using gang blocks at all or of the 
same size on the destination.


I very strongly recommend the original poster take a step back and ask: 
"Why are you even worried about fragmentation?", "Do you know you have a 
pool that is fragmented?", and "Is it actually causing you a performance 
problem?"


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage

2010-09-10 Thread Piotr Jasiukajtis
OK, now I know it's not related to I/O performance but to ZFS itself.

At some point, all 3 pools were locked up like this:

                            extended device statistics       ---- errors ----
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
    0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
    0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t1d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t2d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t4d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t5d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t10d0
    0.0    0.0    0.0    0.0  0.0  3.0    0.0    0.0   0 100   0   0   0   0 c7t11d0
^C
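
(For reference, output in this form comes from something like "iostat -xne 5";
the -e flag adds the s/w, h/w, trn and tot error columns.)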


# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
dataONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c7t2d0  ONLINE   0 0 0
c7t3d0  ONLINE   0 0 0
  mirror-1  ONLINE   0 0 0
c7t4d0  ONLINE   0 0 0
c7t5d0  ONLINE   0 0 0

errors: No known data errors

  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror-0ONLINE   0 0 0
c7t0d0s0  ONLINE   0 0 0
c7t1d0s0  ONLINE   0 0 0

errors: No known data errors

  pool: tmp_data
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 0.74% done, 2h21m to go
config:

NAME STATE READ WRITE CKSUM
tmp_data ONLINE   0 0 0
  mirror-0   ONLINE   0 0 0
c7t11d0  ONLINE   0 0 0
c7t10d0  ONLINE   0 0 0  2.07G resilvered

errors: No known data errors

The tmp_data resilver is unrelated; I ran zpool attach manually.
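
(For the record, the attach was along these lines, with the device roles taken
from the status output above:

  zpool attach tmp_data c7t11d0 c7t10d0

i.e. c7t10d0 was attached as a new mirror of the existing c7t11d0.)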

On Tue, Sep 7, 2010 at 12:39 PM, Piotr Jasiukajtis  wrote:
> This is snv_128 x86.
>
>> ::arc
> hits                      =  39811943
> misses                    =    630634
> demand_data_hits          =  29398113
> demand_data_misses        =    490754
> demand_metadata_hits      =  10413660
> demand_metadata_misses    =    133461
> prefetch_data_hits        =         0
> prefetch_data_misses      =         0
> prefetch_metadata_hits    =       170
> prefetch_metadata_misses  =      6419
> mru_hits                  =   2933011
> mru_ghost_hits            =     43202
> mfu_hits                  =  36878818
> mfu_ghost_hits            =     45361
> deleted                   =   1299527
> recycle_miss              =     46526
> mutex_miss                =       355
> evict_skip                =     25539
> evict_l2_cached           =         0
> evict_l2_eligible         = 77011188736
> evict_l2_ineligible       =  76253184
> hash_elements             =    278135
> hash_elements_max         =    279843
> hash_collisions           =   1653518
> hash_chains               =     75135
> hash_chain_max            =         9
> p                         =      4787 MB
> c                         =      5722 MB
> c_min                     =       715 MB
> c_max                     =      5722 MB
> size                      =      5428 MB
> hdr_size                  =  56535840
> data_size                 = 5158287360
> other_size                = 477726560
> l2_hits                   =         0
> l2_misses                 =         0
> l2_feeds                  =         0
> l2_rw_clash               =         0
> l2_read_bytes             =         0
> l2_write_bytes            =         0
> l2_writes_sent            =         0
> l2_writes_done            =         0
> l2_writes_error           =         0
> l2_writes_hdr_miss        =         0
> l2_evict_lock_retry       =         0
> l2_evict_reading          =         0
> l2_free_on_write          =         0
> l2_abort_lowmem           =         0
> l2_cksum_bad              =         0
> l2_io_error               =         0
> l2_size                   =         0
> l2_hdr_size               =         0
> memory_throttle_count     =         0
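
For reference, the ::arc output above can also be captured non-interactively,
which is handy for sampling it periodically from a script:

  echo ::arc | mdb -k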

Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage

2010-09-10 Thread Carson Gaspar

On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:

Ok, now I know it's not related to the I/O performance, but to the ZFS itself.

At some time all 3 pools were locked in that way:

                             extended device statistics       ---- errors ----
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0


Nope, most likely your disks or disk controller/driver. Note that you 
have 8 outstanding I/O requests that aren't being serviced. Look in your 
syslog, and I bet you'll see I/O timeout errors. I have seen this before 
with Western Digital disks attached to an LSI controller using the mpt 
driver. There was a lot of work diagnosing it, see the list archives - 
an /etc/system change fixed it for me (set xpv_psm:xen_support_msi = 
-1), but I was using a xen kernel. Note that replacing my disks with 
larger Seagate ones made the problem go away as well.
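
For anyone who wants to try the same workaround (again, only relevant if you
are on a xen/xVM kernel, as I was), the exact /etc/system entry, followed by a
reboot, was:

  set xpv_psm:xen_support_msi = -1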

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage

2010-09-10 Thread Piotr Jasiukajtis
I don't have any errors from fmdump or syslog.
The machine is a Sun Fire X4275; I don't use the mpt or LSI drivers.
It could be a bug in a driver, since I see this on 2 identical machines.

On Fri, Sep 10, 2010 at 9:51 PM, Carson Gaspar  wrote:
> On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:
>>
>> Ok, now I know it's not related to the I/O performance, but to the ZFS
>> itself.
>>
>> At some time all 3 pools were locked in that way:
>>
>>                             extended device statistics       ---- errors ----
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
>>     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
>
> Nope, most likely your disks or disk controller/driver. Note that you have 8
> outstanding I/O requests that aren't being serviced. Look in your syslog,
> and I bet you'll see I/O timeout errors. I have seen this before with
> Western Digital disks attached to an LSI controller using the mpt driver.
> There was a lot of work diagnosing it, see the list archives - an
> /etc/system change fixed it for me (set xpv_psm:xen_support_msi = -1), but I
> was using a xen kernel. Note that replacing my disks with larger Seagate
> ones made the problem go away as well.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>



-- 
Piotr Jasiukajtis | estibi | SCA OS0072
http://estseg.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage

2010-09-10 Thread Richard Elling
You are both right.  More below...

On Sep 10, 2010, at 2:06 PM, Piotr Jasiukajtis wrote:

> I don't have any errors from fmdump or syslog.
> The machine is SUN FIRE X4275 I don't use mpt or lsi drivers.
> It could be a bug in a driver since I see this on 2 the same machines.
> 
> On Fri, Sep 10, 2010 at 9:51 PM, Carson Gaspar  wrote:
>> On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:
>>> 
>>> Ok, now I know it's not related to the I/O performance, but to the ZFS
>>> itself.
>>> 
>>> At some time all 3 pools were locked in that way:
>>> 
>>>                             extended device statistics       ---- errors ----
>>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>>>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
>>>     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
>> 
>> Nope, most likely your disks or disk controller/driver. Note that you have 8
>> outstanding I/O requests that aren't being serviced. Look in your syslog,
>> and I bet you'll see I/O timeout errors. I have seen this before with
>> Western Digital disks attached to an LSI controller using the mpt driver.
>> There was a lot of work diagnosing it, see the list archives - an
>> /etc/system change fixed it for me (set xpv_psm:xen_support_msi = -1), but I
>> was using a xen kernel. Note that replacing my disks with larger Seagate
>> ones made the problem go away as well.

In this case, the diagnosis that I/Os are stuck at the drive and not being
serviced is correct. This is clearly visible as actv > 0, asvc_t == 0, and the
derived %b == 100%. However, the error counts are also 0 for the affected
devices: s/w, h/w, and trn. In many cases where we see I/O timeouts and devices
aborting commands, these are logged as transport (trn) errors. For iostat,
these error counters are reported since boot, not per sample period, so we know
that whatever is getting stuck isn't getting unstuck. The symptom we see with
questionable devices in the HBA-to-disk path is hundreds, thousands, or
millions of transport errors reported.

Next question: what does the software stack look like? I knew the sd driver
intimately at one time (pictures were in the Enquirer :-) and it will retry and
send resets that will ultimately get logged. In this case, we know that at
least one hard error was returned for c8t0d0, so there is an ereport somewhere
with the details; try "fmdump -eV".
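
For reference, two standard ways to dig further (the device name here is the
one from the iostat sample above):

  fmdump -eV          # dump the error telemetry (ereports) in full detail
  iostat -En c8t0d0   # since-boot soft/hard/transport error counts for that device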

This is not a ZFS bug and cannot be fixed at the ZFS layer.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-10 Thread Richard Elling
On Sep 9, 2010, at 5:55 PM, Fei Xu wrote:
> Just to update the status and findings.

Thanks for the update.

> I've checked TLER settings and they are off by default.
> 
> I moved the source pool to another chassis and did the 3.8TB send again. This
> time, no problems! The differences are:
> 1. New chassis

Can you describe the old and new chassis in detail?  Model numbers?

> 2. BIGGER memory.  32GB v.s 12GB

It is not a memory issue.

> 3. Although the wdidle timer is disabled by default, I've changed the HD mode
> from silent to performance in HDTune. This is something I once read on a website
> that might also fix the disk head park/unpark issue (aka C1).

Not a bad idea.

> Seems TLER is not the root cause, or at least leaving it off is OK.

Definitely not a TLER issue.

> My next steps will be:
> 1. move the HDs back to see if it was the "performance mode" change that fixed the issue
> 2. if not, add more memory and try again.

It is not a memory issue.

> By the way, in HDTune I saw that C7 (Ultra DMA CRC error count) is a little
> high, which indicates a potential connection issue. Maybe it is all caused by
> the enclosure?

Bingo!
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-10 Thread Richard Elling
On Sep 9, 2010, at 6:39 AM, Marty Scholes wrote:

> Erik wrote:
>> Actually, your biggest bottleneck will be the IOPS limits of the
>> drives.  A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just
>> 100 IOPS, that means you will finish (best case) in 62.5e4 seconds.
>> Which is over 173 hours. Or, about 7.25 WEEKS.
> 
> My OCD is coming out and I will split that hair with you.  173 hours is just 
> over a week.
> 
> This is a fascinating and timely discussion.  My personal (biased and 
> unhindered by facts) preference is wide stripes RAIDZ3.  Ned is right that I 
> kept reading that RAIDZx should not exceed _ devices and couldn't find real 
> numbers behind those conclusions.

There isn't a real number.  We know that a 46-disk raidz stripe is a recipe for
unhappiness (because people actually tried that when the thumper was released),
and we know that a 2-disk raidz1 is kinda like mirroring -- a hard sell. So we
had to find a number between the two, somewhere in the realm of reasonable.

> Discussions in this thread have opened my eyes a little and I am in the
> middle of deploying a second 22-disk fibre array on my home server, so I have
> been struggling with the best way to allocate pools.

Simple, mirror it and be happy :-).

> Up until reading this thread, the biggest downside to wide stripes, that I 
> was aware of, has been low iops.  And let's be clear: while on paper the iops 
> of a wide stripe is the same as a single disk, it actually is worse.  In 
> truth, the service time for any request on wide stripe is the service time of 
> the SLOWEST disk for that request.  The slowest disk may vary from request to 
> request, but will always delay the entire stripe operation.

Yes, but this is not a problem for async writes, so it will depend on the 
workload.

> Since all of the 44 spindles are 15K disks, I am about to convince myself to 
> go with two pools of wide stripes and keep several spindles for L2ARC and 
> SLOG.  The thinking is that other background operations (scrub and resilver) 
> can take place with little impact to application performance, since those 
> will be using L2ARC and SLOG.
> 
> Of course, I could be wrong on any of the above.

If you get it wrong, you can reconfigure most things on the fly. The exceptions:
you can't add columns to a raidz, and you can't shrink. A good strategy is to
start with what you need and add disks as capacity requires. Oh, and by the way,
the easiest way to do that is with mirrors :-)  But if you insist on raidz, then
consider something like 6-way or 8-way sets, because that is the typical
denominator for most hardware trays today.
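
A minimal sketch of that strategy, with made-up device names (start small,
then grow by whole top-level vdevs as capacity requires):

  zpool create tank mirror c0t0d0 c0t1d0       # start with one mirrored pair
  zpool add tank mirror c0t2d0 c0t3d0          # later, add another pair
  zpool add tank cache c0t4d0                  # optional L2ARC device
  zpool add tank log mirror c0t5d0 c0t6d0      # optional mirrored slog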
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-10 Thread Fei Xu
> > By the way, in HDTune I saw that C7 (Ultra DMA CRC error count) is a little
> > high, which indicates a potential connection issue. Maybe it is all caused
> > by the enclosure?
> 
> Bingo!


You are right. I've done a lot of tests and have narrowed the defect down to the
"problem hardware". The two pools work fine in one chassis, but after being moved
back to the original enclosure they fail again during cp or zfs send. I also
noticed that when the machine boots up and reads the ZFS config, there is a
warning message:

Reading ZFS config: *WARNING* /p...@0,0/pci8086,3...@8/pci15d9,1...@0(mpt0):
Discovery in progress, can't verify IO unit config.

I searched a lot but could not find more details.
My 2 server configurations:
1. "Problem chassis": Supermicro SuperChassis 847E2, Tysonberg MB with onboard
LSI 1068e (IT mode, which exposes the HDs directly to the system without RAID),
single Xeon 5520.
2. "Good chassis": self-developed chassis from another department. S5000WB MB,
single E5504, 2 PCIe x4 LSI 3081 HBA cards.

The SAS cables all seem to be connected correctly. I suspect the onboard 1068e,
and will move the LSI 3081 card to the "problem" server to test.
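
If you want to watch that counter from the OS rather than from HDTune, and
assuming smartmontools is installed and the controller passes SMART through
(the device path below is a placeholder), something like this shows it:

  smartctl -A /dev/rdsk/c0t0d0s0 | grep -i crc

SMART attribute 199 (UDMA_CRC_Error_Count) is the same counter HDTune labels
C7; a count that keeps rising under load usually points at cabling or the
backplane rather than the disk itself.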
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Solaris 10u9 with zpool version 22, but no DEDUP (version 21 reserved)

2010-09-10 Thread Hans Foertsch
bash-3.00# uname -a
SunOS testxx10 5.10 Generic_142910-17 i86pc i386 i86pc

bash-3.00# zpool upgrade -v
This system is currently running ZFS pool version 22.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Reserved
 22  Received properties

For more information on a particular version, including supported releases,
see the ZFS Administration Guide.

This is an interesting condition...

What happens if you use zpools that were created on OpenSolaris with dedup
enabled on Solaris 10u9?
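
One way to see what you are dealing with before trying it (the pool name is a
placeholder):

  zpool get version tank    # on-disk version of the pool in question
  zpool upgrade -v          # versions this host's ZFS supports
  zpool import              # lists importable pools and flags version mismatches

If the pool's on-disk version is newer than anything the host supports, the
import should be refused outright rather than the pool silently losing features.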

Hans Foertsch
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss