Re: [zfs-discuss] resilver = defrag?
On 10/09/2010 04:24, Bill Sommerfeld wrote:
> C) Does zfs send | zfs receive mean it will defrag?
>
> Scores so far:
> 1 No
> 2 Yes
>
> "maybe".  If there is sufficient contiguous freespace in the destination
> pool, files may be less fragmented.  But if you do incremental sends of
> multiple snapshots, you may well replicate some or all of the fragmentation
> of the origin (because snapshots only copy the blocks that change, and
> receiving an incremental send does the same).  And if the destination pool
> is short on space you may end up more fragmented than the source.

There is yet more "it depends". It depends on what you mean by fragmentation.

ZFS has "gang blocks", which are used when we need to store a block of size N
and can't find a free block of that size, but can make up that amount of
storage from M smaller blocks that are available. Because zfs send|recv works
at the DMU layer, it knows nothing about gang blocks, which are a ZIO-layer
concept. As such, if your filesystem is heavily "fragmented" on the source
because it uses gang blocks, that doesn't necessarily mean it will be using
gang blocks at all, or of the same size, on the destination.

I very strongly recommend the original poster take a step back and ask:
"Why are you even worried about fragmentation?"
"Do you know you have a pool that is fragmented?"
"Is it actually causing you a performance problem?"

--
Darren J Moffat
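For readers following along, a minimal sketch of the two cases Darren contrasts; the pool and dataset names (tank/data on the source, newpool on the destination) are hypothetical:

    # Full, non-incremental copy into a pool with plenty of contiguous free
    # space: all blocks are allocated afresh on the destination
    zfs snapshot tank/data@migrate
    zfs send tank/data@migrate | zfs receive newpool/data

    # Incremental replication: only the blocks that changed between snapshots
    # are sent, so much of the origin's layout (and fragmentation) can carry
    # over to the destination
    zfs send -i tank/data@old tank/data@migrate | zfs receive -F newpool/data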
Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage
Ok, now I know it's not related to the I/O performance, but to ZFS itself.

At some point all 3 pools were locked up in this way:

                    extended device statistics       ---- errors ----
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
    0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
    0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t1d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t2d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t4d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t5d0
    0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0 100   0   0   0   0 c7t10d0
    0.0    0.0    0.0    0.0  0.0  3.0    0.0    0.0   0 100   0   0   0   0 c7t11d0
^C

# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0
            c7t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tmp_data
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 0.74% done, 2h21m to go
config:

        NAME         STATE     READ WRITE CKSUM
        tmp_data     ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0  2.07G resilvered

errors: No known data errors

Resilvering tmp_data is not related; I did that zpool attach manually.

On Tue, Sep 7, 2010 at 12:39 PM, Piotr Jasiukajtis wrote:
> This is snv_128 x86.
>
> > ::arc
> hits = 39811943
> misses = 630634
> demand_data_hits = 29398113
> demand_data_misses = 490754
> demand_metadata_hits = 10413660
> demand_metadata_misses = 133461
> prefetch_data_hits = 0
> prefetch_data_misses = 0
> prefetch_metadata_hits = 170
> prefetch_metadata_misses = 6419
> mru_hits = 2933011
> mru_ghost_hits = 43202
> mfu_hits = 36878818
> mfu_ghost_hits = 45361
> deleted = 1299527
> recycle_miss = 46526
> mutex_miss = 355
> evict_skip = 25539
> evict_l2_cached = 0
> evict_l2_eligible = 77011188736
> evict_l2_ineligible = 76253184
> hash_elements = 278135
> hash_elements_max = 279843
> hash_collisions = 1653518
> hash_chains = 75135
> hash_chain_max = 9
> p = 4787 MB
> c = 5722 MB
> c_min = 715 MB
> c_max = 5722 MB
> size = 5428 MB
> hdr_size = 56535840
> data_size = 5158287360
> other_size = 477726560
> l2_hits = 0
> l2_misses = 0
> l2_feeds = 0
> l2_rw_clash = 0
> l2_read_bytes = 0
> l2_write_bytes = 0
> l2_writes_sent = 0
> l2_writes_done = 0
> l2_writes_error = 0
> l2_writes_hdr_miss = 0
> l2_evict_lock_retry = 0
> l2_evict_reading = 0
> l2_free_on_write = 0
> l2_abort_lowmem = 0
> l2_cksum_bad = 0
> l2_io_error = 0
> l2_size = 0
> l2_hdr_size = 0
> memory_throttle_count = 0
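For reference, a sketch of commands that produce the kind of snapshot shown above; the 5-second interval and option set are illustrative:

    # Per-device queue depth and cumulative error counters, sampled every 5 s;
    # a device whose actv stays above 0 while asvc_t is 0 and %b is 100 has
    # I/Os sitting in the device that never complete
    iostat -xnze 5

    # ARC statistics from the live kernel (the ::arc dcmd quoted in the
    # earlier message above)
    echo ::arc | mdb -k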
Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage
On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:
> Ok, now I know it's not related to the I/O performance, but to ZFS itself.
>
> At some point all 3 pools were locked up in this way:
>
>                     extended device statistics       ---- errors ----
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
>     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0

Nope, most likely your disks or disk controller/driver. Note that you have 8
outstanding I/O requests that aren't being serviced.

Look in your syslog, and I bet you'll see I/O timeout errors. I have seen this
before with Western Digital disks attached to an LSI controller using the mpt
driver. There was a lot of work diagnosing it, see the list archives - an
/etc/system change fixed it for me (set xpv_psm:xen_support_msi = -1), but I
was using a xen kernel. Note that replacing my disks with larger Seagate ones
made the problem go away as well.
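A sketch of the checks Carson describes; /var/adm/messages is the default Solaris log location, and the /etc/system line is his xen-specific workaround quoted from the message above (a kernel tunable, not a shell command, and it only takes effect after a reboot):

    # Scan the system log for I/O timeout / retry / reset complaints
    egrep -i 'timeout|retry|reset' /var/adm/messages

    # Carson's workaround, appended to /etc/system on a xen kernel using the
    # xpv_psm module; reboot required
    set xpv_psm:xen_support_msi = -1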
Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage
I don't have any errors from fmdump or syslog.
The machine is a Sun Fire X4275; I don't use mpt or LSI drivers.
It could be a bug in a driver, since I see this on 2 identical machines.

On Fri, Sep 10, 2010 at 9:51 PM, Carson Gaspar wrote:
> On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:
>> Ok, now I know it's not related to the I/O performance, but to ZFS itself.
>>
>> At some point all 3 pools were locked up in this way:
>>
>>                     extended device statistics       ---- errors ----
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
>>     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
>
> Nope, most likely your disks or disk controller/driver. Note that you have 8
> outstanding I/O requests that aren't being serviced. Look in your syslog,
> and I bet you'll see I/O timeout errors. I have seen this before with
> Western Digital disks attached to an LSI controller using the mpt driver.
> There was a lot of work diagnosing it, see the list archives - an
> /etc/system change fixed it for me (set xpv_psm:xen_support_msi = -1), but I
> was using a xen kernel. Note that replacing my disks with larger Seagate
> ones made the problem go away as well.

--
Piotr Jasiukajtis | estibi | SCA OS0072
http://estseg.blogspot.com
Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage
You are both right. More below...

On Sep 10, 2010, at 2:06 PM, Piotr Jasiukajtis wrote:
> I don't have any errors from fmdump or syslog.
> The machine is a Sun Fire X4275; I don't use mpt or LSI drivers.
> It could be a bug in a driver, since I see this on 2 identical machines.
>
> On Fri, Sep 10, 2010 at 9:51 PM, Carson Gaspar wrote:
>> On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:
>>> Ok, now I know it's not related to the I/O performance, but to ZFS itself.
>>>
>>> At some point all 3 pools were locked up in this way:
>>>
>>>                     extended device statistics       ---- errors ----
>>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>>>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
>>>     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
>>
>> Nope, most likely your disks or disk controller/driver. Note that you have 8
>> outstanding I/O requests that aren't being serviced. Look in your syslog,
>> and I bet you'll see I/O timeout errors. I have seen this before with
>> Western Digital disks attached to an LSI controller using the mpt driver.
>> There was a lot of work diagnosing it, see the list archives - an
>> /etc/system change fixed it for me (set xpv_psm:xen_support_msi = -1), but I
>> was using a xen kernel. Note that replacing my disks with larger Seagate
>> ones made the problem go away as well.

In this case, the diagnosis that I/Os are stuck at the drive, not being
serviced, is correct. This is clearly visible as actv > 0, asvc_t == 0, and
the derived %b == 100%.

However, the error counters are also 0 for the affected devices: s/w, h/w,
and trn. In many cases where we see I/O timeouts and devices aborting
commands, we will see these logged as transport (trn) errors. For iostat,
these errors are reported since boot, not per sample period, so we know that
whatever is getting stuck isn't getting unstuck. The symptom we see with
questionable devices in the HBA-to-disk path is hundreds, thousands, or
millions of transport errors reported.

Next question: what does the software stack look like? I knew the sd driver
intimately at one time (pictures were in the Enquirer :-) and it will retry
and send resets that will ultimately get logged. In this case, we know that
at least one hard error was returned for c8t0d0, so there is an ereport
somewhere with the details; try "fmdump -eV".

This is not a ZFS bug and cannot be fixed at the ZFS layer.
-- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
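A sketch of the FMA queries Richard points to; output naturally varies per system:

    # Dump the error telemetry (ereports) in full detail, as suggested above
    fmdump -eV

    # List anything FMA has actually diagnosed as a fault
    fmadm faulty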
Re: [zfs-discuss] performance leakage when copy huge data
On Sep 9, 2010, at 5:55 PM, Fei Xu wrote:
> Just to update the status and findings.

Thanks for the update.

> I've checked TLER settings and they are off by default.
>
> I moved the source pool to another chassis and did the 3.8TB send again.
> This time, no problems at all! The differences are:
> 1. New chassis

Can you describe the old and new chassis in detail? Model numbers?

> 2. BIGGER memory. 32GB vs. 12GB

It is not a memory issue.

> 3. Although wdidle time is disabled by default, I've changed the HD mode
> from silent to performance in HDTune. This is something I once heard from
> some website that might also fix the disk head park/unpark issue (aka, C1).

Not a bad idea.

> Seems TLER is not the root cause, or at least setting it to off is ok.

Definitely not a TLER issue.

> My next steps will be:
> 1. move the HDs back to see if it was the "performance mode" that fixed
> the issue
> 2. if not, add more memory and try again.

It is not a memory issue.

> By the way, in HDTune I saw C7: Ultra DMA CRC error count is a little high,
> which indicates a potential connection issue. Maybe all are caused by the
> enclosure?

Bingo!
-- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
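If it helps anyone chasing a similar symptom, a sketch of ways to watch for connection-level trouble on Solaris; smartmontools is not part of the base OS, and the device path and -d option below are only illustrative:

    # Cumulative soft/hard/transport error counters per device; climbing
    # transport errors usually implicate cabling, backplane or expander
    # rather than the media itself
    iostat -En

    # With smartmontools installed, SMART attribute 199 (UDMA CRC error
    # count) can be read directly; adjust the device path and -d option
    # for the controller in use
    smartctl -a -d sat /dev/rdsk/c7t0d0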
Re: [zfs-discuss] Suggested RaidZ configuration...
On Sep 9, 2010, at 6:39 AM, Marty Scholes wrote:
> Erik wrote:
>> Actually, your biggest bottleneck will be the IOPS limits of the drives.
>> A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
>> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is
>> over 173 hours. Or, about 7.25 WEEKS.
>
> My OCD is coming out and I will split that hair with you. 173 hours is just
> over a week.
>
> This is a fascinating and timely discussion. My personal (biased and
> unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I
> kept reading that RAIDZx should not exceed _ devices and couldn't find real
> numbers behind those conclusions.

There isn't a real number. We know that a 46-disk raidz stripe is a recipe
for unhappiness (because people actually tried that when the thumper was
released). And we know that a 2-disk raidz1 is kinda like mirroring -- a hard
sell. So we had to find a number that was between the two, somewhere in the
realm of reasonable.

> Discussions in this thread have opened my eyes a little and I am in the
> middle of deploying a second 22-disk fibre array on my home server, so I
> have been struggling with the best way to allocate pools.

Simple: mirror it and be happy :-)

> Up until reading this thread, the biggest downside to wide stripes that I
> was aware of has been low IOPS. And let's be clear: while on paper the IOPS
> of a wide stripe is the same as a single disk, it actually is worse. In
> truth, the service time for any request on a wide stripe is the service
> time of the SLOWEST disk for that request. The slowest disk may vary from
> request to request, but will always delay the entire stripe operation.

Yes, but this is not a problem for async writes, so it will depend on the
workload.

> Since all of the 44 spindles are 15K disks, I am about to convince myself
> to go with two pools of wide stripes and keep several spindles for L2ARC
> and SLOG. The thinking is that other background operations (scrub and
> resilver) can take place with little impact to application performance,
> since those will be using L2ARC and SLOG.
>
> Of course, I could be wrong on any of the above.

If you get it wrong, you can reconfigure most things on the fly, except that
you can't add columns to a raidz or shrink it. A good strategy is to start
with what you need and add disks as capacity requires. Oh, and by the way,
the easiest way to do that is with mirrors :-)

But if you insist on raidz, then consider something like 6-way or 8-way sets,
because that is the typical denominator for most hardware trays today.
-- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
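To put numbers and commands behind the points above, a sketch; the IOPS figures are the round numbers quoted from Erik, and the pool/device names are hypothetical:

    # Rebuild-time back-of-the-envelope:
    #   62.5e6 I/Os / 100 IOPS = 625,000 s, roughly 173.6 hours, about 7.2 days

    # Mirrored pairs: start with what you need, then grow two disks at a time
    zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0
    zpool add tank mirror c0t4d0 c0t5d0

    # Alternative to the mirrored layout: if raidz is preferred, size the
    # sets to the tray, e.g. an 8-way raidz2
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0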
Re: [zfs-discuss] performance leakage when copy huge data
> > By the way, in HDTune I saw C7: Ultra DMA CRC error count is a little
> > high, which indicates a potential connection issue. Maybe all are caused
> > by the enclosure?
>
> Bingo!

You are right. I've done a lot of tests and the defect is narrowed down to
the "problem hardware". The two pools work fine in one chassis, but after
moving them back to the original enclosure, they fail during cp or zfs send.

I also noticed that when the machine boots up and reads the ZFS
configuration, there is a warning message:

Reading ZFS config:
WARNING: /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Discovery in progress, can't verify IO unit config

I did search a lot but cannot find more details.

My 2 server configurations:
1. "Problem chassis": Supermicro SuperChassis 847E2, Tysonberg MB with
onboard LSI 1068E (IT mode, which directly exposes the HDs to the system
without RAID), single Xeon 5520.
2. "Good chassis": self-developed chassis by another department. S5000WB MB,
single E5504, 2 PCIe x4 LSI 3081 HBA cards.

The SAS cables all seem to be connected correctly. I suspect the onboard
1068E, and am moving the LSI 3081 card to the "problem" server to test.
[zfs-discuss] Solaris 10u9 with zpool version 22, but no DEDUP (version 21 reserved)
bash-3.00# uname -a
SunOS testxx10 5.10 Generic_142910-17 i86pc i386 i86pc
bash-3.00# zpool upgrade -v
This system is currently running ZFS pool version 22.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Reserved
 22  Received properties

For more information on a particular version, including supported releases,
see the ZFS Administration Guide.

This is an interesting condition. What happens if you use zpools created with
OSOL and dedup on Solaris 10u9?

Hans Foertsch
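A sketch of how pool versions can be inspected, and pinned when a pool has to move between releases; the pool and device names are hypothetical:

    # What this host supports, and what an existing pool is running
    zpool upgrade -v
    zpool get version rpool

    # Create a pool held at an explicit on-disk version, e.g. v22 for
    # Solaris 10u9; note that v21 (where OpenSolaris introduced dedup) is
    # listed as reserved here, so even a v22 pool has no dedup on 10u9
    zpool create -o version=22 tank mirror c2t0d0 c2t1d0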