[zfs-discuss] zpool replace complete but old drives not detached
$ cat /etc/release
                    Solaris Express Community Edition snv_114 X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 04 May 2009

I recently replaced two drives in a raidz2 vdev. However, after the resilver
completed, the old drives were not automatically detached. Why? How do I
detach the drives that were replaced?

# zpool replace tww c6t600A0B800029996605B04668F17Dd0 \
    c6t600A0B8000299CCC099B4A400A9Cd0
# zpool replace tww c6t600A0B800029996605C24668F39Bd0 \
    c6t600A0B8000299CCC0A744A94F7E2d0
... resilver runs to completion ...
# zpool status tww
  pool: tww
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 25h11m with 23375 errors on Sun Sep 6 02:09:07 2009
config:

        NAME                                     STATE     READ WRITE CKSUM
        tww                                      DEGRADED     0     0  207K
          raidz2                                 ONLINE       0     0     0
            c6t600A0B800029996605964668CB39d0    ONLINE       0     0     0
            c6t600A0B8000299CCC06C84744C892d0    ONLINE       0     0     0
            c6t600A0B8000299CCC05B44668CC6Ad0    ONLINE       0     0     0
            c6t600A0B800029996605A44668CC3Fd0    ONLINE       0     0     0
            c6t600A0B8000299CCC05BA4668CD2Ed0    ONLINE       0     0     0
            c6t600A0B800029996605AA4668CDB1d0    ONLINE       0     0     0
            c6t600A0B8000299966073547C5CED9d0    ONLINE       0     0     0
          raidz2                                 DEGRADED     0     0  182K
            replacing                            DEGRADED     0     0     0
              c6t600A0B800029996605B04668F17Dd0  DEGRADED     0     0     0  too many errors
              c6t600A0B8000299CCC099B4A400A9Cd0  ONLINE       0     0     0  255G resilvered
            c6t600A0B8000299CCC099E4A400B94d0    ONLINE       0     0  218K  10.2M resilvered
            c6t600A0B8000299CCC0A6B4A93D3EEd0    ONLINE       0     0   242  246G resilvered
            spare                                DEGRADED     0     0     0
              c6t600A0B8000299CCC05CC4668F30Ed0  DEGRADED     0     0     3  too many errors
              c6t600A0B8000299CCC05D84668F448d0  ONLINE       0     0     0  255G resilvered
            spare                                DEGRADED     0     0     0
              c6t600A0B800029996605BC4668F305d0  DEGRADED     0     0     0  too many errors
              c6t600A0B800029996605C84668F461d0  ONLINE       0     0     0  255G resilvered
            c6t600A0B800029996609EE4A89DA51d0    ONLINE       0     0     0  246G resilvered
            replacing                            DEGRADED     0     0     0
              c6t600A0B800029996605C24668F39Bd0  DEGRADED     0     0     0  too many errors
              c6t600A0B8000299CCC0A744A94F7E2d0  ONLINE       0     0     0  255G resilvered
          raidz2                                 ONLINE       0     0  233K
            c6t600A0B8000299CCC0A154A89E426d0    ONLINE       0     0     0
            c6t600A0B800029996609F74A89E1A5d0    ONLINE       0     0   758  6.50K resilvered
            c6t600A0B8000299CCC0A174A89E520d0    ONLINE       0     0   311  3.50K resilvered
            c6t600A0B800029996609F94A89E24Bd0    ONLINE       0     0 21.8K  32K resilvered
            c6t600A0B8000299CCC0A694A93D322d0    ONLINE       0     0     0  1.85G resilvered
            c6t600A0B8000299CCC0A0C4A89DDE8d0    ONLINE       0     0 27.4K  41.5K resilvered
            c6t600A0B800029996609F04A89DB1Bd0    ONLINE       0     0 7.13K  24K resilvered
        spares
          c6t600A0B8000299CCC05D84668F448d0      INUSE     currently in use
          c6t600A0B800029996605C84668F461d0      INUSE     currently in use
          c6t600A0B80002999660A454A93CEDBd0      AVAIL
          c6t600A0B80002999660ADA4A9CF2EDd0      AVAIL

-- 
albert chin (ch...@thewrittenword.com)
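[For reference, the usual way to finish an incomplete replace by hand is to detach the stale half of each "replacing" group yourself. A hedged sketch, assuming the resilvered replacements are healthy - device names are taken from the status output above, and the detach may be refused depending on the error/resilver state. Detaching the failed originals under the "spare" groups would similarly promote the hot spares permanently.]

# zpool detach tww c6t600A0B800029996605B04668F17Dd0
# zpool detach tww c6t600A0B800029996605C24668F39Bd0
# zpool status tww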
[zfs-discuss] periodic slow responsiveness
I’m experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an ‘ls’ (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo), and remotely via NFS.

The zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS (using the Sun SFW package running Samba 3.0.34) server, with authentication taking place from a remote OpenLDAP server.

Automount is in use both locally and remotely (Linux clients). Locally /home/* is remounted from the zpool; remotely /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.

The problem has definitely been observed with stats (of some form, typically ‘/usr/bin/ls’ output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and reoccurrence of the fault, in that one user has scripted a regular (15m/30m/hourly tests so far) ‘ls’ of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.

Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

My DTrace is very poor, but I’m suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and is able to share it, that would be much appreciated.

Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

cheers,
James
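[Since DTrace code was asked for: a minimal sketch of my own (not from the thread) that quantizes the latency of stat-family and getdents syscalls on the server, so a multi-second ‘ls’ stall shows up as outliers in the histogram. For stalls seen only by NFS clients you would want to trace the NFS server ops instead, but this is a starting point.]

# dtrace -n '
  syscall::*stat*:entry, syscall::getdents*:entry { self->ts = timestamp; }
  syscall::*stat*:return, syscall::getdents*:return /self->ts/ {
    @lat[probefunc] = quantize((timestamp - self->ts) / 1000000);  /* milliseconds */
    self->ts = 0;
  }'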
Re: [zfs-discuss] periodic slow responsiveness
On Sun, Sep 6, 2009 at 9:15 AM, James Lever wrote:
> I’m experiencing occasional slow responsiveness on an OpenSolaris b118
> system typically noticed when running an ‘ls’ (no extra flags, so no
> directory service lookups). There is a delay of between 2 and 30 seconds
> but no correlation has been noticed with load on the server and the slow
> return.
> [...]
> Any other suggestions for how to identify this fault and work around it
> would be greatly appreciated.

That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

You have iSCSI, NFS and CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.

-Ross
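[If you go down that path, the services can be bounced one at a time through SMF. A rough sketch only - the exact FMRIs depend on the build and on how Samba was installed, so check with svcs first.]

# svcs -a | egrep 'nfs/server|iscsi|samba|smb'        # find the exact FMRIs first
# svcadm restart svc:/network/nfs/server:default
# svcadm restart svc:/network/iscsi/target:default    # COMSTAR target, if that FMRI exists on b118
# svcadm restart network/samba                        # SFW Samba, if it has an SMF manifest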
[zfs-discuss] Yet another "where did the space go" question
An attempt to pkg image-update from snv111b to snv122 failed miserably for a number of reasons which are probably out of scope here. Suffice it to say that it ran out of disk space after the third attempt. Before starting, I was careful to make a baseline snapshot, but rolling back to that snapshot has not freed up all the space - this on a small disk dedicated to experimenting with ZFS booting on SPARC. The disk is nominally 20GB.

After "zfs rollback -rR rpool/ROOT/opensola...@baseline" from a different BE (snv103 booted from UFS):

# zpool list
NAME    SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
rpool  17.5G  10.1G  7.39G   57%  ONLINE  -
space  1.36T   314G  1.05T   22%  ONLINE  -

# zfs list -r -o space rpool
NAME                        AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                       7.11G  10.1G         0     20K              0      10.1G
rpool/ROOT                  7.11G  10.1G         0     18K              0      10.1G
rpool/ROOT/opensolaris      7.11G  10.1G      942K   10.0G              0      68.6M
rpool/ROOT/opensolaris/opt  7.11G  68.6M         0   68.6M              0          0

Before the aborted pkg image-updates, the rpool took around 6GB, so 4GB has vanished somewhere. Even if pkg put its updates in a well hidden place (there are no hidden directories in / ), surely the rollback should have deleted them.

# zfs list -t snapshot
NAME                                   USED  AVAIL  REFER  MOUNTPOINT
rp...@baseline                            0      -    20K  -
rpool/r...@baseline                       0      -    18K  -
rpool/ROOT/opensola...@baseline         718K      -  10.0G  -
rpool/ROOT/opensolaris/o...@baseline      0      -  68.6M  -

The rollback obviously worked, because afterwards even the pkg set-publisher changes were gone, and other post-snapshot files were deleted. If the worst comes to the worst I could obviously save the snapshot to a file and then restore it, but it sure would be nice to know where the 4GB went.

BTW one image-update failure occurred because there was an X86 rpool mounted to an alternate root, and pkg somehow found it and seemed to get confused about X86 vs. SPARC, insisting on trying to create a menu.lst in /rpool/boot, which, of course, doesn't exist on SPARC. I suppose this should be a bug...

Thanks -- Frank
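[One way to hunt for the missing space is to look for clones and leftover boot environments that the failed image-updates may have created. A hedged sketch - nothing here is specific to this layout, and beadm is only available where the OpenSolaris BE tools are installed.]

# zfs list -r -t all -o name,used,usedbysnapshots,refer,origin rpool
# zfs get -r origin rpool     # clones keep their origin snapshots (and their blocks) alive
# beadm list -a               # any half-built BEs still holding datasets?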
[zfs-discuss] Recover after zpool add -f ? Is it possible?
Hello. Recently I've upgraded one of my machines to OpenSolaris 2009.06; the box has a few HDDs: (1) the main one with the OS, etc. and (2) archive and ...

After installing I created users, environment, etc. on HDD (1), and then wanted to mount the existing ZFS for the rest (from (2), etc.). And by mistake, for (2) I did zpool add -f space c9d0. That's it. The "space" pool is empty. zpool import -D shows nothing. zpool list shows for this drive:

space   696G   76K   696G    0%  ONLINE  -

So, the question is simple. Is it possible to recover it (I mean what was on c9d0 before I did zpool add)?

Thanks.
-- 
This message posted from opensolaris.org
[zfs-discuss] Wikipedia on ZFS
The "limitations" section of the Wikipedia article on ZFS currently includes the statement:

   "You cannot mix vdev types in a zpool. For example, if you had a striped
   ZFS pool consisting of disks on a SAN, you cannot add the local-disks as
   a mirrored vdev."

As I understand it, this is simply wrong. You can add any kind of vdev to any zpool. Right?

As I understand it, mixing vdev types is in general a bad idea. The reliability of a zpool is dictated by its least reliable vdev, and the performance of a zpool tends to be limited by its lowest-performing vdevs. So mixing vdev types tends to give you the worst of all possible worlds.

But this is merely a logical consequence of mixing storage types, not anything to do with ZFS. Precisely analogous considerations would apply when setting up RAID-10, for example.

Do I have that right?

Cheers,
Al.
-- 
This message posted from opensolaris.org
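[For what it's worth, zpool itself pushes back before letting you mix vdev types. A sketch of what that looks like - pool and device names are invented and the error text is paraphrased from memory.]

# zpool add tank mirror c3t0d0 c3t1d0       # tank already built from raidz vdevs
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses raidz and new vdev is mirror
# zpool add -f tank mirror c3t0d0 c3t1d0    # forces the mixed layout anyway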
Re: [zfs-discuss] Wikipedia on ZFS
You can't without forcing... and even if you do, it's not a good idea. If you do, your pool will have the lowest redundancy/speed of the worst vdev: if you add a single drive to a raidz vdev, your entire pool loses its redundancy; if you add a single drive to a mirror, likewise; if you add a raidz group to a group of 3 mirrors, the entire pool slows down to the speed of the raidz.

While you technically CAN do it, it's a horrible idea.

On Sun, Sep 6, 2009 at 2:19 PM, Al Lang wrote:
> The "limitations" section of the Wikipedia article on ZFS currently
> includes the statement:
>
>    "You cannot mix vdev types in a zpool. [...]"
>
> As I understand it, this is simply wrong. You can add any kind of vdev to
> any zpool. Right?
> [...]
> Do I have that right?
Re: [zfs-discuss] Wikipedia on ZFS
> if you add a raidz group to a group of 3 mirrors, the entire pool slows
> down to the speed of the raidz.

That's not true. Blocks are being randomly spread across all vdevs. Unless all requests keep pulling blocks from the RAID-Z, the speed is a mean of the performance of all vdevs.

-mg
Re: [zfs-discuss] Wikipedia on ZFS
Yes, but it stripes across the vdevs, and when it needs to read data back, it will absolutely be limited.

On Sun, Sep 6, 2009 at 3:14 PM, Mario Goebbels wrote:
>> if you add a raidz group to a group of 3 mirrors, the entire pool slows
>> down to the speed of the raidz.
>
> That's not true. Blocks are being randomly spread across all vdevs. Unless
> all requests keep pulling blocks from the RAID-Z, the speed is a mean of
> the performance of all vdevs.
>
> -mg
Re: [zfs-discuss] Wikipedia on ZFS
On Sep 6, 2009, at 3:32 PM, Thomas Burgess wrote:

> yes, but it stripes across the vdevs, and when it needs to read data back,
> it will absolutely be limited.

During reads the raidz will be the fastest vdev; during writes it should have about the same write performance as any one mirror vdev, depending on how many disks it's comprised of. Random I/O on raidz should perform equal to any one mirror vdev, except be a little faster on reads.

-Ross
Re: [zfs-discuss] periodic slow responsiveness
On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:

> On Sun, Sep 6, 2009 at 9:15 AM, James Lever wrote:
>> I’m experiencing occasional slow responsiveness on an OpenSolaris b118
>> system typically noticed when running an ‘ls’ (no extra flags, so no
>> directory service lookups). [...] This problem has only been noticed via
>> NFS (v3. [...]) The problem has been observed both locally on the primary
>> filesystem, in a locally automounted reference (/home/foo) and remotely
>> via NFS.

I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"

>> [...]
>> There is a clear correlation with recency of reads of the directories in
>> question and reoccurrence of the fault, in that one user has scripted a
>> regular (15m/30m/hourly tests so far) ‘ls’ of the filesystems of interest
>> and this has reduced the fault to have minimal noted impact. [...]

iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.

>> I have removed the l2arc(s) (cache devices) from the pool and the same
>> behaviour has been observed. [...] Another thought I had was along the
>> lines of filesystem caching and heavy writes causing read blocking. [...]

stat(2) doesn't write, so you can stop worrying about the slog.

>> My DTrace is very poor but I’m suspicious that this is the best way to
>> root cause this problem. [...] Any other suggestions for how to identify
>> this fault and work around it would be greatly appreciated.

Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.

> That behavior sounds a lot like a process has a memory leak and is filling
> the VM. On Linux there is an OOM killer for these, but on OpenSolaris,
> you're the OOM killer.

See rcapd(1m), rcapadm(1m), and rcapstat(1m) along with "Physical Memory Control Using the Resource Capping Daemon" in the System Administration Guide: Solaris Containers--Resource Management and Solaris Zones.
 -- richard

> You have iSCSI, NFS, CIFS to choose from (most obvious), try restarting
> them one at a time during down time and see if performance improves after
> each restart to find the culprit.
>
> -Ross
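[A hedged sketch of putting the resource-capping suggestion into practice - the project name is invented, and caps are set per project or zone as described in the guide mentioned above.]

# rcapadm -E                                      # enable the resource capping daemon
# projmod -s -K "rcap.max-rss=4GB" user.builder   # cap the resident set of one project
# rcapstat 5                                      # watch paging activity for capped projects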
Re: [zfs-discuss] Yet another "where did the space go" question
Correction.

On 09/06/09 12:00 PM, I wrote:
> (there are no hidden directories in / ),

Well, there is .zfs, of course, but it is normally hidden, apparently by default on SPARC rpool, but not on X86 rpool or non-rpool pools on either. Hmmm. I don't recollect setting the snapdir property on any pools, ever.

Arrg! It just failed again!

# pkg image-update --be-name=snv122
DOWNLOAD                               PKGS        FILES        XFER (MB)
Completed                         1486/1486  73091/73091  1520.59/1520.59
WARNING: menu.lst file /rpool/boot/menu.lst does not exist,
         generating a new menu.lst file
pkg: Unable to clone the current boot environment.

# BE_PRINT_ERR=true beadm create newbe
be_get_uuid: failed to get uuid property from BE root dataset user properties.
be_get_uuid: failed to get uuid property from BE root dataset user properties.
# zfs list -t snapshot | grep newbe
rpool/ROOT/opensola...@newbe            30K      -  11.9G  -
rpool/ROOT/opensolaris/o...@newbe         0      -  68.6M  -

So it can create a new BE. So what happened this time? I guess I'll try again with BE_PRINT_ERR=true...

Is the get uuid property failure fatal to pkg but not to beadm?

Has anyone managed to go from snv111b to snv122 on SPARC?

Thanks -- Frank
Re: [zfs-discuss] Wikipedia on ZFS
On Sep 6, 2009, at 11:19 AM, Al Lang wrote:

> The "limitations" section of the Wikipedia article on ZFS currently
> includes the statement:
>
>    "You cannot mix vdev types in a zpool. For example, if you had a
>    striped ZFS pool consisting of disks on a SAN, you cannot add the
>    local-disks as a mirrored vdev."
>
> As I understand it, this is simply wrong. You can add any kind of vdev to
> any zpool. Right?

This is so confusing that it makes absolutely no sense. I suspect the author intended to explain the limitation as described in the zpool(1m) man page:

     Virtual devices cannot be nested, so a mirror or raidz virtual
     device can only contain files or disks. Mirrors of mirrors (or
     other combinations) are not allowed.

so I edited it :-)

> As I understand it, mixing vdev types is in general a bad idea. The
> reliability of a zpool is dictated by its least reliable vdev and the
> performance of a zpool tends to be limited by its lowest-performing vdevs.
> So mixing vdev types tends to give you the worst of all possible worlds.

It is really a data management issue, not a functionality issue. We really do try to help people not hurt themselves with complexity ;-)
 -- richard

> But this is merely a logical consequence of mixing storage types, not
> anything to do with ZFS. Precisely analogous considerations would apply
> when setting up RAID-10, for example.
>
> Do I have that right?
>
> Cheers,
> Al.
Re: [zfs-discuss] Wikipedia on ZFS
On Sun, 6 Sep 2009, Thomas Burgess wrote:

> if you add a raidz group to a group of 3 mirrors, the entire pool slows
> down to the speed of the raidz.
>
> while you technically CAN do it, it's a horrible idea.

I don't think it is necessarily as horrid as you say. Zfs does distribute writes to faster vdevs under heavy write load. There can be vdevs of somewhat different type and size (e.g. raidz and raidz2) which behave similarly enough to not cause much imbalance in the pool.

Saying that mixing vdev types is horrid is similar in nature to saying that not using the same type and model of drive throughout the pool is horrid. If you added a raidz vdev to a pool already using raidz vdevs, but the new drives are much faster than the existing drives, then the same sort of imbalance exists.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Wikipedia on ZFS
I don't think it's the same at all. I think it's about the same as filling a radiator in a car with oatmeal to make it stop leaking.

On Sun, Sep 6, 2009 at 6:26 PM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:
> I don't think it is necessarily as horrid as you say. Zfs does distribute
> writes to faster vdevs under heavy write load.
> [...]
> Saying that mixing vdev types is horrid is similar in nature to saying
> that not using the same type and model of drive throughout the pool is
> horrid.
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 12:53 AM, Ross Walker wrote:

> That behavior sounds a lot like a process has a memory leak and is filling
> the VM. On Linux there is an OOM killer for these, but on OpenSolaris,
> you're the OOM killer.

If it was this type of behaviour, where would it be logged when the process was killed/restarted? If it's not logged by default, can that be enabled?

I have not seen any evidence of this in /var/adm/messages, /var/log/syslog, or my /var/log/debug (*.debug), but perhaps I'm not looking for the right clues.

> You have iSCSI, NFS, CIFS to choose from (most obvious), try restarting
> them one at a time during down time and see if performance improves after
> each restart to find the culprit.

The downtime is being reported by users, and I have only seen it once (while in their office), so this method of debugging isn't going to help, I'm afraid. (This is why I asked about alternate root cause analysis methods.)

cheers,
James
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 6:24 AM, Richard Elling wrote:

> I'm confused. If "This problem has only been noticed via NFS (v3" then how
> is it "observed locally?"

Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI. It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

> iostat(1m) is the program for troubleshooting performance issues related
> to latency. It will show the latency of nfs mounts as well as other
> devices.

What specifically should I be looking for here (using ‘iostat -xen -T d’)? I'm guessing I'll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

> stat(2) doesn't write, so you can stop worrying about the slog.

My concern here was that I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

> Rule out the network by looking at retransmissions and ioerrors with
> netstat(1m) on both the client and server.

No errors or collisions from either server or clients observed.

> See rcapd(1m), rcapadm(1m), and rcapstat(1m) along with "Physical Memory
> Control Using the Resource Capping Daemon" in the System Administration
> Guide: Solaris Containers--Resource Management and Solaris Zones

Thanks Richard, I'll have a look at that today and see where I get.

cheers,
James
Re: [zfs-discuss] periodic slow responsiveness
Sorry for my earlier post, I responded prematurely.

On Sep 6, 2009, at 9:15 AM, James Lever wrote:

> I’m experiencing occasional slow responsiveness on an OpenSolaris b118
> system typically noticed when running an ‘ls’ (no extra flags, so no
> directory service lookups). There is a delay of between 2 and 30 seconds
> but no correlation has been noticed with load on the server and the slow
> return. [...]

Have you tried snoop/tcpdump/wireshark on the client side and server side to figure out what is being sent and exactly how long it is taking to get a response?

> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/
> 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E) with 2x
> SSDs each partitioned as 10GB slog and 36GB remainder as l2arc behind
> another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about two 5-disk RAIDZ2s or three 4-disk RAIDZs? Just one or two other vdevs to spread the load can make the world of difference.

> The system is configured as an NFS (currently serving NFSv3), iSCSI
> (COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34) with
> authentication taking place from a remote openLDAP server.

There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.

> Automount is in use both locally and remotely (linux clients). [...] There
> was some suspicion that automount is the problem, but no definitive
> evidence as of yet.

Try taking a particularly bad problem station and configuring it static for a bit to see if it is.

> The problem has definitely been observed with stats (of some form,
> typically ‘/usr/bin/ls’ output) both remotely, locally in /home/* and
> locally in /zpool/home/* (the true source location). There is a clear
> correlation with recency of reads of the directories in question and
> reoccurrence of the fault in that one user has scripted a regular
> (15m/30m/hourly tests so far) ‘ls’ of the filesystems of interest [...]

Sounds like the user is pre-fetching his attribute cache to overcome poor performance.

> I have removed the l2arc(s) (cache devices) from the pool and the same
> behaviour has been observed. My suspicion here was that there was perhaps
> occasional high synchronous load causing heavy writes to the slog devices
> and when a stat was requested it may have been faulting from ARC to L2ARC
> prior to going to the primary data store. [...]

That doesn't make a lot of sense to me; the L2ARC is secondary read cache. If writes are starving reads then the L2ARC would only help here.

> Another thought I had was along the lines of filesystem caching and heavy
> writes causing read blocking. [...] Can anybody comment on the
> effectiveness of this (I have 256MB write cache in front of the slog SSDs
> and 512MB in front of the primary storage devices)?

It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.

> My DTrace is very poor but I’m suspicious that this is the best way to
> root cause this problem. If somebody has any code that may assist in
> debugging this problem and was able to share it, that would be much
> appreciated.

DTrace would tell you, but I wish the learning curve wasn't so steep to get it going.

> Any other suggestions for how to identify this fault and work around it
> would be greatly appreciated.

I hope I gave some good pointers. I'd first look at the pool configuration.

-Ross
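[A sketch of what a reshaped pool along those lines could look like - the pool name and the exact device split are illustrative, and it would mean destroying and recreating the pool from backup.]

# zpool create tank \
    raidz2 c11t0d0 c11t1d0 c11t2d0 c11t3d0 c11t4d0 \
    raidz2 c11t5d0 c11t6d0 c11t7d0 c11t8d0 c11t9d0 \
    log c7t2d0 c7t3d0
# zpool status tank      # two top-level raidz2 vdevs now split the write IOPS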
Re: [zfs-discuss] periodic slow responsiveness
On Sep 6, 2009, at 5:06 PM, James Lever wrote:

> Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.
> It has been observed in client:/home/user (NFSv3 automount from
> server:/home/user, redirected to server:/zpool/home/user) and also in
> server:/home/user (local automount) and server:/zpool/home/user (origin).

Ok, just so I am clear, when you mean "local automount" you are on the server and using the loopback -- no NFS or network involved?

> What specifically should I be looking for here (using ‘iostat -xen -T d’)?
> I'm guessing I'll require a high level of granularity (1s intervals) to
> see the issue if it is a single disk or similar.

You are looking for I/O that takes seconds to complete or is stuck in the device. This is in the actv column: stuck > 1 and the asvc_t >> 1000.

> My concern here was that I may have been trying to write (via other
> concurrent processes) at the same time as there was a memory fault from
> the ARC to L2ARC.

stat(2) looks at metadata, which is generally small and compressed. It is also cached in the ARC, by default. If this is repeatable in a short period of time, then it is not an I/O problem and you need to look at:
  1. the number of files in the directory
  2. the locale (ls sorts by default, and your locale affects the sort time)

> No errors or collisions from either server or clients observed.

retrans? As Ross mentioned, wireshark, snoop, or most other network monitors will show network traffic in detail.
 -- richard
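[A quick way to watch for exactly that - a sketch of my own; the awk field positions assume the plain ‘iostat -xn’ layout (actv is field 6, asvc_t is field 8), so they shift if the -e error columns are added.]

# iostat -xn 1 | awk '$6 > 1 || $8 > 1000'    # print only devices with queued I/O or multi-second service times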
Re: [zfs-discuss] Yet another "where did the space go" question
Near Success!

After 5 (yes, five) attempts, I managed to do an update of snv111b to snv122, until it ran out of space again. Looks like I need to get a bigger disk...

Sorry about the monolog, but there might be someone on this list trying to use pkg on SPARC who, like me, has been unable to subscribe to the indiana list, so an update might be useful to any such person... Perhaps someone who can might forward this to the appropriate list -- the issues are known CRs, but don't seem to be mentioned in the release notes.

On 09/06/09 04:55 PM, I wrote:
> WARNING: menu.lst file /rpool/boot/menu.lst does not exist,
>          generating a new menu.lst file
> pkg: Unable to clone the current boot environment.

1) If there isn't a directory /rpool/boot, pkg will fail.
2) If you try again after mkdir /rpool/boot, it will create menu.lst. If it fails for any reason and you have to restart, then:
3) If there is a menu.lst containing opensolaris-1, it will fail again even if you had used be-name=.
4) If you delete menu.lst it will fail - touch it after deleting it (the CRs are ambiguous about this).

So to do this upgrade, you must do mkdir /rpool/boot and touch /rpool/boot/menu.lst before you start. It might just work if you do this, but only if you have at least 11GB of space to spare (Google says 8GB).

BTW pkg always says "/rpool/boot/menu.lst does not exist" even if it does.

http://defect.opensolaris.org/bz/show_bug.cgi?id=6744 says "Fixed in source";
http://defect.opensolaris.org/bz/show_bug.cgi?id=7880 says "accepted".
But the fix for 6744 messes up 7880. This is making a SPARC upgrade really painful, especially annoying since SPARC doesn't even use grub (or menu.lst).

Cheers -- Frank

PS My hat's off to the ZFS and pkg teams! An amazing accomplishment, and a few glitches are to be expected. I'm sure there are fixes in the works, but it would seem upgrading to snv122 isn't in the cards unless I get a bigger 3rd boot disk...
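[Put together, the workaround described above looks roughly like this; the be-name is the one used in this thread, and the last line only applies when retrying after a failed run that left a stale opensolaris-1 entry behind.]

# mkdir -p /rpool/boot
# touch /rpool/boot/menu.lst
# pkg image-update --be-name=snv122
# rm /rpool/boot/menu.lst && touch /rpool/boot/menu.lst   # reset the stub before retrying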
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 11:08 AM, Richard Elling wrote:

> Ok, just so I am clear, when you mean "local automount" you are on the
> server and using the loopback -- no NFS or network involved?

Correct. And the behaviour has been seen locally as well as remotely.

> You are looking for I/O that takes seconds to complete or is stuck in the
> device. This is in the actv column: stuck > 1 and the asvc_t >> 1000.

Just started having some slow responsiveness reported from a user using emacs (autosave, start of a build), so a small file write request. The second or so before they went to do this, it appears as if the RAID cache in front of the slog devices was nearly filled and the SSDs were being utilised quite heavily, but then there was a break where I am seeing relatively light usage on the slog but 100% busy on the device reported.

The iostat output is at the end of this message - I can't make any real sense out of why a user would have seen a ~4s delay at about 2:39:17-18. Only one of the two slog devices is being used at all. Is there some tunable about how multiple slogs are used?

c7t[01] are rpool
c7t[23] are slog devices in the data pool
c11t* are the primary storage devices for the data pool

cheers,
James

Monday, 7 September 2009 2:39:17 PM EST
                            extended device statistics                        ---- errors ----
    r/s    w/s   kr/s      kw/s  wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0 1475.0    0.0  188799.0   0.0  30.2    0.0   20.5   2  90   0   0   0   0 c7t2d0
    0.0  232.0    0.0   29571.8   0.0  33.8    0.0  145.9   0  98   0   0   0   0 c7t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0
Monday, 7 September 2009 2:39:18 PM EST
                            extended device statistics                        ---- errors ----
    r/s    w/s   kr/s      kw/s  wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0    0.0    0.0       0.0   0.0  35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0
Monday, 7 September 2009 2:39:19 PM EST
                            extended device statistics                        ---- errors ----
    r/s    w/s   kr/s      kw/s  wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  341.0    0.0   43650.1   0.0  35.0    0.0  102.5   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
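[On the question of how the two slogs are shared, one way to watch it from ZFS's side rather than from iostat - a sketch; substitute the real pool name.]

# zpool iostat -v <pool> 1     # per-vdev breakdown; separate log devices appear under "logs"

My understanding is that multiple separate log devices are treated like any other set of top-level vdevs (dynamically striped, with no explicit tunable), but the per-vdev output above should confirm how the ZIL writes are actually landing.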
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 10:46 AM, Ross Walker wrote:

> This config might lead to heavy sync writes (NFS) starving reads, due to
> the fact that the whole RAIDZ2 behaves as a single disk on writes. How
> about two 5-disk RAIDZ2s or three 4-disk RAIDZs? Just one or two other
> vdevs to spread the load can make the world of difference.

This was a management decision. I wanted to go down the striped mirrored pair solution, but the amount of space lost was considered too great. RAIDZ2 was considered the best value option for our environment.

> There are a lot of services here, all off one pool? You might be trying to
> bite off more than the config can chew.

That's not a lot of services, really. We have 6 users doing builds on multiple platforms and using the storage as their home directory (Windows and Unix). The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

> Try taking a particularly bad problem station and configuring it static
> for a bit to see if it is.

That has been considered also, but the issue has also been observed locally on the fileserver.

> That doesn't make a lot of sense to me; the L2ARC is secondary read cache.
> If writes are starving reads then the L2ARC would only help here.

I was suggesting that slog writes were possibly starving reads from the l2arc, as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

> It just may be that the pool configuration just can't handle the write
> IOPS needed and reads are starving.

Possible, but hard to tell. Have a look at the iostat results I've posted.

cheers,
James