Re: [zfs-discuss] Wrong rpool used after reinstall!
On Thu, Aug 04, 2011 at 03:52:39AM -0700, Stuart James Whitefish wrote:
> Jim wrote:
> >> But I may be wrong, and anyway the single user shell in the u9 DVD also
> >> panics when I try to import tank, so maybe that won't help.
>
> Ian wrote:
> > Put your old drive in a USB enclosure and connect it
> > to another system in order to read back the data.
>
> Given that update 9 can't import the pool, is this really worth trying?
> I would have to buy the enclosures; if I had them already I would have
> tried it in desperation.
>
> Jim wrote:
> > > I have only 4 SATA ports on this Intel box, so I have to keep pulling
> > > cables to be able to boot from a DVD, and then I won't have all my
> > > drives available. I cannot move these drives to any other box because
> > > they are consumer drives and my servers all have ultras.
>
> Ian wrote:
> > Most modern boards will boot from a live USB stick.
>
> True, but I haven't found a way to get an ISO onto a USB stick that my
> system can boot from. I was using dd to copy the ISO to the USB drive.
> Is there some other way?

Maybe give http://unetbootin.sourceforge.net/ a try.

Bill

> This is really frustrating. I haven't had any problems with Linux
> filesystems, but I heard ZFS was safer. It's really ironic that I lost
> access to so much data after moving it to ZFS. Isn't there any way to
> get it back on my newly installed U8 system? If I disconnect this pool,
> the system starts fine. Otherwise, my questions above in my summary
> post might be key to getting this working.
>
> Thanks,
> Jim
[zfs-discuss] ZFS web admin - No items found.
Hi experts,

I installed Solaris 10 06/06 x86 on VMware 5.5 and administered zfs both
from the command line and from the web console; all was good. The web
admin is more convenient -- I don't need to type commands.

But after my computer lost power and restarted, I have a problem with
the zfs web admin (https://hostname:6789/zfs). When I try to create a
new storage pool from the web interface, it always shows "No items
found", even though there are in fact 10 hard disks available. I can
still use the zpool/zfs command line to create new pools, file systems
and volumes; the command-line way works quickly and correctly.

I have tried restarting the service (smcwebserver), with no effect.

Has anyone seen this? Is it a bug?

Regards,
Bill
[zfs-discuss] Re: ZFS web admin - No items found.
When I run the command, it prompts:

# /usr/lib/zfs/availdevs -d
Segmentation Fault - core dumped
[zfs-discuss] Re: ZFS web admin - No items found.
# /usr/lib/zfs/availdevs -d
Segmentation Fault - core dumped
# pstack core
core 'core' of 2350:    ./availdevs -d
-----------------  lwp# 1 / thread# 1  --------------------
 d2d64b3c strlen   (0) + c
 d2fa2f82 get_device_name (8063400, 0, 804751c, 1c) + 3e
 d2fa3015 get_disk (8063400, 0, 804751c, 8067430) + 4d
 d2fa3bbf dmgt_avail_disk_iter (8050ddb, 8047554) + a1
 08051305 main     (2, 8047584, 8047590) + 110
 08050ce6          (2, 80476b0, 80476bc, 0, 80476bf, 80476f9)
-----------------  lwp# 2 / thread# 2  --------------------
 d2de1a81 _door_return (0, 0, 0, 0) + 31
 d29f0d3d door_create_func (0) + 29
 d2ddf93e _thr_setup (d2992400) + 4e
 d2ddfc20 _lwp_start (d2992400, 0, 0, d2969ff8, d2ddfc20, d2992400)
-----------------  lwp# 3 / thread# 3  --------------------
 d2ddfc99 __lwp_park (809afc0, 809afd0, 0) + 19
 d2dda501 cond_wait_queue (809afc0, 809afd0, 0, 0) + 3b
 d2dda9fa _cond_wait (809afc0, 809afd0) + 66
 d2ddaa3c cond_wait (809afc0, 809afd0) + 21
 d2a92bc8 subscriber_event_handler (80630c0) + 3f
 d2ddf93e _thr_setup (d275) + 4e
 d2ddfc20 _lwp_start (d275, 0, 0, d2865ff8, d2ddfc20, d275)
-----------------  lwp# 4 / thread# 4  --------------------
 d2de0cd5 __pollsys (d274df78, 1, 0, 0) + 15
 d2d8a6d2 poll     (d274df78, 1, ) + 52
 d2d0ee1e watch_mnttab (0) + af
 d2ddf93e _thr_setup (d2750400) + 4e
 d2ddfc20 _lwp_start (d2750400, 0, 0, d274dff8, d2ddfc20, d2750400)
-----------------  lwp# 5 / thread# 5  --------------------
 d2ddfc99 __lwp_park (8064ef0, 8064f00, 0) + 19
 d2dda501 cond_wait_queue (8064ef0, 8064f00, 0, 0) + 3b
 d2dda9fa _cond_wait (8064ef0, 8064f00) + 66
 d2ddaa3c cond_wait (8064ef0, 8064f00) + 21
 d2a92bc8 subscriber_event_handler (8064be0) + 3f
 d2ddf93e _thr_setup (d2750800) + 4e
 d2ddfc20 _lwp_start (d2750800, 0, 0, d24edff8, d2ddfc20, d2750800)
Re: [zfs-discuss] compressed root pool at installation time with flash archive predeployment script
On 03/02/10 12:57, Miles Nordin wrote:
> >>>>> "cc" == chad campbell writes:
>
>     cc> I was trying to think of a way to set compression=on
>     cc> at the beginning of a jumpstart.
>
> are you sure grub/ofwboot/whatever can read compressed files?

Grub and the sparc zfs boot blocks can read lzjb-compressed blocks in
zfs.  I have compression=on (and copies=2) for both sparc and x86 roots;
I'm told that grub's zfs support also knows how to fall back to ditto
blocks if the first copy fails to be readable or has a bad checksum.

- Bill
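For reference, the properties mentioned above are ordinary zfs settings;
on an installed root they'd be set with something like the lines below
(dataset names assume the usual "rpool/ROOT" layout, and only blocks
written after the change are affected).  In the flash-archive/jumpstart
case, the same commands could be run from the predeployment or finish
script before the bulk of the data is laid down:

   # zfs set compression=on rpool/ROOT
   # zfs set copies=2 rpool/ROOT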
Re: [zfs-discuss] swap across multiple pools
On 03/03/10 05:19, Matt Keenan wrote:
> In a multipool environment, would it make sense to add swap to a pool
> outside of the root pool, either as the sole swap dataset to be used
> or as extra swap?

Yes.  I do it routinely, primarily to preserve space on boot disks on
large-memory systems.  swap can go in any pool, while dump has the same
limitations as root: a single top-level vdev, single-disk or mirrors
only.

> Would this have any performance implications?

If the non-root pool has many spindles, random read I/O should be
faster and thus swap I/O should be faster.  I haven't attempted to
measure whether this makes a difference.  I generally set
primarycache=metadata on swap zvols, but I also haven't been able to
measure whether that makes any difference.

My users do complain when /tmp fills because there isn't sufficient
swap, so I do know I need large amounts of swap on these systems.
(When migrating one such system from Nevada to OpenSolaris recently I
forgot to add swap to /etc/vfstab.)

- Bill
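A rough sketch of adding swap in a non-root pool (pool and volume names
here are examples only; size to taste):

   # zfs create -V 16g -b $(pagesize) tank/swap
   # zfs set primarycache=metadata tank/swap
   # swap -a /dev/zvol/dsk/tank/swap

and, to make it persistent across reboots, an /etc/vfstab line of the
form:

   /dev/zvol/dsk/tank/swap  -  -  swap  -  no  -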
Re: [zfs-discuss] Snapshot recycle freezes system activity
On 03/08/10 12:43, Tomas Ögren wrote:
> So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini
> Slim) as metadata L2ARC and that seems to have pushed the snapshot
> times down to about 30 seconds.

Out of curiosity, how much physical memory does this system have?
Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)
On 03/08/10 17:57, Matt Cowger wrote:
> Change zfs options to turn off checksumming (don't want it or need
> it), atime, compression, 4K block size (this is the application's
> native blocksize), etc.

Even when you disable checksums and compression through the zfs
command, zfs will still compress and checksum metadata.  The evil
tuning guide describes an unstable interface to turn off metadata
compression, but I don't see anything in there for metadata checksums.

If you have an actual need for an in-memory filesystem, will tmpfs fit
the bill?

- Bill
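For the record, the per-dataset knobs mentioned in the original post
are plain zfs properties, along the lines of the following (the dataset
name is just an example, and none of this affects metadata):

   # zfs set checksum=off ramdiskpool/data
   # zfs set compression=off ramdiskpool/data
   # zfs set atime=off ramdiskpool/data
   # zfs set recordsize=4k ramdiskpool/data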
Re: [zfs-discuss] Scrub not completing?
On 03/17/10 14:03, Ian Collins wrote:
> I ran a scrub on a Solaris 10 update 8 system yesterday and it is 100%
> done, but not complete:
>
>  scrub: scrub in progress for 23h57m, 100.00% done, 0h0m to go

Don't panic.  If "zpool iostat" still shows active reads from all disks
in the pool, just step back and let it do its thing until it says the
scrub is complete.

There's a bug open on this:

  6899970 scrub/resilver percent complete reporting in zpool status
          can be overly optimistic

Scrub/resilver progress reporting compares the number of blocks read so
far to the number of blocks currently allocated in the pool.  If blocks
that have already been visited are freed and new blocks are allocated,
the seen:allocated ratio is no longer an accurate estimate of how much
more work is needed to complete the scrub.

Before the scrub prefetch code went in, I would routinely see scrubs
last 75 hours which had claimed to be "100.00% done" for over a day.

- Bill
Re: [zfs-discuss] sympathetic (or just multiple) drive failures
On 03/19/10 19:07, zfs ml wrote:
> What are peoples' experiences with multiple drive failures?

1985-1986.  DEC RA81 disks.  Bad glue that degraded at the disk's
operating temperature.  Head crashes.

No more need be said.

- Bill
Re: [zfs-discuss] Proposition of a new zpool property.
On 03/22/10 11:02, Richard Elling wrote:
> Scrub tends to be a random workload dominated by IOPS, not bandwidth.

You may want to look at this again post build 128; the addition of
metadata prefetch to scrub/resilver in that build appears to have
dramatically changed how it performs (largely for the better).

- Bill
Re: [zfs-discuss] Tuning the ARC towards LRU
On 04/05/10 15:24, Peter Schuller wrote:
> In the urxvt case, I am basing my claim on informal observations.
> I.e., "hit terminal launch key, wait for disks to rattle, get my
> terminal".  Repeat.  Only by repeating it very many times in very
> rapid succession am I able to coerce it to be cached such that I can
> immediately get my terminal.  And what I mean by that is that it keeps
> necessitating disk I/O for a long time, even on rapid successive
> invocations.  But once I have repeated it enough times it seems to
> finally enter the cache.

Are you sure you're not seeing unrelated disk update activity like
atime updates, mtime updates on pseudo-terminals, etc.?  I'd want to
start looking more closely at I/O traces (dtrace can be very helpful
here) before blaming any specific system component for the unexpected
I/O.

- Bill
Re: [zfs-discuss] SSD sale on newegg
On 04/06/10 17:17, Richard Elling wrote:
> > You could probably live with an X25-M as something to use for all
> > three, but of course you're making tradeoffs all over the place.
>
> That would be better than almost any HDD on the planet because the HDD
> tradeoffs result in much worse performance.

Indeed.  I've set up a couple of small systems (one a desktop
workstation, and the other a home fileserver) with the root pool plus
the l2arc and slog for a data pool on an 80G X25-M and have been very
happy with the result.

The recipe I'm using is to slice the ssd, with the rpool in s0 with
roughly half the space, 1GB in s3 for the slog, and the rest of the
space as L2ARC in s4.  That may actually be overly generous for the
root pool, but I run with copies=2 on rpool/ROOT and I tend to keep a
bunch of BEs around.

- Bill
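For anyone wanting to copy the recipe, the data-pool half of it boils
down to something like the commands below (the device/slice names are
just examples, the slices are laid out with format(1m) beforehand, and
the root-pool slice is normally set up by the installer):

   # zpool add tank log c1t0d0s3      # ~1GB slice as the slog
   # zpool add tank cache c1t0d0s4    # remaining space as L2ARC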
Re: [zfs-discuss] Secure delete?
On 04/11/10 10:19, Manoj Joseph wrote:
> Earlier writes to the file might have left older copies of the blocks
> lying around which could be recovered.

Indeed; to be really sure you need to overwrite all the free space in
the pool.

If you limit yourself to worrying about data accessible via a regular
read on the raw device, it's possible to do this without an outage if
you have a spare disk and a lot of time.  Rough process:

 0) delete the files and snapshots containing the data you wish to
    purge.
 1) replace a previously unreplaced disk in the pool with the spare
    disk using "zpool replace".
 2) wait for the replace to complete.
 3) wipe the removed disk, using the "purge" command of format(1m)'s
    analyze subsystem or equivalent; the wiped disk is now the spare
    disk.
 4) if all disks have not been replaced yet, go back to step 1.

This relies on the fact that the resilver kicked off by "zpool replace"
copies only allocated data.

There are some assumptions in the above.  For one, I'm assuming that
all disks in the pool are the same size.  A bigger one is that a
"purge" is sufficient to wipe the disks completely -- probably the
biggest single assumption, given that the underlying storage devices
themselves are increasingly using copy-on-write techniques.  The most
paranoid will replace all the disks and then physically destroy the old
ones.

- Bill
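Steps 1-3 above, expressed as commands (pool and device names are
examples only):

   # zpool replace tank c2t3d0 c2t9d0   # step 1: swap in the wiped spare
   # zpool status tank                  # step 2: repeat until the resilver completes
   # format                             # step 3: select the removed disk, analyze -> purge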
Re: [zfs-discuss] Secure delete?
On 04/11/10 12:46, Volker A. Brandt wrote:
> > The most paranoid will replace all the disks and then physically
> > destroy the old ones.
>
> I thought the most paranoid will encrypt everything and then forget
> the key... :-)

Actually, I hear that the most paranoid encrypt everything *and then*
destroy the physical media when they're done with it.

> Seriously, once encrypted zfs is integrated that's a viable method.

It's certainly a new tool to help with the problem, but consider that
forgetting a key requires secure deletion of the key.  Like most
cryptographic techniques, filesystem encryption only changes the size
of the problem we need to solve.

- Bill
Re: [zfs-discuss] Suggestions about current ZFS setup
On 04/14/10 12:37, Christian Molson wrote:
> First I want to thank everyone for their input, it is greatly
> appreciated.  To answer a few questions:
>
> Chassis I have:
>   http://www.supermicro.com/products/chassis/4U/846/SC846E2-R900.cfm
> Motherboard: http://www.tyan.com/product_board_detail.aspx?pid=560
> RAM: 24 GB (12 x 2GB)
> 10 x 1TB Seagate 7200.11
> 10 x 1TB Hitachi
> 4 x 2TB WD WD20EARS (4K blocks)

If you have the spare change for it, I'd add one or two SSDs to the
mix, with space on them allocated to the root pool plus l2arc cache and
slog for the data pool(s).

- Bill
Re: [zfs-discuss] dedup screwing up snapshot deletion
On 04/14/10 19:51, Richard Jahnel wrote:
> This sounds like the known issue about the dedupe map not fitting in
> ram.

Indeed, but this is not correct:

> When blocks are freed, dedupe scans the whole map to ensure each block
> is not in use before releasing it.

That's not correct.  dedup uses a data structure which is indexed by
the hash of the contents of each block.  That hash function is
effectively random, so it needs to access a *random* part of the map
for each free, which means that it (as you correctly stated):

> ... takes a veeery long time if the map doesn't fit in ram.

If you can, try adding more ram to the system.  Adding a flash-based
ssd as a cache/L2ARC device is also very effective; random i/o to ssd
is much faster than random i/o to spinning rust.

- Bill
Re: [zfs-discuss] Is it safe/possible to idle HD's in a ZFS Vdev to save wear/power?
On 04/16/10 20:26, Joe wrote:
> I was just wondering if it is possible to spin down/idle/sleep hard
> disks that are part of a vdev & pool SAFELY?

It's possible.  My ultra24 desktop has this enabled by default (because
it's a known desktop type).  See the power.conf man page; I think you
may need to add an "autopm enable" if the system isn't recognized as a
known desktop.  The disks spin down when the system is idle; there's a
delay of a few seconds when they spin back up.

- Bill
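As a rough sketch (not tested on your hardware), the relevant
/etc/power.conf entries look something like the lines below; the device
path is a placeholder for your disk's physical path, the threshold is
illustrative, and pmconfig(1m) needs to be run after editing the file:

   autopm               enable
   device-thresholds    /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0    15m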
Re: [zfs-discuss] SSD best practices
On 04/17/10 07:59, Dave Vrona wrote:
> 1) Mirroring.  Leaving cost out of it, should ZIL and/or L2ARC SSDs be
> mirrored?

L2ARC cannot be mirrored -- and doesn't need to be.  The contents are
checksummed; if the checksum doesn't match, it's treated as a cache
miss and the block is re-read from the main pool disks.

The ZIL can be mirrored, and mirroring it improves your ability to
recover the pool in the face of multiple failures.

> 2) ZIL write cache.  It appears some have disabled the write cache on
> the X-25E.  This results in a 5-fold performance hit but it eliminates
> a potential mechanism for data loss.  Is this valid?

With the ZIL disabled, you may lose the last ~30s of writes to the pool
(the transaction group being assembled and written at the time of the
crash).

With the ZIL on a device with a write cache that ignores cache flush
requests, you may lose the tail of some of the intent logs, starting
with the first block in each log which wasn't readable after the
restart.  (I say "may" rather than "will" because some failures may not
result in the loss of the write cache.)  Depending on how quickly your
ZIL device pushes writes from cache to stable storage, this may narrow
the window from ~30s to less than 1s, but it doesn't close the window
entirely.

> If I can mirror ZIL, I imagine this is no longer a concern?

Mirroring a ZIL device with a volatile write cache doesn't eliminate
this risk.  Whether it reduces the risk depends on precisely *what*
caused your system to crash and reboot; if the failure also causes loss
of the write cache contents on both sides of the mirror, mirroring
won't help.

- Bill
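In zpool terms, the two cases look roughly like this (device names are
examples):

   # zpool add tank log mirror c4t0d0 c4t1d0   # slog can be mirrored
   # zpool add tank cache c4t2d0               # L2ARC is always unmirrored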
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 05/01/10 13:06, Diogo Franco wrote:
> After seeing that on some cases labels were corrupted, I tried running
> zdb -l on mine:
> ... (labels 0, 1 not there, labels 2, 3 are there).
>
> I'm looking for pointers on how to fix this situation, since the disk
> still has available metadata.

There are two reasons why you could get this:

 1) the labels are gone.
 2) the labels are not at the start of what solaris sees as p1, and
    thus are somewhere else on the disk.

I'd look more closely at how freebsd computes the start of the
partition or slice '/dev/ad6s1d' that contains the pool.  I think #2 is
somewhat more likely.

- Bill
[zfs-discuss] confused about zpool import -f and export
Hi all,

I think I'm missing a concept with import and export.  I'm working on
installing a Nexenta b134 system under Xen: I have to run the installer
under hvm mode, then I'm trying to get it back up under pv mode.  In
that process the controller names change, and that's where I'm getting
tripped up.

I do a successful install, then I boot OK, but can't export the root
pool (OK, fine).  So I boot from the installer CD in rescue mode, do an
'import -f' and then 'export'.  That all goes well.

When I reconfigure the VM and boot back up in pv mode, if I bring it up
under the CD image and do 'zpool import', I get:

==========
r...@nexenta_safemode:~# zpool import
  pool: syspool
    id: 5607125904664422185
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try
        again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        syspool       UNAVAIL  missing device
          mirror-0    ONLINE
            c0t0d0s0  ONLINE
            c0t1d0s0  ONLINE

        Additional devices are known to be part of this pool, though
        their exact configuration cannot be determined.
==========

I thought the purpose of the export was to remove concerns about which
devices are in the pool so it could be reassembled on the other side.
But, like I said, I think I'm missing something, because 'export'
doesn't seem to clear this up.  Or maybe it does, but I'm not
understanding the other thing that's supposed to be cleared up.

This worked back on a 20081207 build, so perhaps something has changed?

I'm adding format's view of the disks and a zdb list below.

Thanks,
-Bill

r...@nexenta_safemode:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 /xpvd/x...@51712
       1. c0t1d0 /xpvd/x...@51728
Specify disk (enter its number): ^D

r...@nexenta_safemode:~# zdb -l /dev/rdsk/c0t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version: 22
    name: 'syspool'
    state: 1
    txg: 384
    pool_guid: 5607125904664422185
    hostid: 4905600
    hostname: 'nexenta_safemode'
    top_guid: 7124011680357776878
    guid: 15556832564812580834
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 7124011680357776878
        metaslab_array: 23
        metaslab_shift: 32
        ashift: 9
        asize: 750041956352
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 15556832564812580834
            path: '/dev/dsk/c0d0s0'
            devid: 'id1,c...@aqemu_harddisk=qm1/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@0,0:a'
            whole_disk: 0
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 544113268733868414
            path: '/dev/dsk/c0d1s0'
            devid: 'id1,c...@aqemu_harddisk=qm2/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@1,0:a'
            whole_disk: 0
            create_txg: 4
--------------------------------------------
LABEL 1
--------------------------------------------
    version: 22
    name: 'syspool'
    state: 1
    txg: 384
    pool_guid: 5607125904664422185
    hostid: 4905600
    hostname: 'nexenta_safemode'
    top_guid: 7124011680357776878
    guid: 15556832564812580834
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 7124011680357776878
        metaslab_array: 23
        metaslab_shift: 32
        ashift: 9
        asize: 750041956352
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 15556832564812580834
            path: '/dev/dsk/c0d0s0'
            devid: 'id1,c...@aqemu_harddisk=qm1/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@0,0:a'
            whole_disk: 0
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 544113268733868414
            path: '/dev/dsk/c0d1s0'
            devid: 'id1,c...@aqemu_harddisk=qm2/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@1,0:a'
            whole_disk: 0
            create_txg: 4
--------------------------------------------
LABEL 2
--------------------------------------------
    version: 22
    name: 'syspool'
    state: 0
    txg: 11520
    pool_guid: 15023076366841556794
    hostid: 8399112
    hostname: 'repository'
    top_guid: 12107281337513313186
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
On 05/07/2010 11:08 AM, Edward Ned Harvey wrote:
> I'm going to continue encouraging you to stay "mainstream," because
> what people do the most is usually what's supported the best.

If I may be the contrarian, I hope Matt keeps experimenting with this,
files bugs, and they get fixed.  His use case is very compelling - I
know lots of SOHO folks who could really use a NAS where this 'just
worked'.

The ZFS team has done well by thinking liberally about conventional
assumptions.

-Bill

--
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
Re: [zfs-discuss] ZFS - USB 3.0 SSD disk
On 05/06/2010 11:00 AM, Bruno Sousa wrote:
> Going on the specs it seems to me that if this device has a good price
> it might be quite useful for caching purposes on ZFS based storage.

Not bad; they claim a 1TB transfer in 47 minutes:

http://www.google.com/search?hl=en&q=1TB%2F47+minutes

That's about double what I usually get out of a cheap 'desktop' SATA
drive with OpenSolaris.  Slower than a RAID-Z2 of 10 of them, though.
Still, the power savings could be appreciable.

-Bill

--
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
Re: [zfs-discuss] ZFS root ARC memory usage on VxFS system...
On 05/07/10 15:05, Kris Kasner wrote:
> Is ZFS swap cached in the ARC?  I can't account for data in the ZFS
> filesystems to use as much ARC as is in use without the swap files
> being cached.. seems a bit redundant?

There's nothing to explicitly disable caching just for swap; from zfs's
point of view, the swap zvol is just like any other zvol.  But you can
turn this off (assuming sufficiently recent zfs).  Try:

   zfs set primarycache=metadata rpool/swap

(or whatever your swap zvol is named).  You probably want metadata
rather than "none" so that things like indirect blocks for the swap
device get cached.

- Bill
Re: [zfs-discuss] New SSD options
On 05/20/10 12:26, Miles Nordin wrote:
> I don't know, though, what to do about these reports of devices that
> almost respect cache flushes but seem to lose exactly one transaction.
> AFAICT this should be a works/doesntwork situation, not a continuum.
> But there's so much brokenness out there.

I've seen similar "tail drop" behavior before -- the last write or two
before a hardware reset goes into the bit bucket, but ones before that
are durable.

So, IMHO, a cheap consumer ssd used as a zil may still be worth it (for
some use cases) to narrow the window of data loss from ~30 seconds to a
sub-second value.

- Bill
Re: [zfs-discuss] Dedup... still in beta status
On 06/15/10 10:52, Erik Trimble wrote:
> Frankly, dedup isn't practical for anything but enterprise-class
> machines.  It's certainly not practical for desktops or anything
> remotely low-end.

We're certainly learning a lot about how zfs dedup behaves in practice.

I've enabled dedup on two desktops and a home server and so far haven't
regretted it on those three systems.  However, they each have more than
typical amounts of memory (4G and up), a data pool on two or more
large-capacity SATA drives, plus an X25-M ssd sliced into a root pool
as well as l2arc and slog slices for the data pool (see below [1]).

I tried enabling dedup on a smaller system (with only 1G memory and a
single very slow disk), observed serious performance problems, and
turned it off pretty quickly.

I think, with current bits, it's not a simple matter of "ok for
enterprise, not ok for desktops".  With an ssd for either main storage
or l2arc, and/or enough memory, and/or a not very demanding workload,
it seems to be ok.

For one such system, I'm seeing:

# zpool list z
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
z      464G   258G   206G    55%  1.25x  ONLINE  -
# zdb -D z
DDT-sha256-zap-duplicate: 432759 entries, size 304 on disk, 156 in core
DDT-sha256-zap-unique: 1094244 entries, size 298 on disk, 151 in core

dedup = 1.25, compress = 1.44, copies = 1.00, dedup * compress / copies = 1.80

- Bill

[1] To forestall responses of the form "you're nuts for putting a slog
on an x25-m", which is off-topic for this thread and being discussed
elsewhere: yes, I'm aware of the write cache issues on power failure on
the x25-m.  For my purposes, it's a better robustness/performance
tradeoff than either zil-on-spinning-rust or zil disabled, because:

 a) for many potential failure cases on whitebox hardware running
    bleeding edge opensolaris bits, the x25-m will not lose power and
    thus the write cache will stay intact across a crash.
 b) even if it loses power and loses some writes-in-flight, it's not
    likely to lose *everything* since the last txg sync.

It's good enough for my personal use.  Your mileage will vary.  As
always, system design involves tradeoffs.
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On 07/20/10 14:10, Marcelo H Majczak wrote:
> It also seems to be issuing a lot more writing to rpool, though I
> can't tell what.  In my case it causes a lot of read contention since
> my rpool is a USB flash device with no cache.  iostat says something
> like up to 10w/20r per second.  Up to 137 the performance has been
> enough, so far, for my purposes on this laptop.

If pools are more than about 60-70% full, you may be running into bug
6962304.

Workaround: add the following to /etc/system, run bootadm
update-archive, and reboot:

----cut here----
* Work around 6962304
set zfs:metaslab_min_alloc_size=0x1000
* Work around 6965294
set zfs:metaslab_smo_bonus_pct=0xc8
----cut here----

No guarantees, but it's helped a few systems.

- Bill
Re: [zfs-discuss] L2ARC and ZIL on same SSD?
On 07/22/10 04:00, Orvar Korvar wrote:
> Ok, so the bandwidth will be cut in half, and some people use this
> configuration.  But how bad is it to have the bandwidth cut in half?
> Will it even be noticeable?

For a home server, I doubt you'll notice.

I've set up several systems (desktop & home server) as follows:

 - two large conventional disks, mirrored, as the data pool.
 - a single 80GB X25-M, divided into three slices:
     50% in slice 0 as the root pool (with dedup & compression enabled,
       and copies=2 for rpool/ROOT)
     1GB in slice 3 as ZIL for the data pool
     the remainder in slice 4 as L2ARC for the data pool.

Two conventional disks + 1 ssd performs much better than two disks
alone.  If I needed more space (I haven't, yet), I'd add another mirror
pair or two to the data pool.  I've been very happy with the results.

- Bill
Re: [zfs-discuss] Increase resilver priority
On 07/23/10 02:31, Giovanni Tirloni wrote:
> We've seen some resilvers on idle servers that are taking ages.  Is it
> possible to speed up resilver operations somehow?  E.g. iostat shows
> <5MB/s writes on the replaced disks.

What build of opensolaris are you running?  There were some recent
improvements (notably the addition of prefetch to the pool traverse
used by scrub and resilver) which sped this up significantly for my
systems.

Also: if there are large numbers of snapshots, pools seem to take
longer to resilver, particularly when there's a lot of metadata
divergence between snapshots.

Turning off atime updates (if you and your applications can cope with
this) may also help going forward.

- Bill
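Turning atime off is a one-liner per pool or dataset (the name below is
just an example; descendants inherit the setting unless overridden):

   # zfs set atime=off tank
   # zfs get -r atime tank    # check what each dataset ended up with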
Re: [zfs-discuss] Resilvering, amount of data on disk, etc.
On Mon, 2009-10-26 at 10:24 -0700, Brian wrote:
> Why does resilvering an entire disk yield different amounts of
> resilvered data each time?  I have read that ZFS only resilvers what
> it needs to, but in the case of replacing an entire disk with another
> formatted clean disk, you would think the amount of data would be the
> same each time a disk is replaced with an empty formatted disk.
> I'm getting different results when viewing the 'zpool status' info
> (below).

Replacing a disk adds an entry to the "zpool history" log, which
requires allocating blocks, which will change what's stored in the
pool.
Re: [zfs-discuss] sched regularly writing lots of MBs to the pool?
zfs groups writes together into transaction groups; the physical writes
to disk are generally initiated by kernel threads (which appear in
dtrace as threads of the "sched" process).

Changing the attribution is not going to be simple, as a single
physical write to the pool may contain data and metadata changes
triggered by multiple user processes.  You need to go up a level of
abstraction and look at the vnode layer to attribute writes to
particular processes.
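One sketch of doing that attribution at the vnode (VFS) layer is the
DTrace fsinfo provider -- for example, summing bytes written per
process into zfs filesystems (this assumes the fsinfo provider is
available on your build; adjust the predicate to taste):

   #!/usr/sbin/dtrace -s
   /* bytes written per process, counted at the VFS layer, zfs only */
   fsinfo:::write
   /args[0]->fi_fs == "zfs"/
   {
           @bytes[execname] = sum(arg1);
   }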
Re: [zfs-discuss] dedupe question
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
> Does the dedupe functionality happen at the file level or a lower
> block level?

It occurs at the block allocation level.

> I am writing a large number of files that have the following
> structure:
>
> -- file begins
> 1024 lines of random ASCII chars 64 chars long
> some tilde chars .. about 1000 of them
> some text ( english ) for 2K
> more text ( english ) for 700 bytes or so
> --

ZFS's default block size is 128K and is controlled by the "recordsize"
filesystem property.  Unless you changed "recordsize", each of the
files above would be a single block distinct from the others.

You may or may not get better dedup ratios with a smaller recordsize,
depending on how the common parts of the file line up with block
boundaries.  The cost of additional indirect blocks might overwhelm the
savings from deduping a small common piece of the file.

- Bill
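If you want to experiment, something like the following would let you
gauge the effect (pool/dataset names are examples; recordsize only
applies to files written after the change, and zdb -S is only present
on dedup-capable builds):

   # zdb -S tank                     # simulate the dedup table for existing data
   # zfs set recordsize=8k tank/data
   # zfs set dedup=on tank/data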
Re: [zfs-discuss] This is the scrub that never ends...
On Fri, 2009-09-11 at 13:51 -0400, Will Murnane wrote:
> On Thu, Sep 10, 2009 at 13:06, Will Murnane wrote:
> > On Wed, Sep 9, 2009 at 21:29, Bill Sommerfeld wrote:
> > > > Any suggestions?
> > >
> > > Let it run for another day.
> > I'll let it keep running as long as it wants this time.
>
>  scrub: scrub completed after 42h32m with 0 errors on Thu Sep 10
>  17:20:19 2009
>
> And the people rejoiced.  So I guess the issue is more "scrubs may
> report ETA very inaccurately" than "scrubs never finish".  Thanks for
> the suggestions and support.

One of my pools routinely does this -- the scrub gets to 100% after
about 50 hours but keeps going for another day or more after that.

It turns out that zpool reports "number of blocks visited" vs "number
of blocks allocated", but clamps the ratio at 100%.  If there is
substantial turnover in the pool, it appears you may end up needing to
visit more blocks than are actually allocated at any one point in time.

I made a modified version of the zpool command and this is what it
prints for me:

...
 scrub: scrub in progress for 74h25m, 119.90% done, 0h0m to go
        5428197411840 blocks examined, 4527262118912 blocks allocated
...

This is the (trivial) source change I made to see what's going on under
the covers:

diff -r 12fb4fb507d6 usr/src/cmd/zpool/zpool_main.c
--- a/usr/src/cmd/zpool/zpool_main.c    Mon Oct 26 22:25:39 2009 -0700
+++ b/usr/src/cmd/zpool/zpool_main.c    Tue Nov 10 17:07:59 2009 -0500
@@ -2941,12 +2941,15 @@
 	if (examined == 0)
 		examined = 1;
-	if (examined > total)
-		total = examined;
 	fraction_done = (double)examined / total;
-	minutes_left = (uint64_t)((now - start) *
-	    (1 - fraction_done) / fraction_done / 60);
+	if (fraction_done < 1) {
+		minutes_left = (uint64_t)((now - start) *
+		    (1 - fraction_done) / fraction_done / 60);
+	} else {
+		minutes_left = 0;
+	}
+
 	minutes_taken = (uint64_t)((now - start) / 60);
 
 	(void) printf(gettext("%s in progress for %lluh%um, %.2f%% done, "
@@ -2954,6 +2957,9 @@
 	    scrub_type, (u_longlong_t)(minutes_taken / 60),
 	    (uint_t)(minutes_taken % 60), 100 * fraction_done,
 	    (u_longlong_t)(minutes_left / 60), (uint_t)(minutes_left % 60));
+	(void) printf(gettext("\t %lld blocks examined, %lld blocks allocated\n"),
+	    examined,
+	    total);
 }
 
 static void
Re: [zfs-discuss] zfs eradication
On Wed, 2009-11-11 at 10:29 -0800, Darren J Moffat wrote:
> Joerg Moellenkamp wrote:
> > Hi,
> >
> > Well ... i think Darren should implement this as a part of
> > zfs-crypto.  Secure Delete on SSD looks like quite a challenge, when
> > wear leveling and bad block relocation kick in ;)
>
> No, I won't be doing that as part of the zfs-crypto project.  As I
> said, some jurisdictions are happy that if the data is encrypted then
> overwrite of the blocks isn't required.  For those that aren't, use of
> dd(1M) or format(1M) may be sufficient - if that isn't, then nothing
> short of physical destruction is likely good enough.

Note that "eradication" via overwrite makes no sense if the underlying
storage uses copy-on-write, because there's no guarantee that the newly
written block actually will overlay the freed block.

IMHO the sweet spot here may be to overwrite once with zeros (allowing
the block to be compressed out of existence if the underlying storage
is a compressed zvol or equivalent) or to use the TRIM command.

(It may also be worthwhile for zvols exported via various protocols to
themselves implement the TRIM command -- freeing the underlying
storage.)

- Bill
Re: [zfs-discuss] Resilver/scrub times?
Yesterday's integration of

  6678033 resilver code should prefetch

as part of changeset 74e8c05021f1 (which should be in build 129 when it
comes out) may improve scrub times, particularly if you have a large
number of small files and a large number of snapshots.

I recently tested an early version of the fix, and saw one pool go from
an elapsed time of 85 hours to 20 hours; another (with many fewer
snapshots) went from 35 to 17.

- Bill
[zfs-discuss] USB sticks show on one set of devices in zpool, different devices in format
Hello,

I had snv_111b running for a while on an HP DL160G5, with two 16GB USB
sticks comprising the mirrored rpool for boot, and four 1TB drives
comprising another pool, pool1, for data.  That had been working just
fine for a few months.

Yesterday I got it into my mind to upgrade the OS to the latest, which
was snv_127.  That worked, and all was well.  I also upgraded the
DL160G5's BIOS firmware.  All was cool and running as snv_127 just
fine.  I upgraded zfs from 13 to 19.  See pool status post-upgrade:

r...@arc:/# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

Today I went to activate the BE for the new snv_127 install that I've
been manually booting into, but "beadm activate ..." always fails, see
here:

r...@arc:~# export BE_PRINT_ERR=true
r...@arc:~# beadm activate opensolaris-snv127
be_do_installgrub: installgrub failed for device c2t0d0s0.
Unable to activate opensolaris-snv127.
Unknown external error.

So I tried the installgrub manually and got this:

r...@arc:~# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2t0d0s0
cannot open/stat device /dev/rdsk/c2t0d0s2

OK, wtf?  The rpool status shows both of my USB sticks alive and well
at c2t0d0s0 and c1t0d0s0... but when I run "format -e" I see this:

r...@arc:/# format -e
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c7t1d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@1,0
       1. c7t2d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@2,0
       2. c7t3d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@3,0
       3. c7t4d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@4,0
       4. c8t0d0 /p...@0,0/pci103c,3...@1d,7/stor...@8/d...@0,0
       5. c11t0d0 /p...@0,0/pci103c,3...@1d,7/stor...@6/d...@0,0
Specify disk (enter its number): 4
selecting c8t0d0
[disk formatted]
/dev/dsk/c8t0d0s0 is part of active ZFS pool rpool. Please see zpool(1M).

It shows my two USB sticks of the rpool being at c8t0d0 and c11t0d0...!
How is this system even working?  What do I need to do to clear this
up...?

Thanks for your time,
-Bill
Re: [zfs-discuss] zfs on ssd
On Fri, 2009-12-11 at 13:49 -0500, Miles Nordin wrote:
> > "sh" == Seth Heeren writes:
>
>     sh> If you don't want/need log or cache, disable these?  You might
>     sh> want to run your ZIL (slog) on ramdisk.
>
> seems quite silly.  why would you do that instead of just disabling
> the ZIL?  I guess it would give you a way to disable it pool-wide
> instead of system-wide.
>
> A per-filesystem ZIL knob would be awesome.

For what it's worth, there's already a per-filesystem ZIL knob: the
"logbias" property.  It can be set either to "latency" or "throughput".
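Usage is per dataset, e.g. (the dataset name is just an example):

   # zfs set logbias=throughput tank/db-logs
   # zfs get logbias tank/db-logs

With logbias=throughput, synchronous writes for that dataset bypass any
separate log device and are optimized for overall pool throughput
instead of latency.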
[zfs-discuss] zpool fragmentation issues?
Hi Everyone,

I hope this is the right forum for this question.  A customer is using
a Thumper as an NFS file server to provide the mail store for multiple
email servers (Dovecot).  They find that when a zpool is freshly
created and populated with mailboxes, even to the extent of 80-90%
capacity, performance is ok for the users, and backups and scrubs take
a few hours (4TB of data).  There are around 100 file systems.  After
running for a while (a couple of months) the zpool seems to get
"fragmented"; backups take 72 hours and a scrub takes about 180 hours.
They are running mirrors with about 5TB usable per pool (500GB disks).

Being a mail store, the writes and reads are small and random.  The
record size has been set to 8k (which improved performance
dramatically).  The backup application is Amanda.  Once backups become
too tedious, the remedy is to replicate the pool and start over.
Things get fast again for a while.

Is this expected behavior given the application (email - small, random
writes/reads)?  Are there recommendations for system/ZFS/NFS
configurations to improve this sort of thing?  Are there best practices
for structuring backups to avoid a directory walk?

Thanks,
bill
[zfs-discuss] force 4k writes?
This is most likely a naive question on my part.  If recordsize is set
to 4k (or a multiple of 4k), will ZFS ever write a record that is less
than 4k or not a multiple of 4k?  This includes metadata.  Does
compression have any effect on this?

Thanks for the help,
bill
Re: [zfs-discuss] zpool fragmentation issues?
On Tue, 2009-12-15 at 17:28 -0800, Bill Sprouse wrote:
> After running for a while (couple of months) the zpool seems to get
> "fragmented", backups take 72 hours and a scrub takes about 180 hours.

Are there periodic snapshots being created in this pool?

Can they run with atime turned off?  (File tree walks performed by
backups will update the atime of all directories; this will generate
extra write traffic and also cause snapshots to diverge from their
parents and take longer to scrub.)

- Bill
Re: [zfs-discuss] force 4k writes
Hi Richard,

How's the ranch?  ;-)

> > This is most likely a naive question on my part.  If recordsize is
> > set to 4k (or a multiple of 4k), will ZFS ever write a record that
> > is less than 4k or not a multiple of 4k?
>
> Yes.  The recordsize is the upper limit for a file record.
>
> > This includes metadata.
>
> Yes.  Metadata is compressed and seems to usually be one block.
>
> > Does compression have any effect on this?
>
> Yes.  4KB is the minimum size that can be compressed for regular data.
> NB. Physical writes may be larger because they are coalesced.  But if
> you are worried about recordsize, then you are implicitly worried
> about reads.

The question behind the question is: given the really bad things that
can happen performance-wise with writes that are not 4k aligned when
using flash devices, is there any way to ensure that any and all writes
from ZFS are 4k aligned?

> -- richard
Re: [zfs-discuss] zpool fragmentation issues?
On Dec 15, 2009, at 6:24 PM, Bill Sommerfeld wrote:
> On Tue, 2009-12-15 at 17:28 -0800, Bill Sprouse wrote:
> > After running for a while (couple of months) the zpool seems to get
> > "fragmented", backups take 72 hours and a scrub takes about 180
> > hours.
>
> Are there periodic snapshots being created in this pool?

Yes, every two hours.

> Can they run with atime turned off?

I'm not sure, but I expect they can.  I'll ask.

> (File tree walks performed by backups will update the atime of all
> directories; this will generate extra write traffic and also cause
> snapshots to diverge from their parents and take longer to scrub.)
>
> - Bill

Thanks!
Re: [zfs-discuss] zpool fragmentation issues?
Hi Bob,

On Dec 15, 2009, at 6:41 PM, Bob Friesenhahn wrote:
> On Tue, 15 Dec 2009, Bill Sprouse wrote:
> > Hi Everyone, I hope this is the right forum for this question.  A
> > customer is using a Thumper as an NFS file server to provide the
> > mail store for multiple email servers (Dovecot).  They find that
> > when a zpool is freshly created and
>
> It seems that Dovecot's speed optimizations for mbox format are
> specially designed to break zfs
> "http://wiki.dovecot.org/MailboxFormat/mbox#Dovecot.27s_Speed_Optimizations"
> and explains why using a tiny 8k recordsize temporarily "improved"
> performance.
>
> Tiny updates seem to be abnormal for a mail server.  The many tiny
> updates combined with zfs COW conspire to spread the data around the
> disk, requiring a seek for each 8k of data.  If more data was written
> at once, and much larger blocks were used, then the filesystem would
> continue to perform much better, although perhaps less well
> initially.  If the system has sufficient RAM, or a large enough L2ARC,
> then Dovecot's optimizations to diminish reads become meaningless.

I think one of the reasons they went to small recordsizes was an issue
where they were getting killed with reads of small messages and having
to pull in 128K records each time.  The smaller recordsizes seem to
have improved that aspect at least.  Thanks for the pointer to the
Dovecot notes.

> > Is this expected behavior given the application (email - small,
> > random writes/reads)?  Are there recommendations for system/ZFS/NFS
> > configurations to improve this sort of thing?  Are there best
> > practices for structuring backups to avoid a directory walk?
>
> Zfs works best when whole files are re-written rather than updated in
> place as Dovecot seems to want to do.  Either the user mailboxes
> should be re-written entirely when they are "expunged" or else a
> different mail storage format which writes entire files, or much
> larger records, should be used.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zpool fragmentation issues?
Thanks Michael,

Useful stuff to try.  I wish we could add more memory, but the x4500 is
limited to 16GB.  Compression was a question; it's currently off, but
they were thinking of turning it on.

bill

On Dec 15, 2009, at 7:02 PM, Michael Herf wrote:
> I have also had slow scrubbing on filesystems with lots of files, and
> I agree that it does seem to degrade badly.  For me, it seemed to go
> from 24 hours to 72 hours in a matter of a few weeks.
>
> I did these things on a pool in-place, which helped a lot (no
> rebuilding):
> 1. reduced number of snapshots (auto snapshots can generate a lot of
>    files).
> 2. disabled compression and rebuilt affected datasets (is compression
>    on?)
> 3. upgraded to b129, which has metadata prefetch for scrub, seems to
>    help by ~2x?
> 4. tar'd up some extremely large folders
> 5. added 50% more RAM.
> 6. turned off atime
>
> My scrubs went from 80 hours to 12 with these changes.  (4TB used,
> ~10M files + 10 snapshots each.)  I haven't figured out if "disable
> compression" vs. "fewer snapshots/files and more RAM" made a bigger
> difference.
>
> I'm assuming that once the number of files exceeds ARC, you get
> dramatically lower performance, and maybe that compression has some
> additional overhead, but I don't know, this is just what worked.  It
> would be nice to have a benchmark set for features like this & general
> recommendations for RAM/ARC size, based on number of files, etc.  How
> does ARC usage scale with snapshots?  Scrub on a huge maildir machine
> seems like it would make a nice benchmark.
>
> I used "zdb -d pool" to figure out which filesystems had a lot of
> objects, and figured out places to trim based on that.
>
> mike
>
> On Tue, Dec 15, 2009 at 6:41 PM, Bob Friesenhahn wrote:
> > On Tue, 15 Dec 2009, Bill Sprouse wrote:
> > > Hi Everyone, I hope this is the right forum for this question.  A
> > > customer is using a Thumper as an NFS file server to provide the
> > > mail store for multiple email servers (Dovecot).  They find that
> > > when a zpool is freshly created and
> >
> > It seems that Dovecot's speed optimizations for mbox format are
> > specially designed to break zfs
> > "http://wiki.dovecot.org/MailboxFormat/mbox#Dovecot.27s_Speed_Optimizations"
> > and explains why using a tiny 8k recordsize temporarily "improved"
> > performance.
> >
> > Tiny updates seem to be abnormal for a mail server.  The many tiny
> > updates combined with zfs COW conspire to spread the data around the
> > disk, requiring a seek for each 8k of data.  If more data was
> > written at once, and much larger blocks were used, then the
> > filesystem would continue to perform much better, although perhaps
> > less well initially.  If the system has sufficient RAM, or a large
> > enough L2ARC, then Dovecot's optimizations to diminish reads become
> > meaningless.
> >
> > > Is this expected behavior given the application (email - small,
> > > random writes/reads)?  Are there recommendations for
> > > system/ZFS/NFS configurations to improve this sort of thing?  Are
> > > there best practices for structuring backups to avoid a directory
> > > walk?
> >
> > Zfs works best when whole files are re-written rather than updated
> > in place as Dovecot seems to want to do.  Either the user mailboxes
> > should be re-written entirely when they are "expunged" or else a
> > different mail storage format which writes entire files, or much
> > larger records, should be used.
> >
> > Bob
> > --
> > Bob Friesenhahn
> > bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> > GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zpool fragmentation issues?
Hi Brent,

I'm not sure why Dovecot was chosen.  It was most likely a
recommendation by a fellow university.  I agree that it is lacking in
efficiencies in a lot of areas.  I don't think I would be successful in
suggesting a change at this point, as I have already suggested a couple
of alternatives without success.

Do you have a pointer to the "block/parity rewrite" tool mentioned
below?

bill

On Dec 15, 2009, at 9:38 PM, Brent Jones wrote:
> On Tue, Dec 15, 2009 at 5:28 PM, Bill Sprouse wrote:
> > Hi Everyone, I hope this is the right forum for this question.  A
> > customer is using a Thumper as an NFS file server to provide the
> > mail store for multiple email servers (Dovecot).  They find that
> > when a zpool is freshly created and populated with mail boxes, even
> > to the extent of 80-90% capacity, performance is ok for the users,
> > backups and scrubs take a few hours (4TB of data).  There are around
> > 100 file systems.  After running for a while (couple of months) the
> > zpool seems to get "fragmented", backups take 72 hours and a scrub
> > takes about 180 hours.  They are running mirrors with about 5TB
> > usable per pool (500GB disks).
> >
> > Being a mail store, the writes and reads are small and random.
> > Record size has been set to 8k (improved performance dramatically).
> > The backup application is Amanda.  Once backups become too tedious,
> > the remedy is to replicate the pool and start over.  Things get fast
> > again for a while.
> >
> > Is this expected behavior given the application (email - small,
> > random writes/reads)?  Are there recommendations for system/ZFS/NFS
> > configurations to improve this sort of thing?  Are there best
> > practices for structuring backups to avoid a directory walk?
> >
> > Thanks,
> > bill
>
> Any reason in particular they chose to use Dovecot with the old Mbox
> format?  Mbox has been proven many times over to be painfully slow
> when the files get larger, and in this day and age, I can't imagine
> anyone having smaller than a 50MB mailbox.  We have about 30,000
> e-mail users on various systems, and it seems the average size these
> days is approaching close to a GB.
>
> Though Dovecot has done a lot to improve the performance of Mbox
> mailboxes, Maildir might be more rounded for your system.
>
> I wonder if the "soon to be released" block/parity rewrite tool will
> "freshen" up a pool that's heavily fragmented, without having to redo
> the pools.
>
> --
> Brent Jones
> br...@servuhome.net
Re: [zfs-discuss] zpool fragmentation issues?
Just checked with the customer: they are using the Maildir
functionality with Dovecot.

On Dec 16, 2009, at 11:28 AM, Toby Thain wrote:
> On 16-Dec-09, at 10:47 AM, Bill Sprouse wrote:
> > Hi Brent, I'm not sure why Dovecot was chosen.  It was most likely a
> > recommendation by a fellow university.  I agree that it is lacking
> > in efficiencies in a lot of areas.  I don't think I would be
> > successful in suggesting a change at this point as I have already
> > suggested a couple of alternatives without success.
>
> (As Damon pointed out) The problem seems not Dovecot per se but the
> choice of mbox format, which is rather self-evidently inefficient.
>
> > Do you have a pointer to the "block/parity rewrite" tool mentioned
> > below?
>
> It headlines the informal roadmap presented by Jeff Bonwick:
>
> http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf
>
> --Toby
>
> > bill
> >
> > On Dec 15, 2009, at 9:38 PM, Brent Jones wrote:
> > > On Tue, Dec 15, 2009 at 5:28 PM, Bill Sprouse wrote:
> > > > Hi Everyone, I hope this is the right forum for this question.
> > > > A customer is using a Thumper as an NFS file server to provide
> > > > the mail store for multiple email servers (Dovecot).  They find
> > > > that when a zpool is freshly created and populated with mail
> > > > boxes, even to the extent of 80-90% capacity, performance is ok
> > > > for the users, backups and scrubs take a few hours (4TB of
> > > > data).  There are around 100 file systems.  After running for a
> > > > while (couple of months) the zpool seems to get "fragmented",
> > > > backups take 72 hours and a scrub takes about 180 hours.  They
> > > > are running mirrors with about 5TB usable per pool (500GB
> > > > disks).
> > > >
> > > > Being a mail store, the writes and reads are small and random.
> > > > Record size has been set to 8k (improved performance
> > > > dramatically).  The backup application is Amanda.  Once backups
> > > > become too tedious, the remedy is to replicate the pool and
> > > > start over.  Things get fast again for a while.
> > > >
> > > > Is this expected behavior given the application (email - small,
> > > > random writes/reads)?  Are there recommendations for
> > > > system/ZFS/NFS configurations to improve this sort of thing?
> > > > Are there best practices for structuring backups to avoid a
> > > > directory walk?
> > > >
> > > > Thanks,
> > > > bill
> > >
> > > Any reason in particular they chose to use Dovecot with the old
> > > Mbox format?  Mbox has been proven many times over to be painfully
> > > slow when the files get larger, and in this day and age, I can't
> > > imagine anyone having smaller than a 50MB mailbox.  We have about
> > > 30,000 e-mail users on various systems, and it seems the average
> > > size these days is approaching close to a GB.
> > >
> > > Though Dovecot has done a lot to improve the performance of Mbox
> > > mailboxes, Maildir might be more rounded for your system.
> > >
> > > I wonder if the "soon to be released" block/parity rewrite tool
> > > will "freshen" up a pool that's heavily fragmented, without having
> > > to redo the pools.
> > >
> > > --
> > > Brent Jones
> > > br...@servuhome.net
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Thanks for this thread! I was just coming here to discuss this very same problem. I'm running 2009.06 on a Q6600 with 8GB of RAM. I have a Windows system writing multiple OTA HD video streams via CIFS to the 2009.06 system running Samba. I then have multiple clients reading back other HD video streams. The write client never skips a beat, but the read clients have constant problems getting data when the "burst" writes occur. I am now going to try the txg_timeout and see if that helps. It would be nice if these tunables were settable on a per-pool basis though. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks and caches
On Thu, 2010-01-07 at 11:07 -0800, Anil wrote: > There is talk about using those cheap disks for rpool. Isn't rpool > also prone to a lot of writes, specifically when the /tmp is in a SSD? Huh? By default, solaris uses tmpfs for /tmp, /var/run, and /etc/svc/volatile; writes to those filesystems won't hit the SSD unless the system is short on physical memory. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Degraded pool members excluded from writes?
On 01/24/10 12:20, Lutz Schumann wrote: One can see that the degraded mirror is excluded from the writes. I think this is expected behaviour, right? (data protection over performance) That's correct. It will use the space if it needs to but it prefers to avoid "sick" top-level vdevs if there are healthy ones available. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol being charged for double space
On 01/27/10 21:17, Daniel Carosone wrote: This is as expected. Not expected is that: usedbyrefreservation = refreservation I would expect this to be 0, since all the reserved space has been allocated. This would be the case if the volume had no snapshots. As a result, used is over twice the size of the volume (+ a few small snapshots as well). I'm seeing essentially the same thing with a recently-created zvol with snapshots that I export via iscsi for time machine backups on a mac. % zfs list -r -o name,refer,used,usedbyrefreservation,refreservation,volsize z/tm/mcgarrett NAME REFER USED USEDREFRESERV REFRESERV VOLSIZE z/tm/mcgarrett 26.7G 88.2G 60G 60G 60G The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with uncompressible data without any writes failing due to the pool being out of space. If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect "USEDBYREFRESERVATION" to shrink as the reserved blocks were actually used. If you want to allow for overcommit, you need to delete the refreservation. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
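To make that last point concrete -- a minimal sketch of allowing overcommit by dropping the refreservation, reusing the dataset name from the listing above:
zfs get refreservation,usedbyrefreservation z/tm/mcgarrett
zfs set refreservation=none z/tm/mcgarrett   # or a smaller value, e.g. refreservation=30g
The usual trade-off applies: once the reservation is gone, writes into the zvol can fail with ENOSPC if the pool fills up.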
Re: [zfs-discuss] server hang with compression on, ping timeouts from remote machine
On 01/31/10 07:07, Christo Kutrovsky wrote: I've also experienced similar behavior (short freezes) when running zfs send|zfs receive with compression on LOCALLY on ZVOLs again. Has anyone else experienced this? Know of any bug? This is on snv117. you might also get better results after the fix to: 6881015 ZFS write activity prevents other threads from running in a timely manner which was fixed in build 129. As a workaround, try a lower gzip compression level -- higher gzip levels usually burn lots more CPU without significantly increasing the compression ratio. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
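A sketch of that workaround (the dataset name here is just an example):
zfs get compression,compressratio tank/vol
zfs set compression=gzip-1 tank/vol   # far cheaper per block than gzip-6 or gzip-9
zfs set compression=lzjb tank/vol     # or fall back to the default algorithm
Note that the property only affects newly written blocks; existing data keeps whatever compression it was written with.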
Re: [zfs-discuss] most of my space is gone
On 02/06/10 08:38, Frank Middleton wrote: AFAIK there is no way to get around this. You can set a flag so that pkg tries to empty /var/pkg/downloads, but even though it looks empty, it won't actually become empty until you delete the snapshots, and IIRC you still have to manually delete the contents. I understand that you can try creating a separate dataset and mounting it on /var/pkg, but I haven't tried it yet, and I have no idea if doing so gets around the BE snapshot problem. You can set the environment variable PKG_CACHEDIR to place the cache in an alternate filesystem. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
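A sketch of that approach (dataset name and mountpoint are just examples):
zfs create -o mountpoint=/export/pkgcache rpool/pkgcache   # outside rpool/ROOT, so it isn't snapshotted/cloned with each BE
export PKG_CACHEDIR=/export/pkgcache
Any pkg operation run with that variable in its environment should then keep its download cache there instead of under /var/pkg in the boot environment.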
Re: [zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 10:33, Lori Alt wrote: This bug is closed as a dup of another bug which is not readable from the opensolaris site, (I'm not clear what makes some bugs readable and some not). the other bug in question was opened yesterday and probably hasn't had time to propagate. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
On 02/12/10 09:36, Felix Buenemann wrote: given I've got ~300GB L2ARC, I'd need about 7.2GB RAM, so upgrading to 8GB would be enough to satisfy the L2ARC. But that would only leave ~800MB free for everything else the server needs to do. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02/26/10 10:45, Paul B. Henson wrote: I've already posited as to an approach that I think would make a pure-ACL deployment possible: http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html Via this concept or something else, there needs to be a way to configure ZFS to prevent the attempted manipulation of legacy permission mode bits from breaking the security policy of the ACL. I believe this proposal is sound. In it, you wrote: The feedback was that the internal Sun POSIX compliance police wouldn't like that ;). There are already per-filesystem tunables for ZFS which allow the system to escape the confines of POSIX (noatime, for one); I don't see why a "chmod doesn't truncate acls" option couldn't join it so long as it was off by default and left off while conformance tests were run. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Freeing unused space in thin provisioned zvols
On 02/26/10 11:42, Lutz Schumann wrote: Idea: - If the guest writes a block with 0's only, the block is freed again - if someone reads this block again - it will get the same 0's it would get if the 0's had been written - The checksum of an "all 0" block can be hard coded for SHA1 / Fletcher, so the comparison "is this a 0-only block?" is easy. With this in place, a host wishing to free thin provisioned zvol space can fill the unused blocks with 0s easily with simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side. You've just described how ZFS behaves when compression is enabled -- a block of zeros is compressed to a hole represented by an all-zeros block pointer. > Does anyone know why this is not incorporated into ZFS? It's in there. Turn on compression to use it. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
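A rough sketch of using that behavior to shrink a thin-provisioned zvol (names are hypothetical, and it only helps if no snapshots are holding the old blocks):
zfs set compression=on tank/guestvol    # zero-filled blocks written from now on become holes
# then, inside the guest:
dd if=/dev/zero of=/MYFILE bs=1M ; rm /MYFILE
zfs get used,referenced tank/guestvol   # should shrink as the zeroed blocks are freed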
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02/26/10 17:38, Paul B. Henson wrote: As I wrote in that new sub-thread, I see no option that isn't surprising in some way. My preference would be for what I labeled as option (b). And I think you absolutely should be able to configure your fileserver to implement your preference. Why shouldn't I be able to configure my fileserver to implement mine :)? acl-chmod interactions have been mishandled so badly in the past that i think a bit of experimentation with differing policies is in order. Based on the amount of wailing I see around acls, I think that, based on personal experience with both systems, AFS had it more or less right and POSIX got it more or less wrong -- once you step into the world of acls, the file mode should be mostly ignored, and an accidental chmod should *not* destroy carefully crafted acls. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS compression and deduplication on root pool on SSD
On 02/28/10 15:58, valrh...@gmail.com wrote: Also, I don't have the numbers to prove this, but it seems to me > that the actual size of rpool/ROOT has grown substantially since I > did a clean install of build 129a (I'm now at build 133). Without > compression, either, that was around 24 GB, but things seem > to have accumulated by an extra 11 GB or so. One common source for this is slowly accumulating files under /var/pkg/download. Clean out /var/pkg/download and delete all but the most recent boot environment to recover space (you need to do this to get the space back because the blocks are referenced by the snapshots used by each clone as its base version). To avoid this in the future, set PKG_CACHEDIR in your environment to point at a filesystem which isn't cloned by beadm -- something outside rpool/ROOT, for instance. On several systems which have two pools (root & data) I've relocated it to the data pool - it doesn't have to be part of the root pool. This has significantly slimmed down my root filesystem on systems which are chasing the dev branch of opensolaris. > At present, my rpool/ROOT has no compression, and no deduplication. I > was wondering about whether it would be a good idea, from a > performance and data integrity standpoint, to use one, the other, or > both, on the root pool. I've used the combination of copies=2 and compression=on on rpool/ROOT for a while and have been happy with the result. On one system I recently moved to an ssd root, I also turned on dedup and it seems to be doing just fine: NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT r2 37G 14.7G 22.3G 39% 1.31x ONLINE - (the relatively high dedup ratio is because I have one live upgrade BE with nevada build 130, and a beadm BE with opensolaris build 130, which is mostly the same) - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
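For reference, all of the above are ordinary per-dataset properties; a sketch, using the dataset name from the example:
zfs set compression=on rpool/ROOT
zfs set copies=2 rpool/ROOT
zfs set dedup=on rpool/ROOT   # only worthwhile if you have the RAM (or L2ARC) to hold the dedup table
zpool list rpool
Keep in mind these settings only apply to blocks written after the property is set; data already on disk keeps its old encoding until it is rewritten.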
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 03/01/10 13:50, Miles Nordin wrote: "dd" == David Dyer-Bennet writes: dd> Okay, but the argument goes the other way just as well -- when dd> I run "chmod 6400 foobar", I want the permissions set that dd> specific way, and I don't want some magic background feature dd> blocking me. This will be true either way. Even if chmod isn't ignored, it will reach into the nest of ACL's and mangle them in some non-obvious way with unpredictable consequences, and the mangling will be implemented by a magical background feature. actually, you can be surprised even if there are no acls in use -- if, unbeknownst to you, some user has been granted file_dac_read or file_dac_write privilege, they will be able to bypass the file modes for read and/or for write. Likewise if that user has been delegated zfs "send" rights on the filesystem the file is in, they'll be able to read every bit of the file. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 03/02/10 08:13, Fredrich Maney wrote: Why not do the same sort of thing and use that extra bit to flag a file, or directory, as being an ACL only file and will negate the rest of the mask? That accomplishes what Paul is looking for, without breaking the existing model for those that need/wish to continue to use it? While we're designing on the fly: Another possibility would be to use an additional umask bit or two to influence the mode-bit - acl interaction. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
On 12/05/11 10:47, Lachlan Mulcahy wrote: > zfs`lzjb_decompress 10 0.0% > unix`page_nextn 31 0.0% > genunix`fsflush_do_pages 37 0.0% > zfs`dbuf_free_range 183 0.1% > genunix`list_next 5822 3.7% > unix`mach_cpu_idle 150261 96.1% your best bet in a situation like this -- where there's a lot of cpu time spent in a generic routine -- is to use an alternate profiling method that shows complete stack traces rather than just the top function on the stack. often the names of functions two or three or four deep in the stack will point at what's really responsible. something as simple as: dtrace -n 'profile-1001 { @[stack()] = count(); }' (let it run for a bit then interrupt it). should show who's calling list_next() so much. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
On 05/28/12 17:13, Daniel Carosone wrote: There are two problems using ZFS on drives with 4k sectors: 1) if the drive lies and presents 512-byte sectors, and you don't manually force ashift=12, then the emulation can be slow (and possibly error prone). There is essentially an internal RMW cycle when a 4k sector is partially updated. We use ZFS to get away from the perils of RMW :) 2) with ashift=12, whether forced manually or automatically because the disks present 4k sectors, ZFS is less space-efficient for metadata and keeps fewer historical uberblocks. two, more specific, problems I've run into recently: 1) if you move a disk with an ashift=9 pool on it from a controller/enclosure/.. combo where it claims to have 512 byte sectors to a path where it is detected as having 4k sectors (even if it can cope with 512-byte aligned I/O), the pool will fail to import and appear to be gravely corrupted; the error message you get will make no mention of the sector size change. Move the disk back to the original location and it imports cleanly. 2) if you have a pool with ashift=9 and a disk dies, and the intended replacement is detected as having 4k sectors, it will not be possible to attach the disk as a replacement drive. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "shareiscsi" and COMSTAR
On Tue, Jun 26, 2012 at 1:47 PM, Jim Klimov wrote: > 1) Is COMSTAR still not-integrated with shareiscsi ZFS attributes? > Or can the pool use the attribute, and the correct (new COMSTAR) > iSCSI target daemon will fire up? I can't speak for Solaris 11, but for illumos, you need to use the stmfadm, itadm, and related tools, not the shareiscsi ZFS property. > 2) What would be the best way to migrate iSCSI server configuration > (LUs, views, allowed client lists, etc.) - is it sufficient to > just export the SMF config of "stmf" service, or do I also need > some other services and/or files (/etc/iscsi, something else?) If you're migrating from the old iSCSI target daemon to COMSTAR, I would recommend doing the migration manually and rebuilding the iSCSI configuration. While this blog entry is written to users of the 7000 series storage appliance, it may be useful as you're thinking about how to proceed: https://blogs.oracle.com/wdp/entry/comstar_iscsi - Bill -- Bill Pijewski, Joyent http://dtrace.org/blogs/wdp/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/12 02:10, Sašo Kiselkov wrote: > Oh jeez, I can't remember how many times this flame war has been going > on on this list. Here's the gist: SHA-256 (or any good hash) produces a > near uniform random distribution of output. Thus, the chances of getting > a random hash collision are around 2^-256 or around 10^-77. I think you're correct that most users don't need to worry about this -- sha-256 dedup without verification is not going to cause trouble for them. But your analysis is off. You're citing the chance that two blocks picked at random will have the same hash. But that's not what dedup does; it compares the hash of a new block to a possibly-large population of other hashes, and that gets you into the realm of "birthday problem" or "birthday paradox". See http://en.wikipedia.org/wiki/Birthday_problem for formulas. So, maybe somewhere between 10^-50 and 10^-55 for there being at least one collision in really large collections of data - still not likely enough to worry about. Of course, that assumption goes out the window if you're concerned that an adversary may develop practical ways to find collisions in sha-256 within the deployment lifetime of a system. sha-256 is, more or less, a scaled-up sha-1, and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from 2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see http://en.wikipedia.org/wiki/SHA-1#SHA-1). on a somewhat less serious note, perhaps zfs dedup should contain "chinese lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation) which asks the sysadmin to report a detected sha-256 collision to eprint.iacr.org or the like... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
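For a rough sense of where numbers in that range come from: with n unique blocks and a 256-bit hash, the usual birthday approximation gives p(at least one collision) ≈ 1 - exp(-n(n-1)/2^257) ≈ n^2 / 2^257. Plugging in, say, n = 2^40 blocks gives p ≈ 2^80 / 2^257 = 2^-177, or roughly 10^-53 -- consistent with the range above; the exact figure obviously depends on how many unique blocks you assume.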
Re: [zfs-discuss] Very poor small-block random write performance
On 07/19/12 18:24, Traffanstead, Mike wrote: iozone doesn't vary the blocksize during the test, it's a very artificial test but it's useful for gauging performance under different scenarios. So for this test all of the writes would have been 64k blocks, 128k, etc. for that particular step. Just as another point of reference I reran the test with a Crucial M4 SSD and the results for 16G/64k were 35mB/s (x5 improvement). I'll rerun that part of the test with zpool iostat and see what it says. For random writes to work without forcing a lot of read i/o and read-modify-write sequences, set the recordsize on the filesystem used for the test to match the iozone recordsize. For instance: zfs set recordsize=64k $fsname and ensure that the files used for the test are re-created after you make this setting change ("recordsize" is sticky at file creation time). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zvol vs zfs send/zfs receive
On 09/14/12 22:39, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Dave Pooser Unfortunately I did not realize that zvols require disk space sufficient to duplicate the zvol, and my zpool wasn't big enough. After a false start (zpool add is dangerous when low on sleep) I added a 250GB mirror and a pair of 3GB mirrors to miniraid and was able to successfully snapshot the zvol: miniraid/RichRAID@exportable This doesn't make any sense to me. The snapshot should not take up any (significant) space on the sending side. It's only on the receiving side, trying to receive a snapshot, that you require space. Because it won't clobber the existing zvol on the receiving side until the complete new zvol was received to clobber it with. But simply creating the snapshot on the sending side should be no problem. By default, zvols have reservations equal to their size (so that writes don't fail due to the pool being out of space). Creating a snapshot in the presence of a reservation requires reserving enough space to overwrite every block on the device. You can remove or shrink the reservation if you know that the entire device won't be overwritten. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
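A sketch of that workaround, using the names from the message above:
zfs get volsize,refreservation,usedbyrefreservation miniraid/RichRAID
zfs set refreservation=none miniraid/RichRAID   # or some smaller value you can live with
zfs snapshot miniraid/RichRAID@exportable
The usual caveat: without the reservation, writes into the zvol can fail if the pool itself runs out of space.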
Re: [zfs-discuss] ZFS with Equallogic storage
On 08/21/10 10:14, Ross Walker wrote: I am trying to figure out the best way to provide both performance and resiliency given the Equallogic provides the redundancy. (I have no specific experience with Equallogic; the following is just generic advice) Every bit stored in zfs is checksummed at the block level; zfs will not use data or metadata if the checksum doesn't match. zfs relies on redundancy (storing multiple copies) to provide resilience; if it can't independently read the multiple copies and pick the one it likes, it can't recover from bitrot or failure of the underlying storage. if you want resilience, zfs must be responsible for redundancy. You imply having multiple storage servers. The simplest thing to do is export one large LUN from each of two different storage servers, and have ZFS mirror them. While this reduces the available space, depending on your workload, you can make some of it back by enabling compression. And, given sufficiently recent software, and sufficient memory and/or ssd for l2arc, you can enable dedup. Of course, the effectiveness of both dedup and compression depends on your workload. Would I be better off forgoing resiliency for simplicity, putting all my faith into the Equallogic to handle data resiliency? IMHO, no; the resulting system will be significantly more brittle. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
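A minimal sketch of that layout (the device names standing in for the two iSCSI LUNs are made up):
zpool create tank mirror c2t0d0 c3t0d0   # one LUN from each storage server
zfs set compression=on tank
# and, with sufficiently recent bits plus enough RAM/L2ARC:
# zfs set dedup=on tank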
Re: [zfs-discuss] resilver = defrag?
On 09/09/10 20:08, Edward Ned Harvey wrote: Scores so far: 2 No 1 Yes No. resilver does not re-layout your data or change what's in the block pointers on disk. if it was fragmented before, it will be fragmented after. C) Does zfs send | zfs receive mean it will defrag? Scores so far: 1 No 2 Yes "maybe". If there is sufficient contiguous freespace in the destination pool, files may be less fragmented. But if you do incremental sends of multiple snapshots, you may well replicate some or all of the fragmentation on the origin (because snapshots only copy the blocks that change, and receiving an incremental send does the same). And if the destination pool is short on space you may end up more fragmented than the source. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How do you use >1 partition on x86?
So when I built my new workstation last year, I partitioned the one and only disk in half, 50% for Windows, 50% for 2009.06. Now, I'm not using Windows, so I'd like to use the other half for another ZFS pool, but I can't figure out how to access it. I have used fdisk to create a second Solaris2 partition, did a reconfiguration reboot, but format still only shows the 1 available partition. How do I use the second partition? selecting c7t0d0 Total disk size is 30401 cylinders Cylinder size is 16065 (512 byte) blocks Cylinders Partition Status Type Start End Length % 1 Other OS 0 4 5 0 2 IFS: NTFS 5 1917 1913 6 3 Active Solaris2 1917 14971 13055 43 4 Solaris2 14971 30170 15200 50 format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c7t0d0 /p...@0,0/pci1028,2...@1f,2/d...@0,0 Thanks for any ideas. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
On 11/17/10 12:04, Miles Nordin wrote: black-box crypto is snake oil at any level, IMNSHO. Absolutely. Congrats again on finishing your project, but every other disk encryption framework I've seen taken remotely seriously has a detailed paper describing the algorithm, not just a list of features and a configuration guide. It should be a requirement for anything treated as more than a toy. I might have missed yours, or maybe it's coming soon. In particular, the mechanism by which dedup-friendly block IV's are chosen based on the plaintext needs public scrutiny. Knowing Darren, it's very likely that he got it right, but in crypto, all the details matter and if a spec detailed enough to allow for interoperability isn't available, it's safest to assume that some of the details are wrong. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
> > got it attached to a UPS with very conservative > shut-down timing. Or > > are there other host failures aside from power a > ZIL would be > > vulnerable to (system hard-locks?)? > > Correct, a system hard-lock is another example... How about comparing a non-battery-backed ZIL to running a ZFS dataset with sync=disabled. Which is more risky? This has been an educational thread for me... I was not aware that SSD drives had some DRAM in front of the SSD part? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] BOOT, ZIL, L2ARC one one SSD?
60GB SSD drives using the SF 1222 controller can be had now for around $100. I know ZFS likes to use the entire disk to do its magic, but under X86, is the entire disk the entire disk, or is it one physical X86 partition? In the past I have created 2 partitions with FDISK, but format will only show one of them? Did I do something wrong, or is that the way it works? So, maybe what I want to do won't work. But this is my thought: on a single 60GB SSD drive, use FDISK to create 3 physical partitions, a 20GB for boot, a 30GB for L2ARC and a 10GB for ZIL? Or is 3 physical Solaris partitions on a disk not considered the entire disk as far as ZFS is concerned? Can a ZIL and/or L2ARC be shared amongst 1+ ZPOOLs, or must each pool have its own? If each pool must have its own, can a disk be partitioned so a single fast SSD can be shared amongst 1+ pools? Thanks -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] BOOT, ZIL, L2ARC one one SSD?
Understood Edward, and if this was a production data center, I wouldn't be doing it this way. This is for my home lab, so spending hundreds of dollars on SSD devices isn't practical. Can several datasets share a single ZIL and a single L2ARC, or must each dataset have its own? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS advice for laptop
On 01/04/11 18:40, Bob Friesenhahn wrote: Zfs will disable write caching if it sees that a partition is being used This is backwards. ZFS will enable write caching on a disk if a single pool believes it owns the whole disk. Otherwise, it will do nothing to caching. You can enable it yourself with the format command and ZFS won't disable it. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
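For reference, the manual route is format's expert mode; roughly (the exact menus depend on the disk and driver, so treat this as a sketch):
format -e
# pick the disk, then navigate: format> cache -> write_cache -> display / enable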
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 11:49, Yi Zhang wrote: The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ultimate = "final". you must have a goal beyond the elimination of buffering in the filesystem. if the writes are made durable by zfs when you need them to be durable, why does it matter that it may buffer data while it is doing so? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 12:49, Yi Zhang wrote: If buffering is on, the running time of my app doesn't reflect the actual I/O cost. My goal is to accurately measure the time of I/O. With buffering on, ZFS would batch up a bunch of writes and change both the original I/O activity and the time. if batching main pool writes improves the overall throughput of the system over a more naive i/o scheduling model, don't you want your users to see the improvement in performance from that batching? why not set up a steady-state sustained workload that will run for hours, and measure how long it takes the system to commit each 1000 or 1 transactions in the middle of the steady state workload? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv initial data load
On 02/16/11 07:38, white...@gmail.com wrote: Is it possible to use a portable drive to copy the initial zfs filesystem(s) to the remote location and then make the subsequent incrementals over the network? Yes. > If so, what would I need to do to make sure it is an exact copy? Thank you, Rough outline: plug removable storage into source or a system near the source. zpool create backup pool on removable storage use an appropriate combination of zfs send & zfs receive to copy bits. zpool export backup pool. unplug removable storage move it plug it in to remote server zpool import backup pool use zfs send -i to verify that incrementals work (I did something like the above when setting up my home backup because I initially dinked around with the backup pool hooked up to a laptop and then moved it to a desktop system). optional: use zpool attach to mirror the removable storage to something faster/better/..., then after the mirror completes zpool detach to free up the removable storage. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
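A sketch of that outline in commands (pool, device, and snapshot names are made up):
zpool create backup c5t0d0                          # the removable drive
zfs snapshot -r tank@seed
zfs send -R tank@seed | zfs receive -d -F backup
zpool export backup
# physically move the drive, then at the remote site:
zpool import backup
# later, for the incrementals over the network:
zfs snapshot -r tank@incr1
zfs send -R -i tank@seed tank@incr1 | ssh remotehost zfs receive -d -F backup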
[zfs-discuss] time-sliderd doesn't remove snapshots
In the last few days my performance has gone to hell. I'm running: # uname -a SunOS nissan 5.11 snv_150 i86pc i386 i86pc (I'll upgrade as soon as the desktop hang bug is fixed.) The performance problems seem to be due to excessive I/O on the main disk/pool. The only things I've changed recently are that I've created and destroyed a snapshot, and I used "zpool upgrade". Here's what I'm seeing: # zpool iostat rpool 5 capacity operations bandwidth pool alloc free read write read write -- - - - - - - rpool 13.3G 807M 7 85 15.9K 548K rpool 13.3G 807M 3 89 1.60K 723K rpool 13.3G 810M 5 91 5.19K 741K rpool 13.3G 810M 3 94 2.59K 756K Using iofileb.d from the dtrace toolkit shows: # iofileb.d Tracing... Hit Ctrl-C to end. ^C PID CMD KB FILE 0 sched 6 5 zpool-rpool 7770 zpool status doesn't show any problems: # zpool status rpool pool: rpool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 c3d0s0 ONLINE 0 0 0 Perhaps related to this or perhaps not, I discovered recently that time-sliderd was doing just a ton of "close" requests. I disabled time-sliderd while trying to solve my performance problem. I was also getting these error messages in the time-sliderd log file: Warning: Cleanup failed to destroy: rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01 Details: ['/usr/bin/pfexec', '/usr/sbin/zfs', 'destroy', '-d', 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01'] failed with exit code 1 cannot destroy 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01': unsupported version That was the reason I did the zpool upgrade. I discovered that I had a *ton* of snapshots from time-slider that hadn't been destroyed, over 6500 of them, presumably all because of this version problem? I manually removed all the snapshots and my performance returned to normal. I don't quite understand what the "-d" option to "zfs destroy" does. Why does time-sliderd use it, and why does it prevent these snapshots from being destroyed? Shouldn't time-sliderd detect that it can't destroy any of the snapshots it's created and stop creating snapshots? And since I don't quite understand why time-sliderd was failing to begin with, I'm nervous about re-enabling it. Do I need to do a "zpool upgrade" on all my pools to make it work? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] time-sliderd doesn't remove snapshots
One of my old pools was version 10, another was version 13. I guess that explains the problem. Seems like time-sliderd should refuse to run on pools that aren't of a sufficient version. Cindy Swearingen wrote on 02/18/11 12:07 PM: Hi Bill, I think the root cause of this problem is that time slider implemented the zfs destroy -d feature but this feature is only available in later pool versions. This means that the routine removal of time slider generated snapshots fails on older pool versions. The zfs destroy -d feature (snapshot user holds) was introduced in pool version 18. I think this bug describes some or all of the problem: https://defect.opensolaris.org/bz/show_bug.cgi?id=16361 Thanks, Cindy On 02/18/11 12:34, Bill Shannon wrote: In the last few days my performance has gone to hell. I'm running: # uname -a SunOS nissan 5.11 snv_150 i86pc i386 i86pc (I'll upgrade as soon as the desktop hang bug is fixed.) The performance problems seem to be due to excessive I/O on the main disk/pool. The only things I've changed recently is that I've created and destroyed a snapshot, and I used "zpool upgrade". Here's what I'm seeing: # zpool iostat rpool 5 capacity operationsbandwidth poolalloc free read write read write -- - - - - - - rpool 13.3G 807M 7 85 15.9K 548K rpool 13.3G 807M 3 89 1.60K 723K rpool 13.3G 810M 5 91 5.19K 741K rpool 13.3G 810M 3 94 2.59K 756K Using iofileb.d from the dtrace toolkit shows: # iofileb.d Tracing... Hit Ctrl-C to end. ^C PID CMD KB FILE 0 sched 6 5 zpool-rpool7770 zpool status doesn't show any problems: # zpool status rpool pool: rpool state: ONLINE scan: none requested config: NAMESTATE READ WRITE CKSUM rpool ONLINE 0 0 0 c3d0s0ONLINE 0 0 0 Perhaps related to this or perhaps not, I discovered recently that time-sliderd was doing just a ton of "close" requests. I disabled time-sliderd while trying to solve my performance problem. I was also getting these error messages in the time-sliderd log file: Warning: Cleanup failed to destroy: rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01 Details: ['/usr/bin/pfexec', '/usr/sbin/zfs', 'destroy', '-d', 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01'] failed with exit code 1 cannot destroy 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01': unsupported version That was the reason I did the zpool upgrade. I discovered that I had a *ton* of snapshots from time-slider that hadn't been destroyed, over 6500 of them, presumably all because of this version problem? I manually removed all the snapshots and my performance returned to normal. I don't quite understand what the "-d" option to "zfs destroy" does. Why does time-sliderd use it, and why does it prevent these snapshots from being destroyed? Shouldn't time-sliderd detect that it can't destroy any of the snapshots it's created and stop creating snapshots? And since I don't quite understand why time-sliderd was failing to begin with, I'm nervous about re-enabling it. Do I need to do a "zpool upgrade" on all my pools to make it work? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Format returning bogus controller info
On 02/26/11 17:21, Dave Pooser wrote: While trying to add drives one at a time so I can identify them for later use, I noticed two interesting things: the controller information is unlike any I've seen before, and out of nine disks added after the boot drive all nine are attached to c12 -- and no single controller has more than eight ports. on your system, c12 is the mpxio virtual controller; any disk which is potentially multipath-able (and that includes the SAS drives) will appear as a child of the virtual controller (rather than appear as the child of two or more different physical controllers). see stmsboot(1m) for information on how to turn that off if you don't need multipathing and don't like the longer device names. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
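For reference, a sketch of checking and changing that (see stmsboot(1M) first; the change takes effect on reboot):
stmsboot -L   # list the mapping between non-STMS and STMS device names
stmsboot -d   # disable mpxio on supported HBAs if you don't need multipathing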
[zfs-discuss] Old posts to zfs-discuss
Sorry for the old posts that some of you are seeing to zfs-discuss. The link between Jive and mailman was broken so I fixed that. However, once this was fixed Jive started sending every single post from the zfs-discuss board on Jive to the mail list. Quite a few posts were sent before I realized what was happening and was able to kill the process. Bill Rushmore ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] not sure how to make filesystems
I'm migrating some filesystems from UFS to ZFS and I'm not sure how to create a couple of them. I want to migrate /, /var, /opt, /export/home and also want swap and /tmp. I don't care about any of the others. The first disk, and the one with the UFS filesystems, is c0t0d0 and the 2nd disk is c0t1d0. I've been told that /tmp is supposed to be part of swap. So far I have: lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /var:/dev/dsk/c0t0d0s3:ufs -m /export/home:/dev/dsk/c0t0d0s5:ufs -m /opt:/dev/dsk/c0t0d0s4:ufs -m -:/dev/dsk/c0t1d0s2:swap -m /tmp:/dev/dsk/c0t1d0s3:swap -n zfsBE -p rootpool And then set quotas for them. Is this right? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is another drive worth anything?
On 05/31/11 09:01, Anonymous wrote: > Hi. I have a development system on Intel commodity hardware with a 500G ZFS > root mirror. I have another 500G drive same as the other two. Is there any > way to use this disk to good advantage in this box? I don't think I need any > more redundancy, I would like to increase performance if possible. I have > only one SATA port left so I can only use 3 drives total unless I buy a PCI > card. Would you please advise me. Many thanks. I'd use the extra SATA port for an ssd, and use that ssd for some combination of boot/root, ZIL, and L2ARC. I have a couple systems in this configuration now and have been quite happy with the config. While slicing an ssd and using one slice for root, one slice for zil, and one slice for l2arc isn't optimal from a performance standpoint and won't scale up to a larger configuration, it is a noticeable improvement from a 2-disk mirror. I used an 80G intel X25-M, with 1G for zil, with the rest split roughly 50:50 between root pool and l2arc for the data pool. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
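A sketch of wiring the slices up that way (slice numbers are examples; the root pool lives on whatever slice you gave the installer):
zpool add tank log c1t1d0s1     # the ~1GB slice as slog/ZIL
zpool add tank cache c1t1d0s2   # the remaining slice as L2ARC
zpool status tank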
Re: [zfs-discuss] Available space confusion
On 06/06/11 08:07, Cyril Plisko wrote: zpool reports space usage on disks, without taking into account RAIDZ overhead. zfs reports net capacity available, after RAIDZ overhead is accounted for. Yup. Going back to the original numbers: nebol@filez:/$ zfs list tank2 NAME USED AVAIL REFER MOUNTPOINT tank2 3.12T 902G 32.9K /tank2 Given that it's a 4-disk raidz1, you have (roughly) one block of parity for every three blocks of data. 3.12T / 3 = 1.04T so 3.12T + 1.04T = 4.16T, which is close to the 4.18T shown by zpool list: NAME SIZE USED AVAIL CAP HEALTH ALTROOT tank2 5.44T 4.18T 1.26T 76% ONLINE ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wired write performance problem
On 06/08/11 01:05, Tomas Ögren wrote: And if pool usage is >90%, then there's another problem (a change in the free-space-finding algorithm). Another (less satisfying) workaround is to increase the amount of free space in the pool, either by reducing usage or adding more storage. Observed behavior is that allocation is fast until usage crosses a threshold, then performance hits a wall. I have a small sample size (maybe 2-3 samples), but the threshold point varies from pool to pool but tends to be consistent for a given pool. I suspect some artifact of layout/fragmentation is at play. I've seen things hit the wall at as low as 70% on one pool. The original poster's pool is about 78% full. If possible, try freeing stuff until usage goes back under 75% or 70% and see if your performance returns. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disk replacement need to scan full pool ?
On 06/14/11 04:15, Rasmus Fauske wrote: > I want to replace some slow consumer drives with new edc re4 ones but > when I do a replace it needs to scan the full pool and not only that > disk set (or just the old drive) > > Is this normal? (the speed is always slow in the start so that's not > what I am wondering about, but that it needs to scan all of my 18.7T to > replace one drive) This is normal. The resilver is not reading all data blocks; it's reading all of the metadata blocks which contain one or more block pointers, which is the only way to find all the allocated data (and in the case of raidz, know precisely how it's spread and encoded across the members of the vdev). And it's reading all the data blocks needed to reconstruct the disk to be replaced. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow
On 06/16/11 15:36, Sven C. Merckens wrote: > But is the L2ARC also important while writing to the device? Because > the storage systems are used most of the time only for writing data to them, > the read cache (as I thought) isn't a performance factor... Please > correct me, if my thoughts are wrong. if you're using dedup, you need a large read cache even if you're only doing application-layer writes, because you need fast random read access to the dedup tables while you write. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 06/27/11 15:24, David Magda wrote: > Given the amount of transistors that are available nowadays I think > it'd be simpler to just create a series of SIMD instructions right > in/on general CPUs, and skip the whole co-processor angle. see: http://en.wikipedia.org/wiki/AES_instruction_set Present in many current Intel CPUs; also expected to be present in AMD's "Bulldozer" based CPUs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "zfs diff" performance disappointing
On 09/26/11 12:31, Nico Williams wrote: > On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea wrote: >> Should I disable "atime" to improve "zfs diff" performance? (most data >> doesn't change, but "atime" of most files would change). > > atime has nothing to do with it. based on my experiences with time-based snapshots and atime on a server which had cron-driven file tree walks running every night, I can easily believe atime has a lot to do with it - the atime updates associated with a tree walk will mean that that much of a filesystem's metadata will diverge between the writeable filesystem and its last snapshot. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
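So if the nightly tree walks can't be avoided, turning atime off on the datasets being walked is worth trying (dataset name hypothetical):
zfs set atime=off tank/export
zfs diff tank/export@snap1 tank/export@snap2   # compare timings once a new snapshot pair exists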
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors: pool: r00t state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14 2008 config: NAME STATE READ WRITE CKSUM r00t ONLINE 0 0 2 mirror ONLINE 0 0 2 c4t0d0s0 ONLINE 0 0 4 c4t1d0s0 ONLINE 0 0 4 I ran it again, and it's now reporting the same errors, but still says "applications are unaffected": pool: r00t state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008 config: NAME STATE READ WRITE CKSUM r00t ONLINE 0 0 4 mirror ONLINE 0 0 4 c4t0d0s0 ONLINE 0 0 8 c4t1d0s0 ONLINE 0 0 8 errors: No known data errors I wonder if I'm running into some combination of: 6725341 Running 'zpool scrub' repeatedly on a pool show an ever increasing error count and maybe: 6437568 ditto block repair is incorrectly propagated to root vdev Any way to dig further to determine what's going on? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote: > > I ran a scrub on a root pool after upgrading to snv_94, and got checksum > > errors: > > Hmm, after reading this, I started a zpool scrub on my mirrored pool, > on a system that is running post snv_94 bits: It also found checksum errors > > # zpool status files > pool: files > state: DEGRADED > status: One or more devices has experienced an unrecoverable error. An > attempt was made to correct the error. Applications are unaffected. > action: Determine if the device needs to be replaced, and clear the errors > using 'zpool clear' or replace the device with 'zpool replace'. >see: http://www.sun.com/msg/ZFS-8000-9P > scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008 > config: > > NAME STATE READ WRITE CKSUM > files DEGRADED 0 018 > mirror DEGRADED 0 018 > c8t0d0s6 DEGRADED 0 036 too many errors > c9t0d0s6 DEGRADED 0 036 too many errors > > errors: No known data errors out of curiosity, is this a root pool? A second system of mine with a mirrored root pool (and an additional large multi-raidz pool) shows the same symptoms on the mirrored root pool only. once is accident. twice is coincidence. three times is enemy action :-) I'll file a bug as soon as I can (I'm travelling at the moment with spotty connectivity), citing my and your reports. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can I trust ZFS?
On Sun, 2008-08-03 at 11:42 -0500, Bob Friesenhahn wrote: > Zfs makes human error really easy. For example > >$ zpool destroy mypool Note that "zpool destroy" can be undone by "zpool import -D" (if you get to it before the disks are overwritten). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
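A sketch of what that recovery looks like, using the pool name from the example:
zpool import -D          # lists destroyed pools that are still recoverable
zpool import -D mypool   # re-imports the destroyed pool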
Re: [zfs-discuss] Checksum error: which of my files have failed scrubbing?
On Tue, 2008-08-05 at 12:11 -0700, soren wrote: > > soren wrote: > > > ZFS has detected that my root filesystem has a > > small number of errors. Is there a way to tell which > > specific files have been corrupted? > > > > After a scrub a zpool status -v should give you a > > list of files with > > unrecoverable errors. > > Hmm, I just tried that. Perhaps "No known data errors" means that my files > are OK. In that case I wonder what the checksum failure was from. If this is build 94 and you have one or more unmounted filesystems, (such as alternate boot environments), these errors are false positives. There is no actual error; the scrubber misinterpreted the end of an intent log block chain as a checksum error. the bug id is: 6727872 zpool scrub: reports checksum errors for pool with zfs and unplayed ZIL This bug is fixed in build 95. One workaround is to mount the filesystems and then unmount them to apply the intent log changes. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Block unification in ZFS
See the long thread titled "ZFS deduplication", last active approximately 2 weeks ago. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] more ZFS recovery
On Thu, 2008-08-07 at 11:34 -0700, Richard Elling wrote: > How would you describe the difference between the data recovery > utility and ZFS's normal data recovery process? I'm not Anton but I think I see what he's getting at. Assume you have disks which once contained a pool but all of the uberblocks have been clobbered. So you don't know where the root of the block tree is, but all the actual data is there, intact, on the disks. Given the checksums you could rebuild one or more plausible structure of the pool from the bottom up. I'd think that you could construct an offline zpool data recovery tool where you'd start with N disk images and a large amount of extra working space, compute checksums of all possible data blocks on the images, scan the disk images looking for things that might be valid block pointers, and attempt to stitch together subtrees of the filesystem and recover as much as you can even if many upper nodes in the block tree have had holes shot in them by a miscreant device. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best layout for 15 disks?
On Thu, 2008-08-21 at 21:15 -0700, mike wrote: > I've seen 5-6 disk zpools are the most recommended setup. This is incorrect. Much larger zpools built out of striped redundant vdevs (mirror, raidz1, raidz2) are recommended and also work well. raidz1 or raidz2 vdevs of more than a single-digit number of drives are not recommended. so, for instance, the following is an appropriate use of 12 drives in two raidz2 sets of 6 disks, with 8 disks worth of raw space available: zpool create mypool raidz2 disk0 disk1 disk2 disk3 disk4 disk5 zpool add mypool raidz2 disk6 disk7 disk8 disk9 disk10 disk11 > In traditional RAID terms, I would like to do RAID5 + hot spare (13 > disks usable) out of the 15 disks (like raidz2 I suppose). What would > make the most sense to setup 15 disks with ~ 13 disks of usable space? Enable compression, and set up multiple raidz2 groups. Depending on what you're storing, you may get back more than you lose to parity. > This is for a home fileserver, I do not need HA/hotplugging/etc. so I > can tolerate a failure and replace it with plenty of time. It's not > mission critical. That's a lot of spindles for a home fileserver. I'd be inclined to go with a smaller number of larger disks in mirror pairs, allowing me to buy larger disks in pairs as they come on the market to increase capacity. > Same question, but 10 disks, and I'd sacrifice one for parity then. > Not two. so ~9 disks usable roughly (like raidz) zpool create mypool raidz1 disk0 disk1 disk2 disk3 disk4 zpool add mypool raidz1 disk5 disk6 disk7 disk8 disk9 8 disks raw capacity, can survive the loss of any one disk or the loss of two disks in different raidz groups. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to both a) prefer one drive over another when > selecting the child and b) proactively timeout/ignore results from one > child and select the other if it's taking longer than some historical > standard deviation. This keeps away from diagnosing drives as faulty, > but does allow ZFS to make better choices and maintain response times. > It shouldn't be hard to keep track of the average and/or standard > deviation and use it for selection; proactively timing out the slow I/Os > is much trickier. tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network. it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance). I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote: > 2. The algorithm *must* be computationally efficient. >We are looking down the tunnel at I/O systems that can >deliver on the order of 5 Million iops. We really won't >have many (any?) spare cycles to play with. If you pick the constants carefully (powers of two) you can do the TCP RTT + variance estimation using only a handful of shifts, adds, and subtracts. > In both of these cases, the solutions imply multi-minute timeouts are > required to maintain a stable system. Again, there are different uses for timeouts: 1) how long should we wait on an ordinary request before deciding to try "plan B" and go elsewhere (a la B_FAILFAST) 2) how long should we wait (while trying all alternatives) before declaring an overall failure and giving up. The RTT estimation approach is really only suitable for the former, where you have some alternatives available (retransmission in the case of TCP; trying another disk in the case of mirrors, etc.,). when you've tried all the alternatives and nobody's responding, there's no substitute for just retrying for a long time. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote: > It's sort of like network QoS, but not quite, because: > > (a) you don't know exactly how big the ``pipe'' is, only > approximately, In an ip network, end nodes generally know no more than the pipe size of the first hop -- and in some cases (such as true CSMA networks like classical ethernet or wireless) only have an upper bound on the pipe size. beyond that, they can only estimate the characteristics of the rest of the network by observing its behavior - all they get is end-to-end latency, and *maybe* a 'congestion observed' mark set by an intermediate system. > (c) all the fabrics are lossless, so while there are queues which > undesireably fill up during congestion, these queues never drop > ``packets'' but instead exert back-pressure all the way up to > the top of the stack. hmm. I don't think the back pressure makes it all the way up to zfs (the top of the block storage stack) except as added latency. (on the other hand, if it did, zfs could schedule around it both for reads and writes, avoiding pouring more work on already-congested paths..) > I'm surprised we survive as well as we do without disk QoS. Are the > storage vendors already doing it somehow? I bet that (as with networking) in many/most cases overprovisioning the hardware and running at lower average utilization is often cheaper in practice than running close to the edge and spending a lot of expensive expert time monitoring performance and tweaking QoS parameters. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss