[zfs-discuss] SunOS neptune 5.11 snv_127 sun4u sparc SUNW, Sun-Fire-880
I just went through a BFU update to snv_127 on a V880 :

neptune console login: root
Password:
Nov 3 08:19:12 neptune login: ROOT LOGIN /dev/console
Last login: Mon Nov 2 16:40:36 on console
Sun Microsystems Inc.  SunOS 5.11  snv_127  Nov. 02, 2009
SunOS Internal Development: root 2009-Nov-02 [onnv_127-tonic]
bfu'ed from /build/archives-nightly-osol/sparc on 2009-11-03

I have [ high ] hopes that there was a small tarball somewhere which contained the sources listed in :

http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html

Is there such a tarball anywhere at all or shall I just wait for the putback to hit the mercurial repo ?

Yes .. this is sort of begging .. but I call it "enthusiasm" :-)

--
Dennis Clarke
dcla...@opensolaris.ca <- Email related to the open source Solaris
dcla...@blastwave.org <- Email related to open source for Solaris

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS dedup issue
Hi,

Let's take a look:

# zpool list
NAME    SIZE   USED  AVAIL   CAP   DEDUP   HEALTH  ALTROOT
rpool    68G  13.9G  54.1G   20%  42.27x   ONLINE  -

# zfs get all rpool/export/data
NAME               PROPERTY                        VALUE                  SOURCE
rpool/export/data  type                            filesystem             -
rpool/export/data  creation                        Mon Nov  2 16:11 2009  -
rpool/export/data  used                            46.7G                  -
rpool/export/data  available                       38.7M                  -
rpool/export/data  referenced                      46.7G                  -
rpool/export/data  compressratio                   1.00x                  -
rpool/export/data  mounted                         yes                    -
rpool/export/data  quota                           none                   default
rpool/export/data  reservation                     none                   default
rpool/export/data  recordsize                      128K                   default
rpool/export/data  mountpoint                      /export/data           inherited from rpool/export
rpool/export/data  sharenfs                        off                    default
rpool/export/data  checksum                        on                     default
rpool/export/data  compression                     off                    default
rpool/export/data  atime                           on                     default
rpool/export/data  devices                         on                     default
rpool/export/data  exec                            on                     default
rpool/export/data  setuid                          on                     default
rpool/export/data  readonly                        off                    default
rpool/export/data  zoned                           off                    default
rpool/export/data  snapdir                         hidden                 default
rpool/export/data  aclmode                         groupmask              default
rpool/export/data  aclinherit                      restricted             default
rpool/export/data  canmount                        on                     default
rpool/export/data  shareiscsi                      off                    default
rpool/export/data  xattr                           on                     default
rpool/export/data  copies                          1                      default
rpool/export/data  version                         4                      -
rpool/export/data  utf8only                        off                    -
rpool/export/data  normalization                   none                   -
rpool/export/data  casesensitivity                 sensitive              -
rpool/export/data  vscan                           off                    default
rpool/export/data  nbmand                          off                    default
rpool/export/data  sharesmb                        off                    default
rpool/export/data  refquota                        none                   default
rpool/export/data  refreservation                  none                   default
rpool/export/data  primarycache                    all                    default
rpool/export/data  secondarycache                  all                    default
rpool/export/data  usedbysnapshots                 0                      -
rpool/export/data  usedbydataset                   46.7G                  -
rpool/export/data  usedbychildren                  0                      -
rpool/export/data  usedbyrefreservation            0                      -
rpool/export/data  logbias                         latency                default
rpool/export/data  dedup                           on                     local
rpool/export/data  org.opensolaris.caiman:install  ready                  inherited from rpool

# df -h
Filesystem                      Size  Used  Avail  Use%  Mounted on
rpool/ROOT/os_b123_dev          2.4G  2.4G    40M   99%  /
swap                            9.1G  336K   9.1G    1%  /etc/svc/volatile
/usr/lib/libc/libc_hwcap1.so.1  2.4G  2.4G    40M   99%  /lib/libc.so.1
swap                            9.1G     0   9.1G    0%  /tmp
swap                            9.1G   40K   9.1G    1%  /var/run
rpool/export                     40M   25K    40M    1%  /export
rpool/export/home                40M   30K    40M    1%  /export/home
rpool/export/home/admin         460M  421M    40M   92%  /export/home/admin
rpool                            40M   83K    40M    1%  /rpool
rpool/expo
Re: [zfs-discuss] dedup question
On 2-Nov-09, at 3:16 PM, Nicolas Williams wrote:

> On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote:
>> forgive my ignorance, but what's the advantage of this new dedup over the existing compression option? Wouldn't full-filesystem compression naturally de-dupe?
> ...
> There are many examples where snapshot/clone isn't feasible but dedup can help.
>
> For example: mail stores (though they can do dedup at the application layer by using message IDs and hashes).
>
> For example: home directories (think of users saving documents sent via e-mail).
>
> For example: source code workspaces (ONNV, Xorg, Linux, whatever), where users might not think ahead to snapshot/clone a local clone (I also tend to maintain a local SCM clone that I then snapshot/clone to get workspaces for bug fixes and projects; it's a pain, really).
>
> I'm sure there are many, many other examples.

A couple that come to mind... Some patterns become much cheaper with dedup:

- The Subversion working copy format, where you have the reference checked-out file alongside the working file
- A QA/testing system where you might have dozens or hundreds of build iterations of an application, mostly identical

Exposing checksum metadata might have interesting implications for operations like diff, cmp, rsync, even tar.

--Toby

> The workspace example is particularly interesting: with the snapshot/clone approach you get to deduplicate the _source code_, but not the _object code_, while with dedup you get both dedup'ed automatically.
>
> As for compression, that helps whether you dedup or not, and it helps by about the same factor either way -- dedup and compression are unrelated, really.
>
> Nico

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SunOS neptune 5.11 snv_127 sun4u sparc SUNW, Sun-Fire-880
Dennis Clarke wrote:
> I just went through a BFU update to snv_127 on a V880 :
>
> neptune console login: root
> Password:
> Nov 3 08:19:12 neptune login: ROOT LOGIN /dev/console
> Last login: Mon Nov 2 16:40:36 on console
> Sun Microsystems Inc.  SunOS 5.11  snv_127  Nov. 02, 2009
> SunOS Internal Development: root 2009-Nov-02 [onnv_127-tonic]
> bfu'ed from /build/archives-nightly-osol/sparc on 2009-11-03
>
> I have [ high ] hopes that there was a small tarball somewhere which contained the sources listed in :
>
> http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html
>
> Is there such a tarball anywhere at all or shall I just wait for the putback to hit the mercurial repo ?
>
> Yes .. this is sort of begging .. but I call it "enthusiasm" :-)

Hi Dennis,
we haven't done source tarballs or Mercurial bundles in quite some time, since it's more efficient for you to pull from the Mercurial repo and build it yourself :)

Also, the build 127 tonic bits that I generated today (and which you appear to be using) won't contain Jeff's push from yesterday, because that changeset is part of build 128 - and I haven't closed the build yet.

The push is in the repo, btw:

changeset:   10922:e2081f502306
user:        Jeff Bonwick
date:        Sun Nov 01 14:14:46 2009 -0800
comments:
    PSARC 2009/571 ZFS Deduplication Properties
    6677093 zfs should have dedup capability

cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
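[Editorial note: for anyone who wants to follow James' suggestion and pull the gate themselves, a minimal sketch; the anonymous onnv-gate URL here is an assumption, and the changeset ID is the one quoted above.]

$ hg clone ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate onnv-gate   # first-time clone (URL assumed)
$ cd onnv-gate && hg pull -u                                           # or refresh an existing clone
$ hg log -r e2081f502306                                               # confirm the dedup push is present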
[zfs-discuss] zfs on multiple machines
Hi, is it possible to link multiple machines into one storage pool using zfs? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] More Dedupe Questions...
Tristan Ball wrote:
> I'm curious as to how send/recv intersects with dedupe... if I send/recv a deduped filesystem, is the data sent in its de-duped form, ie just sent once, followed by the pointers for subsequent dupe data, or is the data sent in expanded form, with the recv side system then having to redo the dedupe process?

The on disk dedup and dedup of the stream are actually separate features. Stream dedup hasn't been integrated yet. It will be a choice at *send* time if the stream is to be deduplicated.

> Obviously sending it deduped is more efficient in terms of bandwidth and CPU time on the recv side, but it may also be more complicated to achieve?

A stream can be deduped even if the on disk format isn't, and vice versa.

> Also - do we know yet what effect block size has on dedupe? My guess is that a smaller block size will perhaps give a better duplication match rate, but at the cost of higher CPU usage and perhaps reduced performance, as the system will need to store larger de-dupe hash tables?

That really depends on how the applications write blocks and what your data is like. It could go either way very easily. As with all dedup it is a trade off between IO bandwidth and CPU/memory. Sometimes dedup will improve performance, since like compression it can reduce IO requirements, but depending on workload the CPU/memory overhead may or may not be worth it (same with compression).

--
Darren J Moffat
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
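[Editorial note: once stream deduplication does integrate, the send-time choice Darren describes would presumably be expressed as a flag on zfs send. This is a sketch only; the -D flag name is an assumption, since the feature had not been putback when this thread was written.]

# send a deduplicated replication stream to another host (flag name assumed)
zfs send -D -R tank/fs@monday | ssh otherhost zfs recv -d backuppool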
Re: [zfs-discuss] zfs on multiple machines
Miha Voncina wrote:
> Hi, is it possible to link multiple machines into one storage pool using zfs?

Depends what you mean by this.

Multiple machines can not import the same ZFS pool at the same time; doing so *will* cause corruption, and ZFS tries hard to protect against multiple imports.

However ZFS can use iSCSI LUNs from multiple target machines for the disks that make up a given pool.

ZFS volumes (ZVOLs) can also be used as iSCSI targets and thus shared out to multiple machines.

ZFS file systems can be shared over NFS and CIFS and thus shared by multiple machines.

ZFS pools can be used in a Sun Cluster configuration but will only be imported into a single node of a Sun Cluster configuration at a time.

--
Darren J Moffat
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
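[Editorial note: a minimal sketch of the sharing options Darren lists, with made-up pool and dataset names.]

# a ZVOL exported as an iSCSI target, usable as a raw disk by another machine
zfs create -V 100g tank/vm01
zfs set shareiscsi=on tank/vm01

# a file system shared to many machines over NFS and CIFS
zfs set sharenfs=on tank/home
zfs set sharesmb=on tank/home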
Re: [zfs-discuss] SunOS neptune 5.11 snv_127 sun4u sparc SUNW, Sun-Fire-880
> Dennis Clarke wrote:
>> I just went through a BFU update to snv_127 on a V880 :
>>
>> neptune console login: root
>> Password:
>> Nov 3 08:19:12 neptune login: ROOT LOGIN /dev/console
>> Last login: Mon Nov 2 16:40:36 on console
>> Sun Microsystems Inc. SunOS 5.11 snv_127 Nov. 02, 2009
>> SunOS Internal Development: root 2009-Nov-02 [onnv_127-tonic]
>> bfu'ed from /build/archives-nightly-osol/sparc on 2009-11-03
>>
>> I have [ high ] hopes that there was a small tarball somewhere which
>> contained the sources listed in :
>>
>> http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html
>>
>> Is there such a tarball anywhere at all or shall I just wait
>> for the putback to hit the mercurial repo ?
>>
>> Yes .. this is sort of begging .. but I call it "enthusiasm" :-)
>
> Hi Dennis,
> we haven't done source tarballs or Mercurial bundles in quite
> some time, since it's more efficient for you to pull from the
> Mercurial repo and build it yourself :)

Well, funny you should mention it. I was this close ( -->|.|<-- ) to running a nightly build and then I had a minor brainwave .. "why bother?" because the sparc archive bits were there already.

> Also, the build 127 tonic bits that I generated today (and
> which you appear to be using) won't contain Jeff's push from
> yesterday, because that changeset is part of build 128 - and
> I haven't closed the build yet.
>
> The push is in the repo, btw:
>
> changeset: 10922:e2081f502306
> user:    Jeff Bonwick
> date:    Sun Nov 01 14:14:46 2009 -0800
> comments:
> PSARC 2009/571 ZFS Deduplication Properties
> 6677093 zfs should have dedup capability

funny .. I didn't see it last night. :-\

I'll blame the coffee and go get a "nightly" happening right away :-)

Thanks for the reply!

--
Dennis Clarke
dcla...@opensolaris.ca <- Email related to the open source Solaris
dcla...@blastwave.org <- Email related to open source for Solaris

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup issue
> So.. it seems that data is deduplicated, zpool has
> 54.1G of free space, but I can use only 40M.
>
> It's x86, ONNV revision 10924, debug build, bfu'ed from b125.

I think I'm observing the same (with changeset 10936) ...

I created a 2GB file, and a "tank" zpool on top of that file, with compression and dedup enabled:

mkfile 2g /var/tmp/tank.img
zpool create tank /var/tmp/tank.img
zfs set dedup=on tank
zfs set compression=on tank

Now I tried to create four zfs filesystems, and filled them by pulling and updating the same set of onnv sources from mercurial. One copy needs ~ 800MB of disk space uncompressed, or ~ 520MB compressed. During the 4th "hg update":

> hg update
abort: No space left on device: /tank/snv_128_yy/usr/src/lib/libast/sparcv9/src/lib/libast/FEATURE/common

> zpool list tank
NAME   SIZE   USED  AVAIL   CAP  DEDUP  HEALTH  ALTROOT
tank  1,98G   720M  1,28G   35%  3.70x  ONLINE  -

> zfs list -r tank
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank             1,95G      0    26K  /tank
tank/snv_128      529M      0   529M  /tank/snv_128
tank/snv_128_jk   530M      0   530M  /tank/snv_128_jk
tank/snv_128_xx   530M      0   530M  /tank/snv_128_xx
tank/snv_128_yy   368M      0   368M  /tank/snv_128_yy

--
This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup issue
> I think I'm observing the same (with changeset 10936) ...

# mkfile 2g /var/tmp/tank.img
# zpool create tank /var/tmp/tank.img
# zfs set dedup=on tank
# zfs create tank/foobar

> dd if=/dev/urandom of=/tank/foobar/file1 bs=1024k count=512
512+0 records in
512+0 records out

> cp /tank/foobar/file1 /tank/foobar/file2
> cp /tank/foobar/file1 /tank/foobar/file3
> cp /tank/foobar/file1 /tank/foobar/file4
/tank/foobar/file4: No space left on device

> zfs list -r tank
NAME          USED  AVAIL  REFER  MOUNTPOINT
tank         1.95G      0    22K  /tank
tank/foobar  1.95G      0  1.95G  /tank/foobar

> zpool list tank
NAME   SIZE  USED  AVAIL   CAP  DEDUP  HEALTH  ALTROOT
tank  1.98G  515M  1.48G   25%  3.90x  ONLINE  -

--
This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
On 11/2/2009 9:23 PM, Marion Hakanson wrote:
> Could it be that c12t1d0 was at some time in the past (either in this machine or another machine) known as c3t11d0, and was part of a pool called "dbzpool"?

Quite possibly. But certainly not this host's dbzpool.

> You'll need to give the same "dd" treatment to the end of the disk as well; ZFS puts copies of its labels at the beginning and at the end.

Oh, and I'm not sure what you mean here - I thought p0 was the entire disk in x86 - and s2 was the whole disk in the partition. What else should I overwrite?

Thanks,

--
Jeremy Kister
http://jeremy.kister.net./
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CR6894234 -- improved sgid directory compatibility with non-Solaris NFS clients
On Nov 2, 2009, at 2:38 PM, "Paul B. Henson" wrote:

> On Sat, 31 Oct 2009, Al Hopper wrote:
>> Kudos to you - nice technical analysis and presentation. Keep lobbying your point of view - I think interoperability should win out if it comes down to an arbitrary decision.
>
> Thanks; but so far that doesn't look promising. Right now I've got a cron job running every hour on the backend servers crawling around and fixing permissions on new directories :(. You would have thought something like this would have been noticed in one of the NFS interoperability bake offs.

Paul,

Maybe you're approaching this the wrong way. Maybe this isn't an interoperability fix, but a security fix, as it allows non-Sun clients to bypass security restrictions placed on a sgid protected directory tree because it doesn't properly test the existence of that bit upon file creation.

If an appropriate scenario can be made, and I'm sure it can, one might even post a CERT advisory to make sure operators are made aware of this potential security problem.

-Ross
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe is in
I was under the impression that you can create a new zfs dataset and turn on the dedup functionality, and copy your data to it. Or am I wrong? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Location of ZFS documentation (source)?
Alex, You can download the man page source files from this URL: http://dlc.sun.com/osol/man/downloads/current/ If you want a different version, you can navigate to the available source consolidations from the Downloads page on opensolaris.org. Thanks, Cindy On 11/02/09 16:39, Cindy Swearingen wrote: Hi Alex, I'm checking with some folks on how we handled this handoff for the previous project. I'll get back to you shortly. Thanks, Cindy On 11/02/09 16:07, Alex Blewitt wrote: The man pages documentation from the old Apple port (http://github.com/alblue/mac-zfs/tree/master/zfs_documentation/man8/) don't seem to have a corresponding source file in the onnv-gate repository (http://hub.opensolaris.org/bin/view/Project+onnv/WebHome) although I've found the text on-line (http://docs.sun.com/app/docs/doc/819-2240/zfs-1m) Can anyone point me to where these are stored, so that we can update the documentation in the Apple fork? Alex ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] De-Dupe and iSCSI
Good morning all...

Great work on the De-Dupe stuff. Can't wait to try it out. But a quick question about iSCSI and De-Dupe: will it work? If I share out a ZVOL to another machine and copy some similar files to it (thinking VMs), will they get de-duplicated?

Thanks.

--
Tiernan O'Toole
blog.lotas-smartman.net
www.tiernanotoolephotography.com
www.the-hairy-one.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe is in
Orvar Korvar wrote: I was under the impression that you can create a new zfs dataset and turn on the dedup functionality, and copy your data to it. Or am I wrong? you don't even have to create a new dataset just do: # zfs set dedup=on -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
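[Editorial note: for completeness, the property takes a dataset (or pool) name, and it only affects data written after it is set; the names below are examples.]

# zfs set dedup=on tank/export/data
# zfs get dedup tank/export/data
# zpool list tank        # the DEDUP column shows the pool-wide dedup ratio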
Re: [zfs-discuss] De-Dupe and iSCSI
Tiernan OToole wrote:
> Good morning all... Great work on the De-Dupe stuff. cant wait to try it out. but quick question about iSCSI and De-Dupe. will it work? if i share out a ZVOL to another machine and copy some simular files to it (thinking VMs) will they get de-duplicated?

It works, but how much benefit you will get from it, since it is block- not file-based, depends on what type of filesystem and/or application is on the iSCSI target.

--
Darren J Moffat
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
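[Editorial note: a sketch of Tiernan's scenario with assumed names and sizes. Because dedup works on ZFS blocks, picking a volblocksize that matches the block/cluster size of the filesystem the initiator puts on the LUN tends to improve the match rate.]

zfs create -V 200g -o volblocksize=8k -o dedup=on tank/vmstore
zfs set shareiscsi=on tank/vmstore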
Re: [zfs-discuss] CR6894234 -- improved sgid directory compatibility with non-Solaris NFS clients
On Tue, 3 Nov 2009, Ross Walker wrote:

> Maybe this isn't an interoperability fix, but a security fix as it allows
> non-Sun clients to bypass security restrictions placed on a sgid
> protected directory tree because it doesn't properly test the existence
> of that bit upon file creation.
>
> If an appropriate scenario can be made, and I'm sure it can, one might
> even post a CERT advisory to make sure operators are made aware of this
> potential security problem.

I agree it's a security issue, I think I mentioned that at some point in this thread. However, it doesn't allow a client to do something they couldn't do anyway. If the sgid bit was respected and the directory was created with the right group, the client could chgrp it to their primary group afterwards.

The security issue isn't that an evil client will avail of this to end up with a directory owned by the wrong group, it's that a poor innocent client will end up with a directory owned by their primary group rather than the group of the parent directory, and any inherited group@ ACL will apply to the primary group, resulting in insecure and unintended access :(.

Another possible security issue that came up while I was discussing this issue with one of the Linux NFSv4 developers is that relying upon the client to set the ownership of the directory results in a race condition and is in their opinion buggy. In between the time the client generates the mkdir request and sends it over the wire and the server receives it, someone else might have changed the permissions or group ownership of the parent directory, resulting in the explicitly specified group provided by the client being wrong. They refuse to implement this buggy behavior, and to quote them, "You should get Sun to fix their server". I'm trying to do that, but no luck so far ...

--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | hen...@csupomona.edu
California State Polytechnic University | Pomona CA 91768
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] More Dedupe Questions...
Hi Darren,

More below...

Darren J Moffat wrote:
> Tristan Ball wrote:
>> Obviously sending it deduped is more efficient in terms of bandwidth and CPU time on the recv side, but it may also be more complicated to achieve?
>
> A stream can be deduped even if the on disk format isn't and vice versa.

Is the send dedup'ing more efficient if the filesystem is already dedup'd? If both are enabled do they share anything?

-Kyle
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SNV_125 MPT warning in logfile
We see the same issue on a x4540 Thor system with 500G disks: lots of:

...
Nov 3 16:41:46 uva.nl scsi: [ID 107833 kern.warning] WARNING: /p...@3c,0/pci10de,3...@f/pci1000,1...@0 (mpt5):
Nov 3 16:41:46 encore.science.uva.nl Disconnected command timeout for Target 7
...

This system is running nv125 XvM. Seems to occur more when we are using vm-s. This of course causes very long interruptions on the vm-s as well...

--
This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] More Dedupe Questions...
Kyle McDonald wrote: Hi Darren, More below... Darren J Moffat wrote: Tristan Ball wrote: Obviously sending it deduped is more efficient in terms of bandwidth and CPU time on the recv side, but it may also be more complicated to achieve? A stream can be deduped even if the on disk format isn't and vice versa. Is the send dedup'ing more efficient if the filesystem is already depdup'd? If both are enabled do they share anything? -Kyle At this time, no. But very shortly we hope to tie the two together better to make use of the existing checksums and duplication info available in the on-disk and in-kernel structures. Lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup issue
On Nov 3, 2009, at 6:01 AM, Jürgen Keil wrote:

>> I think I'm observing the same (with changeset 10936) ...
>
> # mkfile 2g /var/tmp/tank.img
> # zpool create tank /var/tmp/tank.img
> # zfs set dedup=on tank
> # zfs create tank/foobar

This has to do with the fact that dedup space accounting is charged to all filesystems, regardless of whether blocks are deduped. To do otherwise is impossible, as there is no true "owner" of a block, and the fact that it may or may not be deduped is often beyond the control of a single filesystem.

This has some interesting pathologies as the pool gets full. Namely, that ZFS will artificially enforce a limit on the logical size of the pool based on non-deduped data. This is obviously something that should be addressed.

- Eric

> dd if=/dev/urandom of=/tank/foobar/file1 bs=1024k count=512
> 512+0 records in
> 512+0 records out
>
> cp /tank/foobar/file1 /tank/foobar/file2
> cp /tank/foobar/file1 /tank/foobar/file3
> cp /tank/foobar/file1 /tank/foobar/file4
> /tank/foobar/file4: No space left on device
>
> zfs list -r tank
> NAME          USED  AVAIL  REFER  MOUNTPOINT
> tank         1.95G      0    22K  /tank
> tank/foobar  1.95G      0  1.95G  /tank/foobar
>
> zpool list tank
> NAME   SIZE  USED  AVAIL   CAP  DEDUP  HEALTH  ALTROOT
> tank  1.98G  515M  1.48G   25%  3.90x  ONLINE  -
>
> --
> This message posted from opensolaris.org

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
On Mon, November 2, 2009 20:23, Marion Hakanson wrote:
> You'll need to give the same "dd" treatment to the end of the disk as well;
> ZFS puts copies of its labels at the beginning and at the end.

Does anybody else see this as rather troubling?

Obviously it's dangerous to get in the habit of doing this as a routine operation, which to read advice here is how people are thinking of it. It seems to me that something in ZFS's protective procedures is missing or astray or over-active -- being protective is good, but there needs to be a way to re-use a disk that's been used before, too. And frequently people are at a loss to even understand what the possible conflict might be.

Maybe a doubling of the -f option should give as full an explanation as possible of what the evidence shows as previous use, and then let you override it if you really really insist? Or some other option? Or an entirely separate utility (or script)?

What I basically want, I think, is a standard way to get an explanation of exactly what ZFS thinks the conflict in my new proposed use of a disk might be -- and then a standard and as-safe-as-possible way to tell it to go ahead and use the disk.

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
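[Editorial note: part of what David asks for does exist today. zdb will dump whatever ZFS labels it finds on a device, including the pool name and GUID that zpool is objecting to; the device name below is just an example.]

# zdb -l /dev/rdsk/c12t1d0s0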
[zfs-discuss] ZFS dedup accounting
Hi Eric and all,

Eric Schrock wrote:
> On Nov 3, 2009, at 6:01 AM, Jürgen Keil wrote:
>>> I think I'm observing the same (with changeset 10936) ...
>>
>> # mkfile 2g /var/tmp/tank.img
>> # zpool create tank /var/tmp/tank.img
>> # zfs set dedup=on tank
>> # zfs create tank/foobar
>
> This has to do with the fact that dedup space accounting is charged to all filesystems, regardless of whether blocks are deduped. To do otherwise is impossible, as there is no true "owner" of a block

It would be great if someone could explain why it is hard (impossible? not a good idea?) to account all data sets for at least one reference to each dedup'ed block and add this space to the total free space?

> This has some interesting pathologies as the pool gets full. Namely, that ZFS will artificially enforce a limit on the logical size of the pool based on non-deduped data. This is obviously something that should be addressed.

Would the idea I mentioned not address this issue as well?

Thanks, Nils
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting
Hi, It looks interesting problem. Would it help if as ZFS detects dedup blocks, it can start increasing effective size of pool. It will create an anomaly with respect to total disk space, but it will still be accurate from each file system usage point of view. Basically, dedup is at block level, so space freed can effectively be accounted as extra free blocks added to pool. Just a thought. Regards, Anurag. On Tue, Nov 3, 2009 at 9:39 PM, Nils Goroll wrote: > Hi Eric and all, > > Eric Schrock wrote: > >> >> On Nov 3, 2009, at 6:01 AM, Jürgen Keil wrote: >> >> I think I'm observing the same (with changeset 10936) ... >>> >>> # mkfile 2g /var/tmp/tank.img >>> # zpool create tank /var/tmp/tank.img >>> # zfs set dedup=on tank >>> # zfs create tank/foobar >>> >> >> This has to do with the fact that dedup space accounting is charged to all >> filesystems, regardless of whether blocks are deduped. To do otherwise is >> impossible, as there is no true "owner" of a block >> > > It would be great if someone could explain why it is hard (impossible? not > a > good idea?) to account all data sets for at least one reference to each > dedup'ed > block and add this space to the total free space? > > This has some interesting pathologies as the pool gets full. Namely, that >> ZFS will artificially enforce a limit on the logical size of the pool based >> on non-deduped data. This is obviously something that should be addressed. >> > > Would the idea I mentioned not address this issue as well? > > Thanks, Nils > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Anurag Agarwal CEO, Founder KQ Infotech, Pune www.kqinfotech.com 9881254401 Coordinator Akshar Bharati www.aksharbharati.org Spreading joy through reading ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting
Well, then you could have more "logical space" than "physical space", and that would be extremely cool, but what happens if for some reason you wanted to turn off dedup on one of the filesystems? It might exhaust all the pool's space to do this. I think good idea would be another pool's/filesystem's property, that when turned on, would allow allocating more "logical data" than pool's capacity, but then you would accept risks that involve it. Then administrator could decide which is better for his system. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
Hi David, This RFE is filed for this feature: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6893282 Allow the zpool command to wipe labels from disks Cindy On 11/03/09 09:00, David Dyer-Bennet wrote: On Mon, November 2, 2009 20:23, Marion Hakanson wrote: You'll need to give the same "dd" treatment to the end of the disk as well; ZFS puts copies of its labels at the beginning and at the end. Does anybody else see this as rather troubling? Obviously it's dangerous to get in the habit of doing this as a routine operation, which to read advice here is how people are thinking of it. It seems to me that something in ZFS's protective procedures is missing or astray or over-active -- being protective is good, but there needs to be a way to re-use a disk that's been used before, too. And frequently people are at a loss to even understand what the possible conflict might be. Maybe a doubling of the -f option should give as full an explanation as possible of what the evidence shows as previous use, and then let you override it if you really really insist? Or some other option? Or an entirely separate utility (or script)? What I basically want, I think, is a standard way to get an explanation of exactly what ZFS thinks the conflict in my new proposed use of a disk might be -- and then a standard and as-safe-as-possible way to tell it to go ahead and use the disk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup issue
Cyril Plisko wrote:
>>> I think I'm observing the same (with changeset 10936) ...
>>>
>>> # mkfile 2g /var/tmp/tank.img
>>> # zpool create tank /var/tmp/tank.img
>>> # zfs set dedup=on tank
>>> # zfs create tank/foobar
>>
>> This has to do with the fact that dedup space accounting is charged to all filesystems, regardless of whether blocks are deduped. To do otherwise is impossible, as there is no true "owner" of a block, and the fact that it may or may not be deduped is often beyond the control of a single filesystem.
>>
>> This has some interesting pathologies as the pool gets full. Namely, that ZFS will artificially enforce a limit on the logical size of the pool based on non-deduped data. This is obviously something that should be addressed.
>
> Eric,
>
> Many people (me included) perceive deduplication as a mean to save disk space and allow more data to be squeezed into a storage. What you are saying is that effectively ZFS dedup does a wonderful job in detecting duplicate blocks and goes into all the trouble of removing an extra copies and keep accounting of everything. However, when it comes to letting me use the freed space I will be plainly denied to do so. If that so, what would be the reason to use ZFS deduplication at all ?

c'mon it is obviously a bug and not a design feature. (it is I hope/think that is the case)
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] marvell88sx2 driver build126
On Mon, Nov 2, 2009 at 6:34 AM, Orvar Korvar wrote: > I have the same card and might have seen the same problem. Yesterday I > upgraded to b126 and started to migrate all my data to 8 disc raidz2 > connected to such a card. And suddenly ZFS reported checksum errors. I > thought the drives were faulty. But you suggest the problem could have been > the driver? I also noticed that one of the drives had resilvered a small > amount, just like yours. > > I now use b125 and there are no checksum errors. So, is there a bug in the > new b126 driver? > Can any of you Sun folks comment on this? --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting
On Tue, November 3, 2009 10:32, Bartlomiej Pelc wrote: > Well, then you could have more "logical space" than "physical space", and > that would be extremely cool, but what happens if for some reason you > wanted to turn off dedup on one of the filesystems? It might exhaust all > the pool's space to do this. I think good idea would be another > pool's/filesystem's property, that when turned on, would allow allocating > more "logical data" than pool's capacity, but then you would accept risks > that involve it. Then administrator could decide which is better for his > system. Compression has the same issues; how is that handled? (Well, except that compression is limited to the filesystem, it doesn't have cross-filesystem interactions.) They ought to behave the same with regard to reservations and quotas unless there is a very good reason for a difference. Generally speaking, I don't find "but what if you turned off dedupe?" to be a very important question. Or rather, I consider it such an important question that I'd have to consider it very carefully in light of the particular characteristics of a particular pool; no GENERAL answer is going to be generally right. Reserving physical space for blocks not currently stored seems like the wrong choice; it violates my expectations, and goes against the purpose of dedupe, which as I understand it is to save space so you can use it for other things. It's obvious to me that changing the dedupe setting (or the compression setting) would have consequences on space use, and it seems natural that I as the sysadmin am on the hook for those consequences. (I'd expect to find in the documentation explanations of what things I need to consider and how to find the detailed data to make a rational decision in any particular case.) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs mount error
Hello A customer recently had a power outage. Prior to the outage, they did a graceful shutdown of their system. On power-up, the system is not coming up due to zfs errors as follows: cannot mount 'rpool/export': Number of symbolic links encountered during path name traversal exceeds MAXSYMLINKS mount '/export/home': failed to create mountpoint. The possible cause of this might be that a symlink is created pointing to itself since the customer stated that they created lots of symlink to get their env ready. However, since /export is not getting mounted, they can not go back and delete/fix the symlinks. Can someone suggest a way to fix this issue? Thanks Ramin Moazeni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
On 11/3/2009 3:49 PM, Marion Hakanson wrote:
> If the disk is going to be part of whole-disk zpool, I like to make sure there is not an old VTOC-style partition table on there. That can be done either via some "format -e" commands, or with "fdisk -E", to put an EFI label on there.

unfortunately, fdisk won't help me at all:

# fdisk -E /dev/rdsk/c12t1d0p0
# zpool create -f testp c12t1d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c3t11d0s0 is part of active ZFS pool dbzpool. Please see zpool(1M).

and i can't find anything in format that lets me do anything:

# format -e c12t1d0
selecting c12t1d0
[disk formatted]
/dev/dsk/c3t11d0s0 is part of active ZFS pool dbzpool. Please see zpool(1M).
[...]
format> label
Cannot label disk when partitions are in use as described.

I wonder if getting my hands on a pre-sol10 x86 format binary would help...

--
Jeremy Kister
http://jeremy.kister.net./
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup issue
On Nov 3, 2009, at 12:24 PM, Cyril Plisko wrote: I think I'm observing the same (with changeset 10936) ... # mkfile 2g /var/tmp/tank.img # zpool create tank /var/tmp/tank.img # zfs set dedup=on tank # zfs create tank/foobar This has to do with the fact that dedup space accounting is charged to all filesystems, regardless of whether blocks are deduped. To do otherwise is impossible, as there is no true "owner" of a block, and the fact that it may or may not be deduped is often beyond the control of a single filesystem. This has some interesting pathologies as the pool gets full. Namely, that ZFS will artificially enforce a limit on the logical size of the pool based on non-deduped data. This is obviously something that should be addressed. Eric, Many people (me included) perceive deduplication as a mean to save disk space and allow more data to be squeezed into a storage. What you are saying is that effectively ZFS dedup does a wonderful job in detecting duplicate blocks and goes into all the trouble of removing an extra copies and keep accounting of everything. However, when it comes to letting me use the freed space I will be plainly denied to do so. If that so, what would be the reason to use ZFS deduplication at all ? Please read my response before you respond. What do you think "this is obviously something that should be addressed" means? There is already a CR filed and the ZFS team is working on it. - Eric -- Regards, Cyril -- Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
>I said:
>> You'll need to give the same "dd" treatment to the end of the disk as well;
>> ZFS puts copies of its labels at the beginning and at the end.

Oh, and zfs...@jeremykister.com said:
> im not sure what you mean here - I thought p0 was the entire disk in x86 -
> and s2 was the whole disk in the partition. what else should i overwrite?

Sorry, yes, you did get the whole slice overwritten. Most people just add a "count=10" or something similar, to overwrite the beginning of the drive, but your invocation would overwrite the whole thing.

If the disk is going to be part of whole-disk zpool, I like to make sure there is not an old VTOC-style partition table on there. That can be done either via some "format -e" commands, or with "fdisk -E", to put an EFI label on there.

Anyway, I agree with the desire for "zpool" to be able to do this itself, with less possibility of human error in partitioning, etc. Glad to hear there's already an RFE filed for it.

Regards, Marion
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
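[Editorial note: a sketch of the full treatment Marion describes, clearing the labels at both ends of the device. The device name and block count are examples only and must be taken from your own disk, e.g. from prtvtoc or format's verify output; the arithmetic needs ksh or bash.]

DEV=/dev/rdsk/c12t1d0p0
DEVBLKS=143374000                 # total 512-byte blocks on this disk (example value)
# ZFS keeps two labels at the front and two at the back of the device
dd if=/dev/zero of=$DEV bs=512 count=2048
dd if=/dev/zero of=$DEV bs=512 seek=$((DEVBLKS - 2048)) count=2048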
Re: [zfs-discuss] ZFS dedup accounting & reservations
Hi Cyril,

>> But: Isn't there an implicit expectation for a space guarantee associated with a
>> dataset? In other words, if a dataset has 1GB of data, isn't it natural to
>> expect to be able to overwrite that space with other data? One
>
> I'd say that expectation is not [always] valid. Assume you have a dataset of 1GB of data and the pool free space is 200 MB. You are cloning that dataset and trying to overwrite the data on the cloned dataset. You will hit "no more space left on device" pretty soon. Wonders of virtualization :)

The point I wanted to make is that by defining a (ref)reservation for that clone, ZFS won't even create it if space does not suffice:

r...@haggis:~# zpool list
NAME   SIZE  USED  AVAIL   CAP  HEALTH  ALTROOT
rpool  416G  187G   229G   44%  ONLINE  -
r...@haggis:~# zfs clone -o refreservation=230g rpool/export/home/slink/t...@zfs-auto-snap:frequent-2009-11-03-22:04:46 rpool/test
cannot create 'rpool/test': out of space

I don't see how a similar guarantee could be given with de-dup.

Nils
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] More Dedupe Questions...
Kyle McDonald wrote: Hi Darren, More below... Darren J Moffat wrote: Tristan Ball wrote: Obviously sending it deduped is more efficient in terms of bandwidth and CPU time on the recv side, but it may also be more complicated to achieve? A stream can be deduped even if the on disk format isn't and vice versa. Is the send dedup'ing more efficient if the filesystem is already depdup'd? If both are enabled do they share anything? ZFS send deduplication is still in development so I'd rather let the engineers working on it say what they are doing if they wish to. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe is in
Trevor Pretty wrote:
> Darren J Moffat wrote:
>> Orvar Korvar wrote:
>>> I was under the impression that you can create a new zfs dataset and turn on the dedup functionality, and copy your data to it. Or am I wrong?
>>
>> you don't even have to create a new dataset just do:
>>
>> # zfs set dedup=on
>
> But like all ZFS functions will that not only get applied, when you (re)write (old)new data, like compression=on ?

Correct, but if you are creating a new dataset you are writing new data anyway.

> Which leads to the question would a scrub activate dedupe?

Not at this time.

--
Darren J Moffat
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
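[Editorial note: since nothing rewrites existing blocks for you, one way to get already-written data dedup'ed is to rewrite it yourself, for example by sending it into a new dataset. A sketch with assumed names; it needs enough free pool space for the copy.]

zfs set dedup=on tank
zfs snapshot tank/data@prededup
zfs send tank/data@prededup | zfs recv tank/data_dedup   # received blocks are written fresh, so they pass through dedup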
[zfs-discuss] Where is green-bytes dedup code?
Green-bytes is publicly selling their hardware and dedup solution today. From the feedback of others, and from testing by someone on our team, we've found the quality of the initial putback to be buggy and not even close to production ready. (That's fine since nobody has stated it was production ready.)

It brings up the question though of where the green-bytes code is. They are obligated under the CDDL to release their changes *unless* they privately bought a license from Sun. It seems the conflicts from the lawsuit may or may not be resolved, but still.. Where's the code?
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting & reservations
> Well, then you could have more "logical space" than "physical space"

Reconsidering my own question again, it seems to me that the question of space management is probably more fundamental than I had initially thought, and I assume members of the core team will have thought through much of it. I will try to share my thoughts and I would very much appreciate any corrections or additional explanations.

For dedup, my understanding at this point is that, first of all, every reference to dedup'ed data must be accounted to the respective dataset. Obviously, a decision has been made to account that space as "used", rather than "referenced". I am trying to understand why.

At first sight, referring to the definition of "used" space as being unique to the respective dataset, it would seem natural to account all de-duped space as "referenced". But this could lead to much space never being accounted as "used" anywhere (but for the pool). This would differ from the observed behavior of non-deduped datasets, where, to my understanding, all "referred" space is "used" by some other dataset. Despite being a little counter-intuitive, at first I found this simple solution quite attractive, because it wouldn't alter the semantics of used vs. referenced space (under the assumption that my understanding is correct).

My understanding from Eric's explanation is that it has been decided to go an alternative route and account all de-duped space as "used" to all datasets referencing it because, in contrast to snapshots/clones, it is impossible (?) to differentiate between used and referred space for de-dup. Also, at first sight, this seems to be a way to keep the current semantics for (ref)reservations. But while without de-dup all the usedsnap and usedds values should roughly sum up to the pool used space, they can't with this concept - which is why I thought a solution could be to compensate for multiply accounted "used" space by artificially increasing the pool size.

Instead, from the examples given here, what seems to have been implemented with de-dup is to simply maintain space statistics for the pool on the basis of actually used space. While one might find it counter-intuitive that the used sizes of all datasets/snapshots will exceed the pool used size with de-dup, if my understanding is correct, this design seems to be consistent. I am very interested in the reasons why this particular approach has been chosen and why others have been dropped.

Now to the more general question: If all datasets of a pool contained the same data and got de-duped, the sums of their "used" space still seem to be limited by the "logical" pool size, as we've seen in examples given by Jürgen and others, and, to get a benefit of de-dup, this implementation obviously needs to be changed.

But: Isn't there an implicit expectation for a space guarantee associated with a dataset? In other words, if a dataset has 1GB of data, isn't it natural to expect to be able to overwrite that space with other data? One might want to define space guarantees (like with (ref)reservation), but I don't see how those should work with the currently implemented concept. Do we need something like a de-dup-reservation, which is subtracted from the pool free space?

Thank you for reading, Nils
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting & reservations
> No point in trying to preserve a naive mental model that simply can't stand up to reality.

I kind of dislike the idea to talk about naiveness here. Being able to give guarantees (in this case: reserve space) can be vital for running critical business applications. Think about the analogy in memory management (proper swap space reservation vs. the oom-killer).

But I realize that talking about an "implicit expectation" to give some motivation for reservations probably led to some misunderstanding.

Sorry, Nils
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting & reservations
> But: Isn't there an implicit expectation for a space guarantee associated with a
> dataset? In other words, if a dataset has 1GB of data, isn't it natural to
> expect to be able to overwrite that space with other data?

Is there such a space guarantee for compressed or cloned zfs?

--
This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount error
On Mon, Nov 2, 2009 at 1:34 PM, Ramin Moazeni wrote: > Hello > > A customer recently had a power outage. Prior to the outage, they did a > graceful shutdown of their system. > On power-up, the system is not coming up due to zfs errors as follows: > cannot mount 'rpool/export': Number of symbolic links encountered during > path name traversal exceeds MAXSYMLINKS > mount '/export/home': failed to create mountpoint. > > The possible cause of this might be that a symlink is created pointing to > itself since the customer stated > that they created lots of symlink to get their env ready. However, since > /export is not getting mounted, they > can not go back and delete/fix the symlinks. > > Can someone suggest a way to fix this issue? > > Thanks > Ramin Moazeni > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > I see these very frequently on my systems, regardless of a clean shutdown or not, 1/3 of the time filesystems cannot mount. What I do, is boot into single user mode, make sure the filesystem in question is NOT mounted, and just delete the directory that its trying to mount into. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS dedup vs compression vs ZFS user/group quotas
We recently found that the ZFS user/group quota accounting for disk-usage worked "opposite" to what we were expecting. Ie, any space saved from compression was a benefit to the customer, not to us. (We expected the Google style: Give a customer 2GB quota, and if compression saves space, that is profit to us) Is the space saved with dedup charged in the same manner? I would expect so, I figured some of you would just know. I will check when b128 is out. I don't suppose I can change the model? :) Lund -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
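[Editorial note: one way to see exactly what is charged per user today; the dataset and user names below are examples. Whether dedup savings end up being passed to the user the same way compression savings are was exactly Jorgen's open question.]

# zfs userspace tank/customers
# zfs get userused@alice tank/customers
# zfs get userquota@alice tank/customers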
Re: [zfs-discuss] zfs mount error
Ramin

I don't know but.. Is the error not from mount, and it's /export/home that can't be created?

  "mount '/export/home': failed to create mountpoint."

Have you tried mounting 'rpool/export' somewhere else, like /mnt?

Ramin Moazeni wrote:
> Hello
>
> A customer recently had a power outage. Prior to the outage, they did a graceful shutdown of their system. On power-up, the system is not coming up due to zfs errors as follows:
> cannot mount 'rpool/export': Number of symbolic links encountered during path name traversal exceeds MAXSYMLINKS
> mount '/export/home': failed to create mountpoint.
>
> The possible cause of this might be that a symlink is created pointing to itself since the customer stated that they created lots of symlink to get their env ready. However, since /export is not getting mounted, they can not go back and delete/fix the symlinks.
>
> Can someone suggest a way to fix this issue?
>
> Thanks
> Ramin Moazeni

www.eagle.co.nz
This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
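[Editorial note: a sketch of Trevor's suggestion, run from single-user mode (or after importing the pool from install media). The dataset and paths are taken from the error messages; the clean-up step itself is left to the admin.]

zfs set mountpoint=/mnt rpool/export
zfs mount rpool/export
# ...find and remove the self-referencing symlinks under /mnt, then put it back...
zfs set mountpoint=/export rpool/export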
Re: [zfs-discuss] ZFS dedup accounting & reservations
On Tue, Nov 3, 2009 at 10:54 PM, Nils Goroll wrote: > Now to the more general question: If all datasets of a pool contained the > same data and got de-duped, the sums of their "used" space still seems to be > limited by the "locical" pool size, as we've seen in examples given by > Jürgen and others and, to get a benefit of de-dup, this implementation > obviously needs to be changed. Agreed. > > But: Isn't there an implicit expectation for a space guarantee associated > with a dataset? In other words, if a dataset has 1GB of data, isn't it > natural to expect to be able to overwrite that space with other data? One I'd say that expectation is not [always] valid. Assume you have a dataset of 1GB of data and the pool free space is 200 MB. You are cloning that dataset and trying to overwrite the data on the cloned dataset. You will hit "no more space left on device" pretty soon. Wonders of virtualization :) -- Regards, Cyril ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs on multiple machines
Miha

If you do want multi-reader, multi-writer block access (and not use iSCSI) then QFS is what you want.
http://www.sun.com/storage/management_software/data_management/qfs/features.xml

You can use ZFS pools as lumps of disk under SAM-QFS:-
https://blogs.communication.utexas.edu/groups/techteam/weblog/5e700/

I successfully mocked this up on VirtualBox on my laptop for a customer.

Trevor

Darren J Moffat wrote:
> Miha Voncina wrote:
>> Hi, is it possible to link multiple machines into one storage pool using zfs?
>
> Depends what you mean by this. Multiple machines can not import the same ZFS pool at the same time, doing so *will* cause corruption and ZFS tries hard to protect against multiple imports. However ZFS can use iSCSI LUNs from multiple target machines for its disks that make up a given pool. ZFS volumes (ZVOLS) can also be used as iSCSI targets and thus shared out to multiple machines. ZFS file systems can be shared over NFS and CIFS and thus shared by multiple machines. ZFS pools can be used in a Sun Cluster configuration but will only be imported into a single node of a Sun Cluster configuration at a time.
>
> --
> Darren J Moffat

www.eagle.co.nz
This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup accounting & reservations
On Tue, November 3, 2009 16:36, Nils Goroll wrote: > > No point in trying to preserve a naive mental model that >> simply can't stand up to reality. > > I kind of dislike the idea to talk about naiveness here. Maybe it was a poor choice of words; I mean something more along the lines of "simplistic". The point is, "space" is no longer as simple a concept as it was 40 years ago. Even without deduplication, there is the possibility of clones and compression causing things not to behave the same way a simple filesystem on a hard drive did long ago. > Being able to give guarantees (in this case: reserve space) can be vital > for > running critical business applications. Think about the analogy in memory > management (proper swap space reservation vs. the oom-killer). In my experience, systems that run on the edge of their resources and depend on guarantees to make them work have endless problems, whereas if they are not running on the edge of their resources, they work fine regardless of guarantees. For a very few kinds of embedded systems I can see the need to work to the edges (aircraft flight systems, for example), but that's not something you do in a general-purpose computer with a general-purpose OS. > But I realize that talking about an "implicit expectation" to give some > motivation for reservations probably lead to some misunderstanding. > > Sorry, Nils There's plenty of real stuff worth discussing around this issue, and I apologize for choosing a belittling term to express disagreement. I hope it doesn't derail the discussion. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup question
On 11/ 2/09 07:42 PM, Craig S. Bell wrote: I just stumbled across a clever visual representation of deduplication: http://loveallthis.tumblr.com/post/166124704 It's a flowchart of the lyrics to "Hey Jude". =-) Nothing is compressed, so you can still read all of the words. Instead, all of the duplicates have been folded together. -cheers, CSB This should reference the prior (April 1, 1984) research by Donald Knuth at http://www.cs.utexas.edu/users/arvindn/misc/knuth_song_complexity.pdf :-) Jeff -- Jeff Savit Principal Field Technologist Sun Microsystems, Inc.Phone: 732-537-3451 (x63451) 2398 E Camelback Rd Email: jeff.sa...@sun.com Phoenix, AZ 85016http://blogs.sun.com/jsavit/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Where is green-bytes dedup code?
On Tuesday, November 3, 2009, "C. Bergström" wrote: > Green-bytes is publicly selling their hardware and dedup solution today. > From the feedback of others with testing from someone on our team we've found > the quality of the initial putback to be buggy and not even close to > production ready. (That's fine since nobody has stated it was production > ready) > > It brings up the question though of where is the green-bytes code? They are > obligated under the CDDL to release their changes *unless* they privately > bought a license from Sun. It seems the conflicts from the lawsuit may or > may not be resolved, but still.. > > Where's the code? I highly doubt you're going to get any commentary from sun engineers on pending litigation. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] MPxIO and removing physical devices
I am a bit of a Solaris newbie. I have a brand spankin' new Solaris 10u8 machine (x4250) that is running an attached J4400 and some internal drives. We're using multipathed SAS I/O (enabled via stmsboot), so the device mount points have been moved from their "normal" c0t5d0 to long strings -- in the case of c0t5d0, it's now /dev/rdsk/c6t5000CCA00A274EDCd0. (I can see the cross-referenced devices with stmsboot -L.)

Normally, when replacing a disk on a Solaris system, I would run cfgadm -c unconfigure c0::dsk/c0t5d0. However, cfgadm -l does not list c6, nor does it list any disks. In fact, running cfgadm against the places where I think things are supposed to live gets me the following:

bash# cfgadm -l /dev/rdsk/c0t5d0
Ap_Id                           Type    Receptacle  Occupant  Condition
/dev/rdsk/c0t5d0: No matching library found

bash# cfgadm -l /dev/rdsk/c6t5000CCA00A274EDCd0
cfgadm: Attachment point not found

bash# cfgadm -l /dev/dsk/c6t5000CCA00A274EDCd0
Ap_Id                           Type    Receptacle  Occupant  Condition
/dev/dsk/c6t5000CCA00A274EDCd0: No matching library found

bash# cfgadm -l c6t5000CCA00A274EDCd0
Ap_Id                           Type    Receptacle  Occupant  Condition
c6t5000CCA00A274EDCd0: No matching library found

I ran devfsadm -C -v and it removed all of the old attachment points for the /dev/dsk/c0t5d0 devices and created some for the c6 devices. Running cfgadm -al shows a c0, c4, and c5 -- these correspond to the actual controllers, but no devices are attached to the controllers.

I found an old email on this list about MPxIO that said the solution was basically to yank the physical device after making sure that no I/O was happening to it. While this worked and allowed us to return the device to service as a spare in the zpool it inhabits, more concerning was what happened when we ran mpathadm list lu after yanking the device and returning it to service:

bash# mpathadm list lu
        /dev/rdsk/c6t5000CCA00A2A9398d0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A29EE2Cd0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A2BDBFCd0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A2A8E68d0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A0537ECd0s2
                Total Path Count: 1
                Operational Path Count: 1
mpathadm: Error: Unable to get configuration information.
mpathadm: Unable to complete operation

(Side note: Some of the disks are single path via an internal controller, and some of them are multi path in the J4400 via two external controllers.)

A reboot fixed the 'issue' with mpathadm and it now outputs complete data.

So -- how do I administer and remove physical devices that are in multipath-managed controllers on Solaris 10u8 without breaking multipath and causing configuration changes that interfere with the services and devices attached via mpathadm and the other voodoo and black magic inside? I can't seem to find this documented anywhere, even though the instructions to enable multipathing with stmsboot -e were quite complete and worked well!

Thanks, Karl Katzke

-- Karl Katzke Systems Analyst II TAMU - RGS
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
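[Not an answer to the cfgadm question, but a hedged sketch of the ZFS side of a disk swap: take the disk out of the pool before pulling it and tell ZFS about the replacement afterwards. The pool name is made up; the device name is copied from the post above, and with MPxIO the replacement disk may appear under a different WWN-based name.]

# check which vdev the disk belongs to and its current state
zpool status tank
# take the disk offline so no I/O is in flight when it is pulled
zpool offline tank c6t5000CCA00A274EDCd0
# ... physically swap the drive ...
# if the new disk enumerates under the same name, replace it in place;
# if it shows up under a new WWN-based name, pass both old and new names
zpool replace tank c6t5000CCA00A274EDCd0
# watch the resilver progress
zpool status tank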
Re: [zfs-discuss] ZFS dedup accounting
> Well, then you could have more "logical space" than "physical space",
> and that would be extremely cool,

I think we already have that, with zfs clones. I often clone a zfs onnv workspace, and everything is "deduped" between the zfs parent snapshot and the clone filesystem. The clone (initially) needs no extra zpool space. And with a zfs clone I can actually use all the remaining free space from the zpool. With zfs deduped blocks, I can't ...

> but what happens if for some reason you wanted to turn off dedup on one
> of the filesystems? It might exhaust all the pool's space to do this.

As far as I understand it, nothing happens to existing deduped blocks when you turn off dedup for a zfs filesystem. The new dedup=off setting affects newly written blocks only.

-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
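[A hedged sketch of the clone-based sharing described above; the dataset names are made up. The clone initially references the same blocks as the snapshot, so it costs almost no pool space until it diverges.]

# snapshot the shared workspace and clone it for a private copy
zfs snapshot tank/ws/onnv-gate@today
zfs clone tank/ws/onnv-gate@today tank/ws/bugfix-1234
# "used" on the clone starts near zero; "refer" and "origin" show the shared blocks
zfs list -o name,used,refer,origin tank/ws/onnv-gate tank/ws/bugfix-1234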
Re: [zfs-discuss] ZFS dedup issue
I'm fairly new to all this and I think that is the intended behavior. Also, from my limited understanding of dedup behavior, I believe it would significantly cut down on access times. For the most part, though, this is such new code that I would wait a bit to see where they take it.

On Tue, Nov 3, 2009 at 3:24 PM, Cyril Plisko wrote:
> >>> I think I'm observing the same (with changeset 10936) ...
> >>
> >> # mkfile 2g /var/tmp/tank.img
> >> # zpool create tank /var/tmp/tank.img
> >> # zfs set dedup=on tank
> >> # zfs create tank/foobar
> >
> > This has to do with the fact that dedup space accounting is charged to all
> > filesystems, regardless of whether blocks are deduped. To do otherwise is
> > impossible, as there is no true "owner" of a block, and the fact that it may
> > or may not be deduped is often beyond the control of a single filesystem.
> >
> > This has some interesting pathologies as the pool gets full. Namely, that
> > ZFS will artificially enforce a limit on the logical size of the pool based
> > on non-deduped data. This is obviously something that should be addressed.
> >
> Eric,
>
> Many people (me included) perceive deduplication as a mean to save
> disk space and allow more data to be squeezed into a storage. What you
> are saying is that effectively ZFS dedup does a wonderful job in
> detecting duplicate blocks and goes into all the trouble of removing
> an extra copies and keep accounting of everything. However, when it
> comes to letting me use the freed space I will be plainly denied to do
> so. If that so, what would be the reason to use ZFS deduplication at
> all ?
>
> --
> Regards,
> Cyril
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS dedup issue
>>> I think I'm observing the same (with changeset 10936) ...
>>
>> # mkfile 2g /var/tmp/tank.img
>> # zpool create tank /var/tmp/tank.img
>> # zfs set dedup=on tank
>> # zfs create tank/foobar
>
> This has to do with the fact that dedup space accounting is charged to all
> filesystems, regardless of whether blocks are deduped. To do otherwise is
> impossible, as there is no true "owner" of a block, and the fact that it may
> or may not be deduped is often beyond the control of a single filesystem.
>
> This has some interesting pathologies as the pool gets full. Namely, that
> ZFS will artificially enforce a limit on the logical size of the pool based
> on non-deduped data. This is obviously something that should be addressed.
>

Eric,

Many people (me included) perceive deduplication as a means to save disk space and allow more data to be squeezed into storage. What you are saying is that effectively ZFS dedup does a wonderful job of detecting duplicate blocks and goes to all the trouble of removing the extra copies and keeping accounting of everything. However, when it comes to letting me use the freed space, I will be plainly denied the ability to do so. If that is so, what would be the reason to use ZFS deduplication at all?

-- Regards, Cyril
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
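[A hedged sketch of how the mismatch shows up in practice; the pool name is made up. The pool-level dedup ratio climbs as duplicates are folded together, but per-dataset accounting still charges the logical size, so a dataset can report "no space" while the pool has physical blocks to spare.]

# pool-wide view: the DEDUP column / dedupratio property show the savings
zpool list tank
zpool get dedupratio tank
# per-dataset view: "used" and "avail" are still charged at the logical (un-deduped) size
zfs list -o name,used,avail,refer -r tank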
Re: [zfs-discuss] ZFS dedup accounting & reservations
On Tue, November 3, 2009 15:06, Cyril Plisko wrote:
> On Tue, Nov 3, 2009 at 10:54 PM, Nils Goroll wrote:
>> But: Isn't there an implicit expectation for a space guarantee associated
>> with a dataset? In other words, if a dataset has 1GB of data, isn't it
>> natural to expect to be able to overwrite that space with other data? One
>
> I'd say that expectation is not [always] valid. Assume you have a
> dataset of 1GB of data and the pool free space is 200 MB. You are
> cloning that dataset and trying to overwrite the data on the cloned
> dataset. You will hit "no more space left on device" pretty soon.
> Wonders of virtualization :)

Yes, and the same is potentially true with compression as well; if the old data blocks are actually deleted and freed up (meaning no snapshots or other things keeping them around), the new data still may not fit in those blocks due to differing compression based on what the data actually is. So that's an assumption we're just going to have to get over making in general. No point in trying to preserve a naive mental model that simply can't stand up to reality.

-- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
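[A hedged illustration of the compression point above: the same logical amount of data can occupy very different physical space depending on how well it compresses. Pool and dataset names are made up.]

# create a compressed dataset and write equal amounts of very compressible and incompressible data
zfs create -o compression=on tank/demo
dd if=/dev/zero of=/tank/demo/zeros bs=1024k count=512      # compresses extremely well
dd if=/dev/urandom of=/tank/demo/random bs=1024k count=512  # barely compresses at all
# compare logical vs. physical outcome
zfs get compressratio tank/demo
zfs list -o name,used,refer tank/demo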
[zfs-discuss] virsh troubling zfs!?
Hi and hello, I have a problem that is confusing me. I hope someone can help me with it. I followed a "best practice" - I think - using dedicated zfs filesystems for my virtual machines. Commands (for completeness):

zfs create rpool/vms
zfs create rpool/vms/vm1
zfs create -V 10G rpool/vms/vm1/vm1-dsk

The last command creates the file system /rpool/vms/vm1/vm1-dsk and the corresponding /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk. If I delete a VM I set up using this filesystem via virsh undefine vm1, the /rpool/vms/vm1/vm1-dsk also gets deleted, but the /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk is left. Without /rpool/vms/vm1/vm1-dsk I am not able to do zfs destroy rpool/vms/vm1/vm1-dsk, so the /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk cannot be destroyed "and will be left forever"!? How can I get rid of this problem?

-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
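[For what it's worth, a hedged sketch of how one would normally check for and remove a leftover volume. zfs destroy operates on the dataset name, not on any path under /rpool or /dev/zvol, and devfsadm can sweep stale /dev entries; this is a sketch, not a verified fix for the virsh interaction described above.]

# list any volume datasets still present under the VM hierarchy
zfs list -r -t volume rpool/vms
# if the volume dataset still exists, destroy it by its dataset name
zfs destroy rpool/vms/vm1/vm1-dsk
# or remove the whole per-VM subtree
zfs destroy -r rpool/vms/vm1
# if the dataset really is gone but /dev/zvol nodes remain, clean up stale device links
devfsadm -C -v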
Re: [zfs-discuss] ZFS dedup accounting & reservations
Hi David,

>>> simply can't stand up to reality.
>>
>> I kind of dislike the idea to talk about naiveness here.
>
> Maybe it was a poor choice of words; I mean something more along the lines
> of "simplistic". The point is, "space" is no longer as simple a concept as
> it was 40 years ago. Even without deduplication, there is the possibility
> of clones and compression causing things not to behave the same way a
> simple filesystem on a hard drive did long ago.

Thanks for emphasizing this again - I do absolutely agree that with today's technologies proper monitoring and proactive management are much more important than ever before. But, again, risks can be reduced.

>> Being able to give guarantees (in this case: reserve space) can be vital
>> for running critical business applications. Think about the analogy in
>> memory management (proper swap space reservation vs. the oom-killer).
>
> In my experience, systems that run on the edge of their resources and
> depend on guarantees to make them work have endless problems, whereas if
> they are not running on the edge of their resources, they work fine
> regardless of guarantees.

Agree. But what if things go wrong and a process eats up all your storage in error? If it's got its own dataset and you've used a reservation for your critical application on another dataset, you have a higher chance of surviving.

> There's plenty of real stuff worth discussing around this issue, and I
> apologize for choosing a belittling term to express disagreement. I hope
> it doesn't derail the discussion.

It certainly won't on my side. Thank you for the clarification.

Thanks, Nils
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
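[A hedged sketch of the kind of guarantee Nils describes: a reservation so a runaway writer in one dataset cannot eat the space a critical application depends on. Dataset names and sizes are made up.]

# give the critical application's dataset a hard space guarantee
zfs set reservation=10g tank/critical-app
# optionally cap the dataset a misbehaving process writes into
zfs set quota=50g tank/scratch
# review what is guaranteed and what is capped across the pool
zfs get reservation,refreservation,quota -r tank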
Re: [zfs-discuss] dedupe is in
Darren J Moffat wrote:
> Orvar Korvar wrote:
>> I was under the impression that you can create a new zfs dataset and turn on the dedup functionality, and copy your data to it. Or am I wrong?
>
> you don't even have to create a new dataset just do:
>
> # zfs set dedup=on

But like all ZFS functions, won't that only get applied when you (re)write old or new data, like compression=on? Which leads to the question: would a scrub activate dedupe?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
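[As far as I know, a scrub only reads and repairs blocks; it does not rewrite them, so it would not retroactively dedup existing data. To dedup what is already on disk, the data has to be rewritten through the normal write path. A hedged sketch, with made-up dataset names:]

# enable dedup where the rewritten copy will live
zfs set dedup=on tank
# one way to rewrite existing data so new blocks go through the dedup path:
# send it to a new dataset (and optionally swap names afterwards)
zfs snapshot tank/data@pre-dedup
zfs send tank/data@pre-dedup | zfs recv tank/data-deduped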
[zfs-discuss] ZFS non-zero checksum and permanent error with deleted file
Hello, I am actually using ZFS under FreeBSD, but maybe someone over here can help me anyway. I'd like some advice on whether I can still rely on one of my ZFS pools:

[u...@host ~]$ sudo zpool clear zpool01
...
[u...@host ~]$ sudo zpool scrub zpool01
...
[u...@host ~]$ sudo zpool status -v zpool01
  pool: zpool01
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool01     ONLINE       0     0     4
          raidz1    ONLINE       0     0     4
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad18    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zpool01:<0x3736a>

How can there be an error in a file that does not seem to exist? How can I clear / recover from the error? I have read the corresponding documentation and did the obligatory research, but so far, the only option I can see is a full destroy/create cycle - which seems like overkill, considering the pool size and the fact that there seems to be only one (deleted?) file involved.

[u...@host ~]$ df -h /mnt/zpool01/
Filesystem    Size    Used    Avail  Capacity  Mounted on
zpool01       1.3T    1.2T     133G       90%  /mnt/zpool01

[u...@host ~]$ uname -a
FreeBSD host.domain 7.2-RELEASE FreeBSD 7.2-RELEASE #0: Fri May 1 07:18:07 UTC 2009 r...@driscoll.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64

Cheers, ssc
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
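[For what it's worth, the zpool01:<0x3736a> form usually indicates the damaged object no longer has a resolvable path, e.g. the file was deleted. A hedged sketch of the usual sequence follows; note that the pasted status still says "scrub: none requested", which suggests it was captured before the scrub actually ran, and the permanent-error entry normally only drops out after a scrub (sometimes two) completes without hitting the bad block again.]

sudo zpool clear zpool01        # reset the error counters
sudo zpool scrub zpool01        # re-read and verify every block in the pool
sudo zpool status -v zpool01    # check again only after the scrub has finished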
Re: [zfs-discuss] ZFS dedup issue
Eric Schrock wrote: On Nov 3, 2009, at 12:24 PM, Cyril Plisko wrote: I think I'm observing the same (with changeset 10936) ... # mkfile 2g /var/tmp/tank.img # zpool create tank /var/tmp/tank.img # zfs set dedup=on tank # zfs create tank/foobar This has to do with the fact that dedup space accounting is charged to all filesystems, regardless of whether blocks are deduped. To do otherwise is impossible, as there is no true "owner" of a block, and the fact that it may or may not be deduped is often beyond the control of a single filesystem. This has some interesting pathologies as the pool gets full. Namely, that ZFS will artificially enforce a limit on the logical size of the pool based on non-deduped data. This is obviously something that should be addressed. Eric, Many people (me included) perceive deduplication as a mean to save disk space and allow more data to be squeezed into a storage. What you are saying is that effectively ZFS dedup does a wonderful job in detecting duplicate blocks and goes into all the trouble of removing an extra copies and keep accounting of everything. However, when it comes to letting me use the freed space I will be plainly denied to do so. If that so, what would be the reason to use ZFS deduplication at all ? Please read my response before you respond. What do you think "this is obviously something that should be addressed" means? There is already a CR filed and the ZFS team is working on it. We have a fix for this and it should be available in a couple of days. - George - Eric -- Regards, Cyril -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss