Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi,

Here is the result on a Dell Precision T5500 with 24 GB of RAM and two HDs in a mirror (SATA, 7200 rpm, NCQ).

[glehm...@marvin2 tmp]$ uname -a
SunOS marvin2 5.11 snv_117 i86pc i386 i86pc Solaris
[glehm...@marvin2 tmp]$ pfexec ./zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks
real    8m19,74s
user    0m6,47s
sys     0m25,32s

Doing second 'cpio -o > /dev/null'
48000247 blocks
real    10m42,68s
user    0m8,35s
sys     0m30,93s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

HTH,
Gaëtan

Le 13 juil. 09 à 01:15, Scott Lawson a écrit :

Bob,

Output of my run for you. System is an M3000 with 16 GB RAM and 1 zpool called test1, which is contained on a RAID 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. The RAID 1 volume is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become an Oracle 10 server with ZFS filesystems for zones and DB volumes.

[r...@xxx /]#> uname -a
SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise
[r...@xxx /]#> cat /etc/release
Solaris 10 5/09 s10s_u7wos_08 SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 30 March 2009
[r...@xxx /]#> prtdiag -v | more
System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server
System clock frequency: 1064 MHz
Memory size: 16384 Megabytes

Here is the run output for you.

[r...@xxx tmp]#> ./zfs-cache-test.ksh test1
zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ...
Done!
zfs unmount test1/zfscachetest
zfs mount test1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks
real    4m48.94s
user    0m21.58s
sys     0m44.91s

Doing second 'cpio -o > /dev/null'
48000247 blocks
real    6m39.87s
user    0m21.62s
sys     0m46.20s

Feel free to clean up with 'zfs destroy test1/zfscachetest'.

Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60MB/s sustained on the 2nd.

/Scott.

Bob Friesenhahn wrote:

There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware.

It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC), which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously.

In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at

http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data.

There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to:

# ./zfs-cache-test.ksh Sun_2540
zfs create Sun_2540/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ...
Done!
zfs unmount Sun_2540/zfscachetest
zfs mount Sun_2540/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks
real    2m54.17s
user    0m7.65s
sys     0m36.59s

Doing second 'cpio -o > /dev/null'
48000247 blocks
real    11m54.65s
user    0m7.70s
sys     0m35.06s

Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'.

And here is a similar run on my Blade 2500 using the default rpool:

# ./zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks
real    13m3.91s
user    2m43.04s
sys     9m28.73s

Doing second 'cpio -o > /dev/null'
48000247 blocks
real    23m50.27s
user    2m41.81s
sys     9m46.76s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

I am interested to hear about systems which do not suffer from this bug.
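For readers who cannot fetch the script, the outline below gives the flavour of the test. It is not Bob's actual script; the file count, the use of /dev/zero as the data source, and the dataset layout are illustrative assumptions only (zero-filled files are fine as long as compression is off, which is the default).

#!/bin/ksh
# Sketch of the test idea: create more file data than RAM can hold,
# then read it all twice with cpio and compare the two timings.
POOL=${1:-rpool}
FS=$POOL/zfscachetest
zfs create $FS || exit 1
i=0
while [ $i -lt 3000 ]; do
    dd if=/dev/zero of=/$FS/file$i bs=8192000 count=1 2>/dev/null
    i=$((i + 1))
done
zfs unmount $FS && zfs mount $FS             # start the read phase from a fresh mount
cd /$FS || exit 1
time find . -type f | cpio -o > /dev/null    # initial (cold) read
time find . -type f | cpio -o > /dev/null    # second read of data ZFS has seen before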
Re: [zfs-discuss] deduplication
Richard, > Also, we now know the market value for dedupe intellectual property: $2.1 > Billion. > Even though there may be open source, that does not mean there are not IP > barriers. $2.1 Billion attracts a lot of lawyers :-( Indeed, good point. -- Regards, Cyril ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob,

On Sun, Jul 12, 2009 at 23:38, Bob Friesenhahn wrote:
> There has been no forward progress on the ZFS read performance issue for a
> week now. A 4X reduction in file read performance due to having read the
> file before is terrible, and of course the situation is considerably worse
> if the file was previously mmapped as well. Many of us have sent a lot of
> money to Sun and were not aware that ZFS is sucking the life out of our
> expensive Sun hardware.
>
> It is trivially easy to reproduce this problem on multiple machines. For
> example, I reproduced it on my Blade 2500 (SPARC) which uses a simple
> mirrored rpool. On that system there is a 1.8X read slowdown from the file
> being accessed previously.
>
> In order to raise visibility of this issue, I invite others to see if they
> can reproduce it in their ZFS pools. The script at
>
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh
>
> implements a simple test.

--($ ~)-- time sudo ksh zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 Blöcke
real    4m7.70s
user    0m24.10s
sys     1m5.99s

Doing second 'cpio -o > /dev/null'
48000247 Blöcke
real    1m44.88s
user    0m22.26s
sys     0m51.56s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

real    10m47.747s
user    0m54.189s
sys     3m22.039s

This is an M4000 with 32 GB RAM and two HDs in a mirror.

Alexander
--
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hey Bob, Here are my results on a Dual 2.2Ghz Opteron, 8GB of RAM and 16 SATA disks connected via a Supermicro AOC-SAT2-MV8 (albeit with one dead drive). Looks like a 5x slowdown to me: Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 blocks real4m46.45s user0m10.29s sys 0m58.27s Doing second 'cpio -o > /dev/null' 48000247 blocks real15m50.62s user0m10.54s sys 1m11.86s Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi, Solaris 10U7, patched to the latest released patches two weeks ago. Four ST31000340NS attached to two SI3132 SATA controller, RAIDZ1. Selfmade system with 2GB RAM and an x86 (chipid 0x0 AuthenticAMD family 15 model 35 step 2 clock 2210 MHz) AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ processor. On the first run throughput was ~110MB/s, on the second run only 80MB/s. Doing initial (unmount/mount) 'cpio -o > /dev/null' 48000247 Blöcke real3m37.17s user0m11.15s sys 0m47.74s Doing second 'cpio -o > /dev/null' 48000247 Blöcke real4m55.69s user0m10.69s sys 0m47.57s Daniel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"
Yup, just hit exactly the same myself. I have a feeling this faulted disk is affecting performance, so I tried to remove or offline it:

$ zpool iostat -v 30
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rc-pool     1.27T  1015G    682     71  84.0M  1.88M
  mirror     199G   265G      0      5      0  21.1K
    c4t1d0      -      -      0      2      0  21.1K
    c4t2d0      -      -      0      0      0      0
    c5t1d0      -      -      0      2      0  21.1K
  mirror     277G   187G    170      7  21.1M   322K
    c4t3d0      -      -     58      4  7.31M   322K
    c5t2d0      -      -     54      4  6.83M   322K
    c5t0d0      -      -     56      4  6.99M   322K
  mirror     276G   188G    171      6  21.1M   336K
    c5t3d0      -      -     56      4  7.03M   336K
    c4t5d0      -      -     56      3  7.03M   336K
    c4t4d0      -      -     56      3  7.04M   336K
  mirror     276G   188G    169      6  20.9M   353K
    c5t4d0      -      -     57      3  7.17M   353K
    c5t5d0      -      -     54      4  6.79M   353K
    c4t6d0      -      -     55      3  6.99M   353K
  mirror     277G   187G    171     10  20.9M   271K
    c4t7d0      -      -     56      4  7.11M   271K
    c5t6d0      -      -     55      5  6.93M   271K
    c5t7d0      -      -     55      5  6.88M   271K
  c6d1p0       32K   504M      0     34      0   620K
----------  -----  -----  -----  -----  -----  -----

20MB in 30 seconds for 3 disks, that's about 220KB/s per disk. Not healthy at all.

$ zpool status
  pool: rc-pool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: scrub completed after 2h55m with 0 errors on Tue Jun 23 11:11:42 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rc-pool     DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  FAULTED  1.71M 23.3M     0  too many errors
            c5t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
        logs        DEGRADED     0     0     0
          c6d1p0    ONLINE       0     0     0

errors: No known data errors

# zpool offline rc-pool c4t2d0
cannot offline c4t2d0: no valid replicas
# zpool remove rc-pool c4t2d0
cannot remove c4t2d0: only inactive hot spares or cache devices can be removed

-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
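For anyone hitting the same wall: since the disk can be neither offlined nor removed in this state, the usual way forward is to replace it instead. A sketch only, reusing the device names from the status output above; c4t8d0 stands in for whatever spare or new disk is actually available:

# zpool replace rc-pool c4t2d0 c4t8d0   # resilver onto a different, known-good disk
# zpool replace rc-pool c4t2d0          # or replace in place after physically swapping the drive
# zpool clear rc-pool c4t2d0            # only if the errors are believed to be transient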
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
x4540 running snv_117

# ./zfs-cache-test.ksh zpool1
zfs create zpool1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ...
Done!
zfs unmount zpool1/zfscachetest
zfs mount zpool1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks
real    4m7.13s
user    0m9.27s
sys     0m49.09s

Doing second 'cpio -o > /dev/null'
48000247 blocks
real    4m52.52s
user    0m9.13s
sys     0m47.51s

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
Just look at this. I thought all the restarting resilver bugs were fixed, but it looks like something odd is still happening at the start:

Status immediately after starting resilver:

# zpool status
  pool: rc-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h0m, 0.00% done, 57h3m to go
config:

        NAME             STATE     READ WRITE CKSUM
        rc-pool          DEGRADED     0     0     0
          mirror         DEGRADED     0     0     0
            c4t1d0       ONLINE       0     0     0  5.56M resilvered
            replacing    DEGRADED     0     0     0
              c4t2d0s0/o FAULTED  1.71M 23.3M     0  too many errors
              c4t2d0     ONLINE       0     0     0  5.43M resilvered
            c5t1d0       ONLINE       0     0     0  5.55M resilvered

And a few minutes later:

# zpool status
  pool: rc-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h0m, 0.00% done, 245h21m to go
config:

        NAME             STATE     READ WRITE CKSUM
        rc-pool          DEGRADED     0     0     0
          mirror         DEGRADED     0     0     0
            c4t1d0       ONLINE       0     0     0  1.10M resilvered
            replacing    DEGRADED     0     0     0
              c4t2d0s0/o FAULTED  1.71M 23.3M     0  too many errors
              c4t2d0     ONLINE       0     0     0  824K resilvered
            c5t1d0       ONLINE       0     0     0  1.10M resilvered

It's gone from 5MB resilvered to 1MB, and increased the estimated time to 245 hours.
-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Richard Elling writes: > You can only send/receive snapshots. However, on the receiving end, > there will also be a dataset of the name you choose. Since you didn't > share what commands you used, it is pretty impossible for us to > speculate what you might have tried. I thought I made it clear I had not used any commands but gave two detailed examples of different ways to attempt the move. I see now the main thing that confused me is that sending a z1/proje...@something to a new z2/proje...@something would also result in z2/projects being created. That part was not at all clear to me from the man page. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Ross Walker writes: > Once the data is copied you can delete the snapshots that will then > exist on both pools. That's the part right there that wasn't apparent. That zfs send z1/someth...@snap |zfs receive z2/someth...@snap Would also create z2/something > If you have mount options set use the -u option on the recv to have it > defer attempting mounting the conflicting datasets. That little item right there... has probably saved me some serious headaches since in my scheme... the resulting newpool/fs would have the same mount point. ... thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
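Spelled out, the flow being described looks like the following; the dataset and snapshot names are made up for illustration, not taken from Harry's pools:

# zfs snapshot z1/projects@move
# zfs send z1/projects@move | zfs receive -u z2/projects   # creates z2/projects; -u skips mounting it
# zfs set mountpoint=/export/projects-new z2/projects      # resolve the mount point clash first
# zfs mount z2/projects
# zfs destroy z1/projects@move                             # optional cleanup once the copy is verified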
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Here's a more useful output, with having set the number of files to 6000, so that it has a dataset which is larger than the amount of RAM. --($ ~)-- time sudo ksh zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (6000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 96000493 Blöcke real8m44.82s user0m46.85s sys2m15.01s Doing second 'cpio -o > /dev/null' 96000493 Blöcke real29m15.81s user0m45.31s sys3m2.36s Feel free to clean up with 'zfs destroy rpool/zfscachetest'. real48m40.890s user1m47.192s sys8m2.165s Still on S10 U7 Sparc M4000. So I'm now inline with the other results - the 2nd run is WAY slower. 4x as slow. Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
> Richard Elling writes: > >> You can only send/receive snapshots. However, on the receiving end, >> there will also be a dataset of the name you choose. Since you didn't >> share what commands you used, it is pretty impossible for us to >> speculate what you might have tried. > > I thought I made it clear I had not used any commands but gave two > detailed examples of different ways to attempt the move. > > I see now the main thing that confused me is that sending a > z1/proje...@something > to a new z2/proje...@something would also result in z2/projects being > created. > > That part was not at all clear to me from the man page. This will probably get me bombed with napalm but I often just use star from Jörg Schilling because its dead easy : star -copy -p -acl -sparse -dump -C old_dir . new_dir and you're done.[1] So long as you have both the new and the old zfs/ufs/whatever[2] filesystems mounted. It doesn't matter if they are static or not. If anything changes on the filesystem then star will tell you about it. -- Dennis [1] -p means preserve meta-properties of the files/dirs etc. -acl means what it says. Grabs ACL data also. -sparse means what it says. Handles files with holes in them. -dump means be super careful about everything ( read the manpage ) [2] star doesn't care if its zfs or ufs or a CDROM or a floppy. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs the entire server (please help; data included)
Thanks to the help of a zfs/kernel developer at Sun who volunteered to help me, it turns out this was a bug in solaris that needs to be fixed. Bug report here for the curious: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859446 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
Ross, I feel you here, but I don't have much of a solution. The best I can suggest (and has been my solution) is to take out the problematic disk, copy it to a fresh disk (preferably using something like dd_rescue) and then re-install. It seems the resilvering loop is generally a result of a faulty device, but even if it is taken offline, you still have issues. I have had so many zpool resilvering loops, it's not funny. I'm running 2009.06 with all updates applied. I've had a very, very bad batch of disks. I actually have a resilvering loop running right now, and I need to go copy off the offending device. Again. I wish I had a better solution, because the zpool functions fine, no data errors, but resilvering loops forever. I love ZFS as an on-disk format. I increasingly hate the implementation of ZFS software. -Galen On Jul 13, 2009, at 5:34 AM, Ross wrote: Just look at this. I thought all the restarting resilver bugs were fixed, but it looks like something odd is still happening at the start: Status immediately after starting resilver: # zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h0m, 0.00% done, 57h3m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 5.56M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 5.43M resilvered c5t1d0ONLINE 0 0 0 5.55M resilvered And a few minutes later: # zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h0m, 0.00% done, 245h21m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 1.10M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 824K resilvered c5t1d0ONLINE 0 0 0 1.10M resilvered It's gone from 5MB resilvered to 1MB, and increased the estimated time to 245 hours. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
Maybe it's the disks firmware that is bad or maybe they're jumpered for 1.5Gbps on a 3.0 only bus? Or maybe it's a problem with the disk cable/bay/enclosure/slot? It sounds like there is more then ZFS in the mix here. I wonder if the drive's status keeps flapping online/offline and either ZFS or FMA are too lax in marking a drive offline after recurring timeouts. Take a look at your disk enclosure and iostat -En for the number of timeouts happenning. -Ross On Jul 13, 2009, at 9:05 AM, Galen wrote: Ross, I feel you here, but I don't have much of a solution. The best I can suggest (and has been my solution) is to take out the problematic disk, copy it to a fresh disk (preferably using something like dd_rescue) and then re-install. It seems the resilvering loop is generally a result of a faulty device, but even if it is taken offline, you still have issues. I have had so many zpool resilvering loops, it's not funny. I'm running 2009.06 with all updates applied. I've had a very, very bad batch of disks. I actually have a resilvering loop running right now, and I need to go copy off the offending device. Again. I wish I had a better solution, because the zpool functions fine, no data errors, but resilvering loops forever. I love ZFS as an on-disk format. I increasingly hate the implementation of ZFS software. -Galen On Jul 13, 2009, at 5:34 AM, Ross wrote: Just look at this. I thought all the restarting resilver bugs were fixed, but it looks like something odd is still happening at the start: Status immediately after starting resilver: # zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h0m, 0.00% done, 57h3m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 5.56M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 5.43M resilvered c5t1d0ONLINE 0 0 0 5.55M resilvered And a few minutes later: # zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h0m, 0.00% done, 245h21m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 1.10M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 824K resilvered c5t1d0ONLINE 0 0 0 1.10M resilvered It's gone from 5MB resilvered to 1MB, and increased the estimated time to 245 hours. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
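The checks being suggested boil down to something like this (a sketch; nothing here is specific to this particular box):

# iostat -En | grep 'Errors:'   # per-device soft/hard/transport error counters
# fmdump -e | tail -20          # recent FMA error telemetry, if any has been logged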
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote:
> This is an M4000 with 32 GB RAM and two HDs in a mirror.

I think that you should edit the script to increase the file count since your RAM size is big enough to cache most of the data.

Bob
--
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
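As a rough sizing rule (an illustration, not something taken from the script itself): with 8192000-byte files, pick a file count that makes the data set comfortably larger than RAM, say about twice its size.

$ echo $((1024 * 1024 * 1024 / 8192000))   # files per GiB of data -> 131
$ echo $((2 * 32 * 131))                   # roughly 2x RAM on a 32 GB machine -> 8384 files

Alexander's later run with 6000 files (about 48 GB of data against 32 GB of RAM) was already enough to show the effect.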
Re: [zfs-discuss] first use send/receive... somewhat confused.
Dennis Clarke writes: > This will probably get me bombed with napalm but I often just > use star from Jörg Schilling because its dead easy : > > star -copy -p -acl -sparse -dump -C old_dir . new_dir > > and you're done.[1] > > So long as you have both the new and the old zfs/ufs/whatever[2] > filesystems mounted. It doesn't matter if they are static or not. If > anything changes on the filesystem then star will tell you about it. I'm not sure I see how that is easier. The command itself may be but it requires other moves not shown in your command. 1) zfs create z2/projects 2) star -copy -p -acl -sparse -dump -C old_dir . new_dir As a bare minimum would be required. whereas zfs send z1/proje...@snap |zfs receive z2/proje...@snap Is all that is necessary using zfs send receive, and the new filesystem z2/projects is created and populated with data from z1/projects, not to mention a snapshot at z2/projects/.zfs/snapshot ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
> Dennis Clarke writes: > >> This will probably get me bombed with napalm but I often just >> use star from Jörg Schilling because its dead easy : >> >> star -copy -p -acl -sparse -dump -C old_dir . new_dir >> >> and you're done.[1] >> >> So long as you have both the new and the old zfs/ufs/whatever[2] >> filesystems mounted. It doesn't matter if they are static or not. If >> anything changes on the filesystem then star will tell you about it. > > I'm not sure I see how that is easier. > > The command itself may be but it requires other moves not shown in > your command. > > 1) zfs create z2/projects > > 2) star -copy -p -acl -sparse -dump -C old_dir . new_dir > > As a bare minimum would be required. > > whereas > zfs send z1/proje...@snap |zfs receive z2/proje...@snap > > Is all that is necessary using zfs send receive, and the new > filesystem z2/projects is created and populated with data from > z1/projects, not to mention a snapshot at z2/projects/.zfs/snapshot sort of depends on what you want to get done and both work. dc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote: Still on S10 U7 Sparc M4000. So I'm now inline with the other results - the 2nd run is WAY slower. 4x as slow. It would be good to see results from a few OpenSolaris users running a recent 64-bit kernel, and with fast storage to see if this is an OpenSolaris issue as well. It seems likely to be more evident with fast SAS disks or SAN devices rather than a few SATA disks since the SATA disks have more access latency. Pools composed of mirrors should offer less read latency as well. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] questions regarding RFE 6334757 and CR 6322205 disk write cache. thanks (case 11356581)
Hello experts,

I would like to ask some questions regarding RFE 6334757 and CR 6322205 (disk write cache).

==
RFE 6334757 disk write cache should be enabled and should have a tool to switch it on and off
CR 6322205 Enable disk write cache if ZFS owns the disk
==

The customer found that on a SPARC Enterprise T5140 the disk write cache was disabled when the machine shipped from the Sun factory, but after installing ZFS the disk write cache turned out to be enabled.

My questions are:

1) When the SPARC Enterprise T5140 ships from the factory, is the disk write cache set to disabled? From RFE 6334757 it appears to be disabled as shipped, but I am not sure.

2) On Solaris 10, after installing ZFS and zones, will the disk write cache be set to enabled? Which action causes the disk write cache setting to change? From CR 6322205 ("Enable disk write cache if ZFS owns the disk") we can see that as long as ZFS owns the disk the write cache will be enabled, but what operation counts as "ZFS owns the disk"?

3) If the disk write cache is changed from enabled to disabled, is there any impact or problem for the system?

4) For FRU parts, is the write cache set to disabled?

Thank you very much.

Best Regards
chunhuan
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
No, I don't think I need to take a disk out. It's running ok now, it just seemed to get a bit confused at the start: $ zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 2h11m, 31.09% done, 4h51m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 101M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 62.3G resilvered c5t1d0ONLINE 0 0 0 101M resilvered mirror ONLINE 0 0 0 c4t3d0ONLINE 0 0 0 c5t2d0ONLINE 0 0 0 c5t0d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t3d0ONLINE 0 0 0 c4t5d0ONLINE 0 0 0 c4t4d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t4d0ONLINE 0 0 0 c5t5d0ONLINE 0 0 0 c4t6d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c4t7d0ONLINE 0 13.0K 0 c5t6d0ONLINE 0 0 0 c5t7d0ONLINE 0 0 0 logs DEGRADED 0 0 0 c6d1p0 ONLINE 0 0 0 errors: No known data errors -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Harry Putnam wrote: > Dennis Clarke writes: > > > This will probably get me bombed with napalm but I often just > > use star from Jörg Schilling because its dead easy : > > > > star -copy -p -acl -sparse -dump -C old_dir . new_dir > > > > and you're done.[1] > > > > So long as you have both the new and the old zfs/ufs/whatever[2] > > filesystems mounted. It doesn't matter if they are static or not. If > > anything changes on the filesystem then star will tell you about it. > > I'm not sure I see how that is easier. > > The command itself may be but it requires other moves not shown in > your command. Could you please explain your claims? Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Joerg Schilling wrote:
> Harry Putnam wrote:
>> Dennis Clarke writes:
>>> This will probably get me bombed with napalm but I often just
>>> use star from Jörg Schilling because its dead easy :
>>>
>>> star -copy -p -acl -sparse -dump -C old_dir . new_dir
>>>
>>> and you're done.[1]
>>>
>>> So long as you have both the new and the old zfs/ufs/whatever[2]
>>> filesystems mounted. It doesn't matter if they are static or not. If
>>> anything changes on the filesystem then star will tell you about it.
>>
>> I'm not sure I see how that is easier.
>>
>> The command itself may be but it requires other moves not shown in
>> your command.
>
> Could you please explain your claims?

star doesn't (and shouldn't) create the destination ZFS filesystem like zfs recv would. It also doesn't preserve dataset-level properties the way zfs recv would.

On the other hand, using star (or rsync, which is what I tend to do) gives more flexibility in that the source and destination filesystem types can be different, or the destination need not even be a filesystem!

zfs send|recv and [g,s]tar exist for different purposes, but there are some overlapping use cases where either could do the job.

--
Darren J Moffat
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
Gaaah, looks like I spoke too soon: $ zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 2h59m, 77.89% done, 0h50m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 218M resilvered replacing UNAVAIL 0 963K 0 insufficient replicas c4t2d0s0/o FAULTED 1.71M 23.4M 0 too many errors c4t2d0 REMOVED 0 964K 0 67.0G resilvered c5t1d0ONLINE 0 0 0 218M resilvered mirror ONLINE 0 0 0 c4t3d0ONLINE 0 0 0 c5t2d0ONLINE 0 0 0 c5t0d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t3d0ONLINE 0 0 0 c4t5d0ONLINE 0 0 0 c4t4d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t4d0ONLINE 0 0 0 c5t5d0ONLINE 0 0 0 c4t6d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c4t7d0ONLINE 0 13.0K 0 c5t6d0ONLINE 0 0 0 c5t7d0ONLINE 0 0 0 logs DEGRADED 0 0 0 c6d1p0 ONLINE 0 0 0 errors: No known data errors There are a whole bunch of errors in /var/adm/messages: Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.warning] WARNING: /p...@1,0/pci1022,7...@1/pci11ab,1...@2/d...@2,0 (sd3): Jul 13 15:56:53 rob-036 Error for Command: write(10) Error Level: Retryable Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] Requested Block: 83778048 Error Block: 83778048 Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Jul 13 15:57:31 rob-036 scsi: [ID 107833 kern.warning] WARNING: /p...@1,0/pci1022,7...@1/pci11ab,1...@2/d...@2,0 (sd3): Jul 13 15:57:31 rob-036 Command failed to complete...Device is gone Not what I would expect from a brand new drive!! Does anybody have any tips on how i can work out where the fault lies here? I wouldn't expect controller with so many other drives working, and what on earth is the proper technique for replacing a drive that failed part way through a resilver? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
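On the last question, one common sequence is sketched below, assuming the rest of the mirror (c4t1d0 and c5t1d0) stays healthy. This is an illustration only; c4t8d0 stands in for whatever good disk goes in, and the exact name ZFS shows for the old half of the 'replacing' vdev may differ:

# zpool detach rc-pool c4t2d0          # back the failing new disk out of the half-finished replace
# zpool detach rc-pool c4t2d0s0/o      # drop the original faulted disk if it is still listed
# zpool attach rc-pool c4t1d0 c4t8d0   # attach a known-good disk back into the mirror and resilver
# zpool status -v rc-pool              # watch the new resilver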
Re: [zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
Ross, The disks do have problems - that's why I'm resilvering. I've seen zero read, write or checksum errors and had it loop. Now I do have a number of read errors on some of the disks, but I think resilvering is missing the point if it can't deal with corrupt data or disks with a small amount of unreadable data. -Galen On Jul 13, 2009, at 6:50 AM, Ross Walker wrote: Maybe it's the disks firmware that is bad or maybe they're jumpered for 1.5Gbps on a 3.0 only bus? Or maybe it's a problem with the disk cable/bay/enclosure/slot? It sounds like there is more then ZFS in the mix here. I wonder if the drive's status keeps flapping online/offline and either ZFS or FMA are too lax in marking a drive offline after recurring timeouts. Take a look at your disk enclosure and iostat -En for the number of timeouts happenning. -Ross On Jul 13, 2009, at 9:05 AM, Galen wrote: Ross, I feel you here, but I don't have much of a solution. The best I can suggest (and has been my solution) is to take out the problematic disk, copy it to a fresh disk (preferably using something like dd_rescue) and then re-install. It seems the resilvering loop is generally a result of a faulty device, but even if it is taken offline, you still have issues. I have had so many zpool resilvering loops, it's not funny. I'm running 2009.06 with all updates applied. I've had a very, very bad batch of disks. I actually have a resilvering loop running right now, and I need to go copy off the offending device. Again. I wish I had a better solution, because the zpool functions fine, no data errors, but resilvering loops forever. I love ZFS as an on- disk format. I increasingly hate the implementation of ZFS software. -Galen On Jul 13, 2009, at 5:34 AM, Ross wrote: Just look at this. I thought all the restarting resilver bugs were fixed, but it looks like something odd is still happening at the start: Status immediately after starting resilver: # zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h0m, 0.00% done, 57h3m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 5.56M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 5.43M resilvered c5t1d0ONLINE 0 0 0 5.55M resilvered And a few minutes later: # zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h0m, 0.00% done, 245h21m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 1.10M resilvered replacing DEGRADED 0 0 0 c4t2d0s0/o FAULTED 1.71M 23.3M 0 too many errors c4t2d0 ONLINE 0 0 0 824K resilvered c5t1d0ONLINE 0 0 0 1.10M resilvered It's gone from 5MB resilvered to 1MB, and increased the estimated time to 245 hours. 
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris 2008.11 - resilver still restarting
On Jul 13, 2009, at 11:33 AM, Ross wrote: Gaaah, looks like I spoke too soon: $ zpool status pool: rc-pool state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 2h59m, 77.89% done, 0h50m to go config: NAME STATE READ WRITE CKSUM rc-pool DEGRADED 0 0 0 mirror DEGRADED 0 0 0 c4t1d0ONLINE 0 0 0 218M resilvered replacing UNAVAIL 0 963K 0 insufficient replicas c4t2d0s0/o FAULTED 1.71M 23.4M 0 too many errors c4t2d0 REMOVED 0 964K 0 67.0G resilvered c5t1d0ONLINE 0 0 0 218M resilvered mirror ONLINE 0 0 0 c4t3d0ONLINE 0 0 0 c5t2d0ONLINE 0 0 0 c5t0d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t3d0ONLINE 0 0 0 c4t5d0ONLINE 0 0 0 c4t4d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t4d0ONLINE 0 0 0 c5t5d0ONLINE 0 0 0 c4t6d0ONLINE 0 0 0 mirror ONLINE 0 0 0 c4t7d0ONLINE 0 13.0K 0 c5t6d0ONLINE 0 0 0 c5t7d0ONLINE 0 0 0 logs DEGRADED 0 0 0 c6d1p0 ONLINE 0 0 0 errors: No known data errors There are a whole bunch of errors in /var/adm/messages: Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.warning] WARNING: / p...@1,0/pci1022,7...@1/pci11ab,1...@2/d...@2,0 (sd3): Jul 13 15:56:53 rob-036 Error for Command: write (10) Error Level: Retryable Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] Requested Block: 83778048 Error Block: 83778048 Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] Vendor: ATASerial Number: Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Jul 13 15:57:31 rob-036 scsi: [ID 107833 kern.warning] WARNING: / p...@1,0/pci1022,7...@1/pci11ab,1...@2/d...@2,0 (sd3): Jul 13 15:57:31 rob-036 Command failed to complete...Device is gone Not what I would expect from a brand new drive!! Does anybody have any tips on how i can work out where the fault lies here? I wouldn't expect controller with so many other drives working, and what on earth is the proper technique for replacing a drive that failed part way through a resilver? I really believe there is a problem with either the cabling or the enclosure's backplane here. Two disks is statistical coincidence, three disks means, it ain't the disks that are bad (if you checked and there was no recall and the firmware is correct and up to date). Fix the real problem and the disks already in place should resilver without further interruption. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Darren J Moffat wrote: > >>> use star from Jörg Schilling because its dead easy : > >>> > >>> star -copy -p -acl -sparse -dump -C old_dir . new_dir ... > star doesn't (and shouldn't) create the destination ZFS filesystem like > the zfs recv would. It also doesn't preserve the dataset level would do. As star is software that cleany lives above the filesystem layer, this is what people would expect ;-) > One the other hand using star (or rsync which is what I tend to do) > gives more flexibility in that the source and destination filesystem > types can be different or even not a filesystem! star is highly optimized and it's build in find(1) (using libfind) gives you many interesting features. zfs send seems to be tied to the zfs version and this is another reason why zfs send | receive may not even work on a 100% zfs based playground. > zfs send|recv and [g,s]tar exist for different purposes, but there are > some overlapping use cases either either could do the job. It would be nice if there was a discussion that does mention features instead of always proposing zfs send Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Interesting, I repeated the test on a few other machines running newer builds. First impressions are good: snv_114, virtual machine, 1GB RAM, 30GB disk - 16% slowdown. (Only 9GB free so I ran an 8GB test) Doing initial (unmount/mount) 'cpio -o > /dev/null' 1683 blocks real3m4.85s user0m16.74s sys 0m41.69s Doing second 'cpio -o > /dev/null' 1683 blocks real3m34.58s user0m18.85s sys 0m45.40s And again on snv_117, Sun x2200, 40GB RAM, single 500GB sata disk: First run (with the default 24GB set): real6m25.15s user0m11.93s sys 0m54.93s Doing second 'cpio -o > /dev/null' 48000247 blocks real1m9.97s user0m12.17s sys 0m57.80s ... d'oh! At least I know the ARC is working :-) The second run, with a 98GB test is running now, I'll post the results in the morning. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first use send/receive... somewhat confused.
Joerg Schilling wrote: Darren J Moffat wrote: use star from Jörg Schilling because its dead easy : star -copy -p -acl -sparse -dump -C old_dir . new_dir ... star doesn't (and shouldn't) create the destination ZFS filesystem like the zfs recv would. It also doesn't preserve the dataset level would do. As star is software that cleany lives above the filesystem layer, this is what people would expect ;-) Indeed but that is why the "extra steps" are needed to create the destination ZFS filesystem. Not a bad thing or a criticism of star just a fact (and in answer to the question you asked). One the other hand using star (or rsync which is what I tend to do) gives more flexibility in that the source and destination filesystem types can be different or even not a filesystem! star is highly optimized and it's build in find(1) (using libfind) gives you many interesting features. I'm sure the authors of rsync could make a similar statement :-) zfs send seems to be tied to the zfs version and this is another reason why zfs send | receive may not even work on a 100% zfs based playground. Indeed but it isn't, like tar (and variants there of), an archiver but a means of providing replication of ZFS datasets based on ZFS snapshots and works at the ZFS DMU layer. zfs send|recv and [g,s]tar exist for different purposes, but there are some overlapping use cases either either could do the job. It would be nice if there was a discussion that does mention features instead of always proposing zfs send In general I completely agree, however this particular thread (given its title) is about zfs send|recv :-) -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html Brad Brad Diggs Senior Directory Architect Virtualization Architect xVM Technology Lead Sun Microsystems, Inc. Phone x52957/+1 972-992-0002 Mail bradley.di...@sun.com Blog http://TheZoneManager.com Blog http://BradDiggs.com On Jul 4, 2009, at 2:48 AM, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies into the Solaris page cache. As long as they remain there ZFS will be slow for those files, even if you subsequently use read(2) to access them. If you reboot, your cpio(1) tests will probably go fast again, until someone uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod I can't be sure. It would be interesting to see how tar(1) performs if you run that test before cp(1) on a freshly rebooted system. I have done some work with the ZFS team towards a fix, but it is only currently in OpenSolaris. The other thing that slows you down is that ZFS only flushes to disk every 5 seconds if there are no synchronous writes. It would be interesting to see iostat -xnz 1 while you are running your tests. You may find the disks are writing very efficiently for one second in every five. Hope this helps, Phil blogs.sun.com/pgdh Sent from my iPod On 4 Jul 2009, at 05:26, Bob Friesenhahn wrote: On Fri, 3 Jul 2009, Bob Friesenhahn wrote: Copy MethodData Rate == cpio -pdum75 MB/s cp -r32 MB/s tar -cf - . | (cd dest && tar -xf -)26 MB/s It seems that the above should be ammended. Running the cpio based copy again results in zpool iostat only reporting a read bandwidth of 33 MB/second. The system seems to get slower and slower as it runs. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
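For reference, the tunable most often discussed alongside that advice is a cap on the ARC, so it stops competing with application memory. A sketch, with a deliberately arbitrary 4 GB value that must be sized per workload:

# cat >> /etc/system <<'EOF'
* Cap the ZFS ARC at 4 GB (example value only); takes effect after a reboot
set zfs:zfs_arc_max = 4294967296
EOF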
Re: [zfs-discuss] zfs rpool boot failed
On 07/11/09 05:15, iman habibi wrote: Dear Admins I had solaris 10u8 installation based on ZFS (rpool)filesystem on two mirrored scsi disks in sunfire v880. but after some months,when i reboot server with reboot command,it didnt boot from disks,and returns cant boot from boot media. how can i recover some data from my previous installation? also i run >boot disk0 (failed) >boot disk1 (failed) also run >probe-scsi-all,,then boot from each disk,it returns failed,,why? thank for any guide Regards Someone with more knowledge of the boot proms might have to help you with the boot failures, but if you're looking for a way to recover data from the root pools, you could try booting from your installation medium (whether that's a local CD/DVD or a network installation image), escaping out of the install, and try importing the pool. Lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
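On a V880 that recovery path looks roughly like the following from the OBP prompt (a sketch; adjust the boot device and alternate root for your setup):

ok boot cdrom -s                 # or: boot net -s, for a network install image
# zpool import                   # list the pools the miniroot can see
# zpool import -f -R /a rpool    # import the root pool under an alternate root
# zfs list -r rpool              # the old filesystems should now be reachable under /a for copying off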
Re: [zfs-discuss] first use send/receive... somewhat confused.
joerg.schill...@fokus.fraunhofer.de (Joerg Schilling) writes: > Harry Putnam wrote: > >> Dennis Clarke writes: >> >> > This will probably get me bombed with napalm but I often just >> > use star from Jörg Schilling because its dead easy : >> > >> > star -copy -p -acl -sparse -dump -C old_dir . new_dir >> > >> > and you're done.[1] >> > >> > So long as you have both the new and the old zfs/ufs/whatever[2] >> > filesystems mounted. It doesn't matter if they are static or not. If >> > anything changes on the filesystem then star will tell you about it. >> >> I'm not sure I see how that is easier. >> >> The command itself may be but it requires other moves not shown in >> your command. > > Could you please explain your claims? Well it may be a case of newbie shooting off mouth on basis of small knowledge but the setup you showed with star does not create a zfs filesystem. I guess that would have to be done externally. Whereas send/receive does that part for you. My first thought was rsync... as a long time linux user... thats where I would usually turn... until other posters pointed out how send/receive works. i.e. It creates a new zfs filesystem for you, which was exactly what I was after. So by using send/receive with -u I was able in one move to 1) create a zfs filesystem 2) mount it automatically 3) transfer the data to the new fs `star' only does the last one right? I then had a few external chores like setting options or changing mountpoint. Something that would have had to be done using `star' too. (That is, before using star) That alone was the basis of what you call my `claims'. Would `star' move or create the .zfs directory? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Brad Diggs wrote: You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html Your post makes it sound like there is not a bug in the operating system. It does not take long to see that there is a bug in the Solaris 10 operating system. It is not clear if the same bug is shared by current OpenSolaris since it seems like it has not been tested. Solaris 10 U7 reads files that it has not seen before at a constant rate regardless of the amount of file data it has already read. When the file is read a second time, the read is 4X or more slower. If reads were slowing down because the ARC was slow to expunge stale data, then that would be apparent on the first read pass. However, the reads are not slowing down in the first read pass. ZFS goes into the weeds if it has seen a file before but none of the file data is resident in the ARC. It is pathetic that a Sun RAID array that I paid $21K for out of my own life savings is not able to perform better than the cheapo portable USB drives that I use for backup because of ZFS. This is making me madder and madder by the minute. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Sun X4500 (thumper) with 16Gb of memory running Solaris 10 U6 with patches current to the end of Feb 2009. Current ARC size is ~6Gb. ZFS filesystem created in a ~3.2 Tb pool consisting of 7 sets of mirrored 500Gb SATA drives. I used 4000 8Mb files for a total of 32Gb. run 1: ~140M/s average according to zpool iostat real4m1.11s user0m10.44s sys 0m50.76s run 2: ~37M/s average according to zpool iostat real13m53.43s user0m10.62s sys 0m55.80s A zfs unmount followed by a mount of the filesystem returned the performance to the run 1 case. real3m58.16s user0m11.54s sys 0m51.95s In summary, the second run performance drops to about 30% of the original run. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 9:34 AM, Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Alexander Skwar wrote: >> >> Still on S10 U7 Sparc M4000. >> >> So I'm now inline with the other results - the 2nd run is WAY slower. 4x >> as slow. > > It would be good to see results from a few OpenSolaris users running a > recent 64-bit kernel, and with fast storage to see if this is an OpenSolaris > issue as well. Indeed it is. Using ldoms with tmpfs as the backing store for virtual disks, I see: With S10u7: # ./zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m30.35s user0m9.90s sys 0m19.81s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m43.95s user0m9.67s sys 0m17.96s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. # ./zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m31.14s user0m10.09s sys 0m20.47s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m40.24s user0m9.68s sys 0m17.86s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. When I move the zpool to a 2009.06 ldom, # /var/tmp/zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m30.09s user0m9.58s sys 0m19.83s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m44.21s user0m9.47s sys 0m18.18s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real0m29.89s user0m9.58s sys 0m19.72s Doing second 'cpio -o > /dev/null' 4800025 blocks real0m44.40s user0m9.59s sys 0m18.24s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. Notice in these runs that each time the usr+sys time of the first run adds up to the elapsed time - the rate was choked by CPU. This is verified by "prstat -mL". The second run seemed to be slow due to a lock as we had just demonstrated that the IO path can do more (not an IO bottleneck) and "prstat -mL shows cpio at in sleep for a significant amount of time. FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) 'cpio -o > /dev/null' 4800025 blocks real4m21.57s user0m9.72s sys 0m36.30s Doing second 'cpio -o > /dev/null' 4800025 blocks real4m21.56s user0m9.72s sys 0m36.19s Feel free to clean up with 'zfs destroy testpool/zfscachetest'. This bug report contains more detail of the configuration. One thing not covered in that bug report is that the S10u7 ldom has 2048 MB of RAM and the 2009.06 ldom has 2024 MB of RAM. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
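For anyone wanting to repeat that experiment, the knob involved is the per-dataset primarycache property (shown here against the test dataset from the script; the property only exists on newer builds such as 2009.06, not on older Solaris 10 updates):

# zfs get primarycache testpool/zfscachetest
# zfs set primarycache=none testpool/zfscachetest       # bypass the ARC for both data and metadata
# zfs set primarycache=metadata testpool/zfscachetest   # or cache metadata only
# zfs set primarycache=all testpool/zfscachetest        # back to the default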
Re: [zfs-discuss] questions regarding RFE 6334757 and CR 6322205 disk write cache. thanks (case 11356581)
1) Turning on write caching is potentially dangerous because the disk will indicate that data has been written (to cache) before it has actually been written to non-volatile storage (disk). Since the factory has no way of knowing how you'll use your T5140, I'm guessing that they set the disk write caches off by default.

2) Since ZFS "knows" about disk caches and ensures that it issues synchronous writes where required, it is safe to turn on write caching when the *ENTIRE* disk is used for ZFS. Accordingly, ZFS will attempt to turn on a disk's write cache whenever you add the *ENTIRE* disk to a zpool. If you add only a disk slice to a zpool, ZFS will not try to turn on write caching since it doesn't know whether other portions of the disk will be used for applications which are not write-cache safe.

zpool create pool01 c0t0d0    <- ZFS will try to turn on disk write cache since using entire disk
zpool create pool02 c0t0d0s1  <- ZFS will not try to turn on disk write cache (only using 1 slice)

To avoid future disk replacement problems (e.g. if the replacement disk is slightly smaller), we generally create a single disk slice that takes up almost the entire disk and then build our pools on these slices. ZFS doesn't turn on the write cache in this case, but since we know that the disk is only being used for ZFS we can (and do!) safely turn on the write cache manually.

3) You can change the write (and read) cache settings using the "cache" submenu of the "format -e" command. If you disable the write cache where it could safely be enabled, you will only reduce the performance of the system. If you enable the write cache where it should not be enabled, you run the risk of data loss and/or corruption in the event of a power loss.

4) I wouldn't assume any particular setting for FRU parts, although I believe that Sun parts generally ship with the write caches disabled. Better to explicitly check using "format -e".
-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
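(A minimal sketch of checking and enabling the write cache by hand with "format -e"; the disk number is just an example and the exact menu wording can vary between releases. Only enable it if the whole disk is dedicated to ZFS, as described above:)

# format -e
Specify disk (enter its number): 0     <- pick the disk, e.g. c0t0d0
format> cache
cache> write_cache
write_cache> display
write_cache> enable
write_cache> quit
cache> quit
format> quit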
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote:

> FWIW, I hit another bug if I turn off primarycache.
>
> http://defect.opensolaris.org/bz/show_bug.cgi?id=10004
>
> This causes really abysmal performance - but equally so for repeat runs!

It is quite fascinating seeing the huge difference in I/O performance from these various reports. The bug you reported seems likely to be that without at least a little bit of caching, it is necessary to re-request the underlying 128K ZFS block several times as the program does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally disabling data caching seems best reserved for block-oriented databases which are looking for a substitute for directio(3C).

It is easily demonstrated that the problem seen in Solaris 10 (jury still out on OpenSolaris, although one report has been posted) is due to some sort of confusion. It is not due to delays caused by purging old data from the ARC. If these delays were caused by purging data from the ARC, then 'zpool iostat' would start showing lower read performance once the ARC becomes full, but that is not the case.

Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
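(For anyone reproducing this, a simple way to watch the read rate during both cpio passes; a minimal sketch assuming the pool is named rpool and a 5 second sampling interval:)

# zpool iostat -v rpool 5    # per-vdev operations and bandwidth, refreshed every 5 seconds

(If ARC eviction were the cause, the read bandwidth would be expected to sag during the first pass once the ARC fills; in the reports above it only collapses on the second pass.)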
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Mike Gerdts wrote: > > > > FWIW, I hit another bug if I turn off primarycache. > > > > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 > > > > This causes really abysmal performance - but equally so for repeat runs! > > It is quite facinating seeing the huge difference in I/O performance > from these various reports. The bug you reported seems likely to be > that without at least a little bit of caching, it is necessary to > re-request the underlying 128K ZFS block several times as the program > does numerous smaller I/Os (cpio uses 10240 bytes?) across it. cpio reads/writes in 8192 byte chunks from the filesystem. BTW: star by default creates a shared memory based FIFO of 8 MB size and reads in the biggest possible size that would currently fit into the FIFO. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I can not comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering. Thanks, /jim Bob Friesenhahn wrote: On Mon, 13 Jul 2009, Mike Gerdts wrote: FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! It is quite facinating seeing the huge difference in I/O performance from these various reports. The bug you reported seems likely to be that without at least a little bit of caching, it is necessary to re-request the underlying 128K ZFS block several times as the program does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally disabling data caching seems best reserved for block-oriented databases which are looking for a substitute for directio(3C). It is easily demonstrated that the problem seen in Solaris 10 (jury still out on OpenSolaris although one report has been posted) is due to some sort of confusion. It is not due to delays caused by purging old data from the ARC. If these delays were caused by purging data from the ARC, then 'zfs iostat' would start showing lower read performance once the ARC becomes full, but that is not the case. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] deduplication
> "jcm" == James C McPherson writes: > "dm" == David Magda writes: jcm> What I can say, however, is that "open source" does not always jcm> equate to requiring "open development". +1 To maintain what draws me to free software, you must * release binaries and source at the same time that also means none of this bullshit where you send someone a binary of your work for ``testing''. BSD developers do this all the time not really meaning anything bad by it, but for CDDL or GPL both by law and by custom, you do ``testing'' then you get to see source, period. * allow free enough access to the source that whoever gets it can fork and continue development under any organizing process they want. The organizing process for development is also worth talking about, but for me it isn't such a clear political movement. Even the projects that unlike Solaris have always been open, where openness is their core goal above anything else, still benefit from openbsd hackathons, the .nl HAR camp, and other meetings where insiders who know each other personally sequester themselves in physical proximity and privately work on something which they release all at once when the camping trip is over. Private development branches can be good, and certainly don't scare me away from a project the same way as intentional GPL incompatibility, closed-source stable branches, proprietary installer-maker build scripts, scattering of binary blobs throughout the tree, selling hardware as a VAR then dropping the ball getting free drivers out of the OEM's, and so on. There are other organizing things I absolutely do have a problem with. For example, attracting discussion to censored web forums (which on OpenSolaris we do NOT have because here the forums are just extra-friendly mailing list archives plus a posting interface for web20 idiots, but many Linux subprojects do have censored forums). And PR-expurgated read-only bug databases (which OpenSolaris does have while Ubuntu, Debian, Gentoo, u.s.w. do not). There's a second problem with GPL at Akamai and Google. Suppose Greenbytes wrote dedup changes but didn't release their source, then started selling deduplicated hosted storage over vlan in several major telco hotels. I'd have a political/community-advocacy problem with that, and probably no legal remedy. pgplo4voDJeYe.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote: cpio reads/writes in 8192 byte chunks from the filesystem. Yes, I was just reading the cpio manual page and see that. I think that re-reading the 128K zfs block 16 times to satisfy each request for 8192 bytes explains the 16X performance loss when caching is disabled. I don't think that this is strictly a bug since it is what the database folks are looking for. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
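(The 16X arithmetic can be seen directly with dd; a minimal sketch, where the file path is a placeholder and the dataset is assumed to have primarycache=none and the default 128K recordsize:)

# dd if=/testpool/zfscachetest/somefile of=/dev/null bs=8k     # each 8K read(2) re-fetches its containing 128K ZFS block
# dd if=/testpool/zfscachetest/somefile of=/dev/null bs=128k   # one read(2) per ZFS block

(With caching off, the first form should pull roughly 16 times as much data from disk as the second, 128K/8K = 16, matching the slowdown estimated above.)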
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:16 PM, Joerg Schilling wrote: > Bob Friesenhahn wrote: > >> On Mon, 13 Jul 2009, Mike Gerdts wrote: >> > >> > FWIW, I hit another bug if I turn off primarycache. >> > >> > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 >> > >> > This causes really abysmal performance - but equally so for repeat runs! >> >> It is quite facinating seeing the huge difference in I/O performance >> from these various reports. The bug you reported seems likely to be >> that without at least a little bit of caching, it is necessary to >> re-request the underlying 128K ZFS block several times as the program >> does numerous smaller I/Os (cpio uses 10240 bytes?) across it. > > cpio reads/writes in 8192 byte chunks from the filesystem. > > BTW: star by default creates a shared memory based FIFO of 8 MB size and > reads in the biggest possible size that would currently fit into the FIFO. > > Jörg Using cpio's -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn't modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) >/dev/null". At this point cpio was spending roughly 13% usr and 87% sys. I haven't tried star, but I did see that I could also reproduce with "cat $file | cat > /dev/null". This seems like a worthless use of cat, but it forces cat to actually copy data from input to output unlike when cat can mmap input and output. When it does that and output is /dev/null Solaris is smart enough to avoid any reads. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:23 PM, Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Joerg Schilling wrote: >> >> cpio reads/writes in 8192 byte chunks from the filesystem. > > Yes, I was just reading the cpio manual page and see that. I think that > re-reading the 128K zfs block 16 times to satisfy each request for 8192 > bytes explains the 16X performance loss when caching is disabled. I don't > think that this is strictly a bug since it is what the database folks are > looking for. > > Bob I did other tests with "dd bs=128k" and verified via truss that each read(2) was returning 128K. I thought I had seen excessive reads there too, but now I can't reproduce that. Creating another fs with recordsize=8k seems to make this behavior go away - things seem to be working as designed. I'll go update the (nota-)bug. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
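(A minimal sketch of that comparison; the dataset name is a placeholder:)

# zfs create -o recordsize=8k testpool/rs8k    # block size matched to the application's 8K reads
# zfs set primarycache=none testpool/rs8k      # the same no-caching setting as in the bug report

(With recordsize=8k each 8K read(2) maps onto exactly one ZFS block, so disabling the cache no longer forces a 128K block to be re-read for every small request.)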
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 13, 2009, at 2:54 PM, Bob Friesenhahn wrote:

> On Mon, 13 Jul 2009, Brad Diggs wrote:
>> You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps. http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html
>
> Your post makes it sound like there is not a bug in the operating system. It does not take long to see that there is a bug in the Solaris 10 operating system. It is not clear if the same bug is shared by current OpenSolaris since it seems like it has not been tested.
>
> Solaris 10 U7 reads files that it has not seen before at a constant rate regardless of the amount of file data it has already read. When the file is read a second time, the read is 4X or more slower. If reads were slowing down because the ARC was slow to expunge stale data, then that would be apparent on the first read pass. However, the reads are not slowing down in the first read pass. ZFS goes into the weeds if it has seen a file before but none of the file data is resident in the ARC.
>
> It is pathetic that a Sun RAID array that I paid $21K for out of my own life savings is not able to perform better than the cheapo portable USB drives that I use for backup, because of ZFS. This is making me madder and madder by the minute.

Have you tried limiting the ARC so it doesn't squash the page cache?

Make sure page cache has enough for mmap plus buffers for bouncing between it and the ARC. I would say 1GB minimum, 2 to be safe.

-Ross
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
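(For reference, capping the ARC is done in /etc/system and takes effect at the next reboot; a minimal sketch where the 10 GB figure is only an example, 0x280000000 being 10 GiB:)

* Cap the ZFS ARC at 10 GiB, leaving the rest of RAM for the page cache and applications
set zfs:zfs_arc_max = 0x280000000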
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote:

> Using cpio's -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn't modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) >/dev/null". At this point cpio was spending roughly 13% usr and 87% sys.

Interesting. I just updated zfs-cache-test.ksh on my web site so that it uses 131072 byte blocks. I see a tiny improvement in performance from doing this, and a bit less CPU consumption, so that the CPU consumption is now essentially zero. The bug remains. It seems best to use ZFS's ideal block size so that issues don't get confused.

Using an ARC monitoring script called 'arcstat.pl' I see a huge number of 'dmis' events when performance is poor. The ARC size is 7GB, which is less than its prescribed cap of 10GB.

Better:
    Time   read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
15:39:37    20K    1K      6    58    0    1K  100    19  100     7G   10G
15:39:38    19K    1K      5    57    0    1K  100    19  100     7G   10G
15:39:39    19K    1K      6    54    0    1K  100    18  100     7G   10G
15:39:40    17K    1K      6    51    0    1K  100    17  100     7G   10G

Worse:
    Time   read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
15:43:24     4K   280      6   280    6     0    0     4  100     9G   10G
15:43:25     4K   277      6   277    6     0    0     4  100     9G   10G
15:43:26     4K   268      6   268    6     0    0     5  100     9G   10G
15:43:27     4K   259      6   259    6     0    0     4  100     9G   10G

An ARC stats summary from a tool called 'arc_summary.pl' is appended to this message. Operation is quite consistent across the full span of files.

Since 'dmis' is still low when things are "good" (and even when the ARC has surely cycled already), this leads me to believe that prefetch is mostly working and is usually satisfying read requests. When things go bad I see that 'dmis' becomes 100% of the misses. A hypothesis is that if zfs thinks that the data might be in the ARC (due to having seen the file before), it disables file prefetch entirely, assuming that it can retrieve the data from its cache. Then, once it finally determines that there is no cached data after all, it issues a read request.

Even the "better" read performance is 1/2 of what I would expect from my hardware, based on prior test results from 'iozone'. More prefetch would surely help.
Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

System Memory:
  Physical RAM:  20470 MB
  Free Memory :  2511 MB
  LotsFree:      312 MB

ZFS Tunables (/etc/system):
  * set zfs:zfs_arc_max = 0x3
  set zfs:zfs_arc_max = 0x28000
  * set zfs:zfs_arc_max = 0x2
  set zfs:zfs_write_limit_override = 0xea60
  * set zfs:zfs_write_limit_override = 0xa000
  set zfs:zfs_vdev_max_pending = 5

ARC Size:
  Current Size:            8735 MB (arcsize)
  Target Size (Adaptive): 10240 MB (c)
  Min Size (Hard Limit):   1280 MB (zfs_arc_min)
  Max Size (Hard Limit):  10240 MB (zfs_arc_max)

ARC Size Breakdown:
  Most Recently Used Cache Size:   95%  9791 MB (p)
  Most Frequently Used Cache Size:  4%   448 MB (c-p)

ARC Efficency:
  Cache Access Total:       827767314
  Cache Hit Ratio:   96%    800123657  [Defined State for buffer]
  Cache Miss Ratio:   3%     27643657  [Undefined State for Buffer]
  REAL Hit Ratio:    89%    743665046  [MRU/MFU Hits Only]

  Data Demand Efficiency:   99%
  Data Prefetch Efficiency: 61%

  CACHE HITS BY CACHE LIST:
    Anon:                        5%   47497010              [ New Customer, First Cache Hit ]
    Most Recently Used:         33%  271365449 (mru)        [ Return Customer ]
    Most Frequently Used:       59%  472299597 (mfu)        [ Frequent Customer ]
    Most Recently Used Ghost:    0%    1700764 (mru_ghost)  [ Return Customer Evicted, Now Back ]
    Most Frequently Used Ghost:  0%    7260837 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]

  CACHE HITS BY DATA TYPE:
    Demand Data:       73%  589582518
    Prefetch Data:      2%   20424879
    Demand Metadata:   17%  139111510
    Prefetch Metadata:  6%   51004750

  CACHE MISSES BY DATA TYPE:
    Demand Data:       21%    5814459
    Prefetch Data:     46%   12788265
    Demand Metadata:   27%    7700169
    Prefetch Metada
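(The same hit/miss counters can be read from the raw kstats without arcstat.pl; a minimal sketch, and the exact set of arcstats fields depends on the ZFS build:)

# kstat -p zfs:0:arcstats:demand_data_hits zfs:0:arcstats:demand_data_misses
# kstat -p zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_data_misses

(Sampling these before and after each cpio pass shows whether the second pass turns into demand reads that miss, the "dmis becomes 100% of the misses" pattern described above.)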
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Ross Walker wrote:

> Have you tried limiting the ARC so it doesn't squash the page cache?

Yes, the ARC is limited to 10GB, leaving another 10GB for the OS and applications. Resource limits are not the problem. There is a ton of memory and CPU to go around. Current /etc/system tunables:

set maxphys = 0x2
set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5

> Make sure page cache has enough for mmap plus buffers for bouncing between it and the ARC. I would say 1GB minimum, 2 to be safe.

In this testing mmap is not being used (cpio does not use mmap), so the page cache is not an issue. It does become an issue for 'cp -r', though, where we see the I/O for impacted files reduced even further (substantially, and essentially permanently) until the filesystem is unmounted.

Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware. It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously. In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh Implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to: # ./zfs-cache-test.ksh Sun_2540 zfs create Sun_2540/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ... Done! zfs unmount Sun_2540/zfscachetest zfs mount Sun_2540/zfscachetest I've opened the following bug to track this issue: 6859997 zfs caching performance problem We need to track down if/when this problem was introduced or if it has always been there. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Mon, 13 Jul 2009, Joerg Schilling wrote:
> > cpio reads/writes in 8192 byte chunks from the filesystem.
>
> Yes, I was just reading the cpio manual page and see that. I think that re-reading the 128K zfs block 16 times to satisfy each request for 8192 bytes explains the 16X performance loss when caching is disabled. I don't think that this is strictly a bug since it is what the database folks are looking for.

cpio spends 1.6x more SYStem CPU time than star. This is probably mainly a result of the fact that cpio (when using the cpio archive format) reads/writes 512 byte blocks from/to the archive file.

cpio by default spends 19x more USER CPU time than star. This seems to be a result of the inappropriate header structure of the cpio archive format and the reblocking it requires, and it cannot be easily changed (well, you could use "scpio", in other words the "cpio" CLI personality of star, but this reduces the USER CPU time only by 10%-50% compared to Sun cpio).

cpio is a program from the past that does not fit well in our current world. The internal limits cannot be lifted without creating a new incompatible archive format. In other words: if you use cpio for your work, you have to live with its problems ;-)

If you like to play with different parameter values (e.g. read sizes), cpio is unsuitable for tests. Star allows you to set big filesystem read sizes by using the FIFO and playing with the fifo size, and small filesystem read sizes by switching off the FIFO and playing with the archive block size.

Jörg
-- EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
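(A minimal sketch of those two star experiments, assuming star's documented fs= FIFO-size, -no-fifo and bs= options; the dataset path is a placeholder:)

# cd /testpool/zfscachetest
# star -c fs=256m f=/dev/null .          # big FIFO, so the filesystem is read in large chunks
# star -c -no-fifo bs=8k f=/dev/null .   # FIFO off, reads happen at the 8k archive block size

(Comparing elapsed times for these against the cpio runs would help separate ZFS caching effects from cpio's own overhead.)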
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mike Gerdts wrote:
> Using cpio's -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn't modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) >/dev/null". At this point cpio was spending roughly 13% usr and 87% sys.

As mentioned before, a lot of the user CPU time from cpio is spent creating cpio archive headers, or is caused by the fact that cpio archives copy the file content to unaligned archive locations, while the "tar" archive format starts each new file on a modulo 512 offset in the archive. This requires a lot of unneeded copying of file data. You can of course slightly modify parameters even with cpio.

I am not sure what you mean by "13% usr and 87%", as star typically spends 6% of the wall clock time in user+sys CPU, where the user CPU time is typically only 1.5% of the system CPU time.

In the "cached" case, it is obviously ZFS that is responsible for the slowdown, regardless of what cpio did in the other case.

Jörg
-- EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote: > On Mon, 13 Jul 2009, Mike Gerdts wrote: > > > > Using cpio's -C option seems to not change the behavior for this bug, > > but I did see a performance difference with the case where I hadn't > > modified the zfs caching behavior. That is, the performance of the > > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >> /dev/null". At this point cpio was spending roughly 13% usr and 87% > > sys. > > Interesting. I just updated zfs-cache-test.ksh on my web site so that > it uses 131072 byte blocks. I see a tiny improvement in performance > from doing this, but I do see a bit less CPU consumption so the CPU > consumption is essentially zero. The bug remains. It seems best to > use ZFS's ideal block size so that issues don't get confused. If you continue to use cpio and the cpio archive format, you force copying a lot of data as the cpio archive format does use odd header sizes and starts new files "unaligned" directly after the archive header. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Jim Mauro wrote: Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I can not comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering. No, I have not filed a bug report yet. Any problem report to Sun's Service department seems to require at least one day's time. I was curious to see if recent OpenSolaris suffers from the same problem, but posted results (thus far) are not as conclusive as they are for Solaris 10. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 4:41 PM, Bob Friesenhahn wrote:
> On Mon, 13 Jul 2009, Jim Mauro wrote:
>> Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I can not comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering.
>
> No, I have not filed a bug report yet. Any problem report to Sun's Service department seems to require at least one day's time.
>
> I was curious to see if recent OpenSolaris suffers from the same problem, but posted results (thus far) are not as conclusive as they are for Solaris 10.

It doesn't seem to be quite as bad as S10, but there is certainly a hit.

# /var/tmp/zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (400 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
6400033 blocks
real 1m26.16s  user 0m12.83s  sys 0m25.88s

Doing second 'cpio -o > /dev/null'
6400033 blocks
real 2m44.46s  user 0m12.59s  sys 0m24.34s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

# cat /etc/release
OpenSolaris 2009.06 snv_111b SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 07 May 2009

# uname -srvp
SunOS 5.11 snv_111b sparc

-- Mike Gerdts http://mgerdts.blogspot.com/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mark Shellenbaum wrote: I've opened the following bug to track this issue: 6859997 zfs caching performance problem We need to track down if/when this problem was introduced or if it has always been there. I think that it has always been there as long as I have been using ZFS (1-3/4 years). Sometimes it takes a while for me to wake up and smell the coffee. Meanwhile I have opened a formal service request (IBIS 71326296) with Sun Support. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote:

> If you continue to use cpio and the cpio archive format, you force copying a lot of data as the cpio archive format does use odd header sizes and starts new files "unaligned" directly after the archive header.

Note that the output of cpio is sent to /dev/null in this test, so it is only the reading part which is significant as long as cpio's CPU use is low. Sun Service won't have a clue about 'star' since it is not part of Solaris 10. It is best to stick with what they know so the problem report won't be rejected.

If star is truly more efficient than cpio, it may make the difference even more obvious. What did you discover when you modified my test script to use 'star' instead?

Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob:

Sun v490, 4x1.35 GHz processors, 32GB RAM, Solaris 10u7, working with a raidz1 zpool made up of 6x146GB SAS drives on a J4200. Results of running your script:

# zfs-cache-test.ksh pool2
zfs create pool2/zfscachetest
Creating data file set (6000 files of 8192000 bytes) under /pool2/zfscachetest ...
Done!
zfs unmount pool2/zfscachetest
zfs mount pool2/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
96000512 blocks
real 5m32.58s  user 0m12.75s  sys 2m56.58s

Doing second 'cpio -C 131072 -o > /dev/null'
96000512 blocks
real 17m26.68s  user 0m12.97s  sys 4m34.33s

Feel free to clean up with 'zfs destroy pool2/zfscachetest'.
#

Same results as you are seeing.

Thanks
Randy
-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?
last question! I promise :) Google's not helped me much, but that's probably my keyword-ignorance. I have two USB HDD's, I want to swap them over, so there's one off-site and one plugged in, they get swapped over weekly. Not perfect, but sufficient for this site's risk assessment. They'll have the relevant ZFS snapshots sent to them. I assume then that if I plug one of these drives into another box that groks ZFS it'll see a filesystem and be able to access the files etc. I formatted two drives, 1TB drives. The tricky bit I think, is swapping them. I can mount one, and then send/recv to it, but what's the best way to automate the process of swapping the drives? A human has to physically switch them on & off and plug them in etc, but what's the process to do it in ZFS? Does each drive need a separate mountpoint? In the old UFS days I'd have just mounted them from an entry in (v)fstab in a cronjob and they'd be the same as far as everything was concerned, but with ZFS I'm a little confused. Can anyone here outline the procedure to do this assuming that the USB drives will be plugged into the same USB port (the server will be in a cabinet, the backup drives outside of it so they don't have to open the cabinet, and thus, bump things that don't like to be bumped!). Thankyou again for everyone's help. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
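(One common pattern for this, a minimal sketch rather than a recommendation: give each USB disk its own single-disk pool and import/export it around the transfer. Pool, dataset and device names below are placeholders, the snapshots are assumed to already exist, and a real cron job would need its own error handling:)

# one-time setup, run once per USB disk (this destroys whatever is on the disk):
zpool create backup1 c5t0d0          # first disk
zpool create backup2 c5t0d0          # second disk, plugged into the same port later

# weekly job:
zpool import -a                      # imports whichever backup pool is on the attached disk
dest=$(zpool list -H -o name | grep '^backup')
zfs send -i tank/data@lastweek tank/data@thisweek | zfs recv -F $dest/data
zpool export $dest                   # the disk can now be unplugged and taken off-site

(Because each pool is its own top-level dataset, no vfstab entry or fixed mountpoint is needed; each pool mounts itself at /backup1 or /backup2 when it is imported.)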
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ok, build 117 does seem a lot better. The second run is slower, but not by such a huge margin. This was the end of the 98GB test:

Creating data file set (12000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
192000985 blocks
real 26m17.80s  user 0m47.55s  sys 3m56.94s

Doing second 'cpio -o > /dev/null'
192000985 blocks
real 27m14.35s  user 0m46.84s  sys 4m39.85s

-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss