[zfs-discuss] Benchmarking Methodologies
I'm doing a little research study on ZFS benchmarking and performance profiling. Like most, I've had my favorite methods, but I'm re-evaluating my choices and trying to be a bit more scientific than I have in the past. To that end, I'm curious whether folks would mind sharing their work on the subject. What tool(s) do you prefer in what situations? Do you have a standard method of running them (tool args; block sizes, thread counts, ...) or procedures between runs (zpool import/export, new dataset creation, ...)? Any feedback is appreciated. I want to get a good sampling of opinions. Thanks! benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
On 4/21/10 2:15 AM, Robert Milkowski wrote: > I haven't heard from you in a while! Good to see you here again :) > > Sorry for stating obvious but at the end of a day it depends on what > your goals are. > Are you interested in micro-benchmarks and comparison to other file > systems? > > I think the most relevant filesystem benchmarks for users is when you > benchmark a specific application and present results from an > application point of view. For example, given a workload for Oracle, > MySQL, LDAP, ... how quickly it completes? How much benefit there is > by using SSDs? What about other filesystems? > > Micro-benchmarks are fine but very hard to be properly interpreted by > most users. > > Additionally most benchmarks are almost useless if they are not > compared to some other configuration with only a benchmarked component > changed. For example, knowing that some MySQL load completes in 1h on > ZFS is basically useless. But knowing that on the same HW with > Linux/ext3 and under the same load it completes in 2h would be > interesting to users. > > Other interesting thing would be to see an impact of different ZFS > setting on a benchmark results (aligned recordsize for database vs. > default, atime off vs. on, lzjb, gzip, ssd). Also comparison of > benchmark results with all default zfs setting compared to whatever > setting you did which gave you the best result.

Hey Robert... I'm always around. :) You've made an excellent case for benchmarking and where it's useful, but what I'm asking for on this thread is for folks to share the research they've done, with as much specificity as possible, for research purposes. :) Let me illustrate.

To Darren's point on FileBench and vdbench... to date I've found these two to be the most useful. IOzone, while very popular, has always given me strange results which are inconsistent regardless of how large the block and data sizes are. Given that the most important aspect of any benchmark is repeatability and sanity in results, I've found no value in IOzone any longer.

vdbench has become my friend, particularly in the area of physical disk profiling. Before tuning ZFS (or any filesystem) it's important to find a solid baseline of performance on the underlying disk structure. Using a variety of vdbench profiles such as the following helps you pinpoint exactly the edges of the performance envelope:

sd=sd1,lun=/dev/rdsk/c0t1d0s0,threads=1
wd=wd1,sd=sd1,readpct=100,rhpct=0,seekpct=0
rd=run1,wd=wd1,iorate=max,elapsed=10,interval=1,forxfersize=(4k-4096k,d)

With vdbench and the workload above I can get consistent, reliable results time after time, and the results on other systems match. This is particularly key if you're running a hardware RAID controller under ZFS. There isn't anything dd can do that vdbench can't do better. Using a workload like the above, both at differing xfer sizes and at differing thread counts, really helps give an accurate picture of the disk capabilities.

Moving up into the filesystem: I've been looking intently at improving my FileBench profiles, based on the supplied ones with tweaking. I'm trying to get to a methodology that provides me with time-after-time repeatable results for real comparison between systems. I'm looking hard at vdbench file workloads, but they aren't yet nearly as sophisticated as FileBench. I am also looking at FIO (http://freshmeat.net/projects/fio/), which is FileBench-esque. At the end of the day, I agree entirely that application benchmarks are far more effective judges...
but they are also more time-consuming and less flexible than dedicated tools. The key is honing generic benchmarks to provide useful data which can be relied upon for making accurate estimates about application performance. When you start judging filesystem performance based on something like MySQL there are simply too many variables involved. So, I appreciate the Benchmark 101, but I'm looking for anyone interested in sharing meat. Most of the existing ZFS benchmarks folks have published are several years old now, and most were using IOzone. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
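[Editor's note] As a footnote to the vdbench discussion above: the same raw-disk profile can be made to sweep thread counts as well as transfer sizes in one run. This is only a sketch; the device path is a placeholder and the forthreads/forxfersize syntax should be checked against the vdbench release in use:

sd=sd1,lun=/dev/rdsk/c0t1d0s0,threads=32
wd=wd1,sd=sd1,readpct=100,rhpct=0,seekpct=0
rd=run1,wd=wd1,iorate=max,elapsed=30,interval=1,forxfersize=(4k-4096k,d),forthreads=(1-32,d)

Each (xfersize, threads) combination then becomes its own run, which makes it easy to see where the device's IOPS and bandwidth flatten out.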
Re: [zfs-discuss] Plugging in a hard drive after Solaris has booted up?
On 5/7/10 9:38 PM, Giovanni wrote:
> Hi guys,
>
> I have a quick question, I am playing around with ZFS and here's what I did.
>
> I created a storage pool with several drives. I unplugged 3 out of 5 drives from the array, currently:
>
> NAME      STATE     READ WRITE CKSUM
> gpool     UNAVAIL   0 0 0  insufficient replicas
>   raidz1  UNAVAIL   0 0 0  insufficient replicas
>     c8t2d0  UNAVAIL  0 0 0  cannot open
>     c8t4d0  UNAVAIL  0 0 0  cannot open
>     c8t0d0  UNAVAIL  0 0 0  cannot open
>
> These drives had power all the time, the SATA cable however was disconnected. Now, after I logged into Solaris and opened firefox, I plugged them back in to sit and watch if the storage pool suddenly becomes "available"
>
> This did not happen, so my question is, do I need to make Solaris re-detect the hard drives and if so how? I tried format -e but it did not seem to detect the 3 drives I just plugged back in. Is this a BIOS issue?
>
> Does hot-swap hard drives only work when you replace current hard drives (previously detected by BIOS) with others but not when you have ZFS/Solaris running and want to add more storage without shutting down?
>
> It all boils down to, say the scenario is that I will need to purchase more hard drives as my array grows, I would like to be able to (without shutting down) add the drives to the storage pool (zpool)

There are lots of different things you can look at and do, but it comes down to just one command: "devfsadm -vC". This will clean up (-C for cleanup, -v for verbose) the device tree if it gets into a funky state. Then run "format" or "iostat -En" to verify that the device(s) are there. Then re-import the zpool, add the device, or whatever you wish to do. Even if device locations change, ZFS will do the right thing on import. If you wish to dig deeper... normally, when you attach a new device, hot-plug will do the right thing and you'll see the connection messages in "dmesg". If you want to explicitly check the state of dynamic reconfiguration, check out the "cfgadm" command. Normally, however, on modern versions of Solaris there is no reason to resort to that; it's just something fun if you wish to dig. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
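[Editor's note] A minimal sketch of the sequence described above, using the pool name from the original post (a sketch only; adjust names to taste):

devfsadm -vC          # clean up and rebuild the device tree
iostat -En            # or run: format   (confirm the re-attached disks are visible)
zpool import          # list pools that are now importable
zpool import gpool    # bring the pool back once all devices are present

If the drives still don't show up, cfgadm -al shows the state of each port and whether the OS has configured the attachment point.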
Re: [zfs-discuss] Mirrored Servers
On 5/8/10 3:07 PM, Tony wrote: > Lets say I have two servers, both running opensolaris with ZFS. I basically > want to be able to create a filesystem where the two servers have a common > volume, that is mirrored between the two. Meaning, each server keeps an > identical, real time backup of the other's data directory. Set them both up > as file servers, and load balance between the two for incoming requests. > > How would anyone suggest doing this?

I would carefully consider whether or not they _really_ need to be real-time. Can you tolerate 5 minutes, or even just 60 seconds, of difference between them? If you can, then things are much easier and less complex. I'd personally use ZFS snapshots to keep the two servers in sync every 60 seconds. As for load balancing, that depends on which protocol you're using. FTP is easy; NFS/CIFS is a little harder. I'd simply use a load balancer (Zeus, NetScaler, Balance, HA-Proxy, etc.), but that is a little scary and bizarre in the case of NFS/CIFS, where you should instead use a single-server failover solution such as Sun Cluster. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
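[Editor's note] To make the snapshot approach concrete, a rough sketch of a 60-second sync loop using zfs send/receive over ssh. The pool and dataset names are made up; the initial full send, snapshot naming hygiene, and error handling are all omitted:

# runs on the primary; $LAST holds the name of the previous snapshot
NOW=sync-`date +%s`
zfs snapshot tank/data@$NOW
zfs send -i tank/data@$LAST tank/data@$NOW | ssh node2 zfs recv -F tank/data
zfs destroy tank/data@$LAST
LAST=$NOW

Note that the receiving dataset must not be modified locally between updates, or the next incremental receive will fail (hence the -F to force a rollback on the receiver).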
Re: [zfs-discuss] ZFS Hard disk buffer at 100%
The drive (c7t2d0) is bad and should be replaced. The second drive (c7t5d0) is either bad or going bad. This is exactly the kind of problem that can force a Thumper to its knees: ZFS performance is horrific, and as soon as you drop the bad disks things magically return to normal. My first recommendation is to pull the SMART data from the disks if you can. I wrote a blog entry about SMART back in 2008 to address exactly the behavior you're seeing: http://www.cuddletech.com/blog/pivot/entry.php?id=993 Yes, people will claim that SMART data is useless for predicting failures, but in a case like yours you are just looking for data to corroborate a hypothesis. In order to test this condition, "zpool offline..." c7t2d0, which emulates removal, and see if performance improves. On Thumpers I'd build a list of "suspect disks" based on 'iostat', like you show, then correlate the SMART data, and then systematically offline disks to see if it really was the problem. In my experience the only other reason you'll legitimately see really weird "bottoming out" of IO like this is if you hit the max concurrent IO limit in ZFS (until recently that limit was 35), so you'd see actv=35, and then when the device finally processed the IOs the thing would snap back to life. But even in those cases you shouldn't see request times (asvc_t) rise above 200ms. All that to say, replace those disks or at least test it. SSDs won't help; one or more drives are toast. benr.

On 5/8/10 9:30 PM, Emily Grettel wrote:
> Hi Giovani,
>
> Thanks for the reply.
>
> Here's a bit of iostat after uncompressing a 2.4Gb RAR file that has 1 DWF file that we use.
>
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 1.0 13.0 26.0 18.0 0.0 0.00.00.8 0 1 c7t1d0
> 2.05.0 77.0 12.0 2.4 1.0 343.8 142.8 100 100 c7t2d0
> 1.0 16.0 25.5 15.5 0.0 0.00.00.3 0 0 c7t3d0
> 0.0 10.00.0 17.0 0.0 0.03.21.2 1 1 c7t4d0
> 1.0 12.0 25.5 15.5 0.4 0.1 32.4 10.9 14 14 c7t5d0
> 1.0 15.0 25.5 18.0 0.0 0.00.10.1 0 0 c0t1d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 0.00.00.00.0 2.0 1.00.00.0 100 100 c7t2d0
> 1.00.00.50.0 0.0 0.00.00.1 0 0 c7t0d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 5.0 15.0 128.0 18.0 0.0 0.00.01.8 0 3 c7t1d0
> 1.09.0 25.5 18.0 2.0 1.8 199.7 179.4 100 100 c7t2d0
> 3.0 13.0 102.5 14.5 0.0 0.10.05.2 0 5 c7t3d0
> 3.0 11.0 102.0 16.5 0.0 0.12.34.2 1 6 c7t4d0
> 1.04.0 25.52.0 0.4 0.8 71.3 158.9 12 79 c7t5d0
> 5.0 16.0 128.5 19.0 0.0 0.10.12.6 0 5 c0t1d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 0.04.00.02.0 2.0 2.0 496.1 498.0 99 100 c7t2d0
> 0.00.00.00.0 0.0 1.00.00.0 0 100 c7t5d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 7.00.0 204.50.0 0.0 0.00.00.2 0 0 c7t1d0
> 1.00.0 25.50.0 3.0 1.0 2961.6 1000.0 99 100 c7t2d0
> 8.00.0 282.00.0 0.0 0.00.00.3 0 0 c7t3d0
> 6.00.0 282.50.0 0.0 0.06.12.3 1 1 c7t4d0
> 0.03.00.05.0 0.5 1.0 165.4 333.3 18 100 c7t5d0
> 7.00.0 204.50.0 0.0 0.00.01.6 0 1 c0t1d0
> 2.02.0 89.0 12.0 0.0 0.03.16.1 1 2 c3t0d0
> 0.02.00.0 12.0 0.0 0.00.00.2 0 0 c3t1d0
>
> Sometimes two or more disks are going at 100. How does one solve this issue if its a firmware bug? I tried looking around for Western Digital Firmware for WD10EADS but couldn't find any available.
>
> Would adding an SSD or two help here?
> > Thanks, > Em > > > Date: Fri, 7 May 2010 14:38:25 -0300 > Subject: Re: [zfs-discuss] ZFS Hard disk buffer at 100% > From: gtirl...@sysdroid.com > To: emilygrettelis...@hotmail.com > CC: zfs-discuss@opensolaris.org > > > On Fri, May 7, 2010 at 8:07 AM, Emily Grettel > mailto:emilygrettelis...@hotmail.com>> > wrote: > > Hi, > > I've had my RAIDz volume working well on SNV_131 but it has come > to my attention that there has been some read issues with the > drives. Previously I thought this was a CIFS problem but I'm > noticing that when transfering files or uncompressing some fairly > large 7z (1-2Gb) files (or even smaller rar - 200-300Mb) files > occasionally running iostat will give th
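[Editor's note] To make the offline test described above concrete, a sketch assuming the pool is named tank and is raidz, so a single disk can safely be taken away:

zpool offline tank c7t2d0    # emulate pulling the suspect disk
iostat -xn 5                 # watch whether asvc_t and %b on the remaining disks return to sane values
zpool online tank c7t2d0     # put it back, or 'zpool replace tank c7t2d0 <new-disk>' if it's confirmed bad

If the stalls disappear with the disk offlined, that is the corroborating data point to go with the SMART attributes.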
Re: [zfs-discuss] Opensolaris is apparently dead
On 8/13/10 9:02 PM, "C. Bergström" wrote: > Erast wrote: >> >> >> On 08/13/2010 01:39 PM, Tim Cook wrote: >>> http://www.theregister.co.uk/2010/08/13/opensolaris_is_dead/ >>> >>> I'm a bit surprised at this development... Oracle really just doesn't >>> get it. The part that's most disturbing to me is the fact they >>> won't be >>> releasing nightly snapshots. It appears they've stopped Illumos in its >>> tracks before it really even got started (perhaps that explains the >>> timing of this press release) >> >> Wrong. Be patient, with the pace of current Illumos development it >> soon will have all the closed binaries liberated and ready to sync up >> with promised ON code drops as dictated by GPL and CDDL licenses. > Illumos is just a source tree at this point. You're delusional, > misinformed, or have some big wonderful secret if you believe you have > all the bases covered for a pure open source distribution though.. > > What's closed binaries liberated really mean to you? > > Does it mean >a. You copy over the binary libCrun and continue to use some > version of Sun Studio to build onnv-gate >b. You debug the problems with and start to use ancient gcc-3 (at > the probably expense of performance regressions which most people > would find unacceptable) >c. Your definition is narrow and has missed some closed binaries > > > I think it's great people are still hopeful, working hard and going to > steward this forward, but I wonder.. What pace are you referring to? > The last commit to illumos-gate was 6 days ago and you're already not > even keeping it in sync.. Can you even build it yet and if so where's > the binaries? Illumos is 2 weeks old. Lets cut it a little slack. :) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
On 8/14/10 1:12 PM, Frank Cusack wrote: > > Wow, what leads you guys to even imagine that S11 wouldn't contain > comstar, etc.? *Of course* it will contain most of the bits that > are current today in OpenSolaris.

That's a very good question actually. I would think that COMSTAR would stay because it's used by the Fishworks appliance... however, COMSTAR is a competitive advantage for DIY storage solutions. Maybe they will rip it out of S11 and make it an add-on or something. That would suck. I guess the only real reason you can't yank COMSTAR is because it's now the basis for iSCSI Target support. But again, there is nothing saying that Target support has to be part of the standard OS offering. Scary to think about. :) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
If you're still having issues go into the BIOS and disable C-States, if you haven't already. It is responsible for most of the problems with 11th Gen PowerEdge. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to delete hundreds of empty snapshots
zfs list is mighty slow on systems with a large number of objects, but there is no foreseeable plan that I'm aware of to solve that "problem". Nevertheless, you need to do a zfs list, so do it once and work from that:

zfs list > /tmp/zfs.out
for i in `awk '/mydataset@/ {print $1}' /tmp/zfs.out`; do zfs destroy $i; done

As for 5-minute snapshots, this is NOT a bad idea. It is, however, complex to manage, so you need to employ tactics to make it more digestible. Ask yourself first why you want 5-minute snaps. Is it replication? If so, create the snapshot, replicate it, then destroy all but the last snapshot, or rotate them. Or is it a fallback in case you make a mistake? Then just keep around the last 6 snapshots or so (see the sketch below). zfs rename & zfs destroy are your friends; use them wisely. :) If you want to discuss exactly what you're trying to facilitate I'm sure we can come up with some more concrete ideas to help you. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
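[Editor's note] A sketch of the "keep the last 6" rotation. Snapshot names are assumed to sort in creation order, the tank/mydataset name is made up, and option availability (zfs list -t snapshot, -H, -o) should be checked on older builds:

KEEP=6
zfs list -H -o name -t snapshot | grep '^tank/mydataset@' | sort > /tmp/snaps.out
COUNT=`wc -l < /tmp/snaps.out`
DEL=`expr $COUNT - $KEEP`
if [ "$DEL" -gt 0 ]; then
    head -$DEL /tmp/snaps.out | while read SNAP; do
        zfs destroy "$SNAP"
    done
fi

Run from the same cron job that takes the 5-minute snapshots, this keeps the snapshot set bounded.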
[zfs-discuss] ARCSTAT Kstat Definitions
Would someone "in the know" be willing to write up (preferably blog) definitive definitions/explanations of all the arcstats provided via kstat? I'm struggling with proper interpretation of certain values, namely "p", "memory_throttle_count", and the mru/mfu+ghost hit vs demand/prefetch hit counters. I think I've got it figured out, but I'd really like expert clarification before I start tweaking. Thanks. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARCSTAT Kstat Definitions
Thanks, not as much as I was hoping for but still extremely helpful. Can you, or others, have a look at this: http://cuddletech.com/arc_summary.html

This is a Perl script that uses kstats to drum up a report such as the following:

System Memory:
        Physical RAM:  32759 MB
        Free Memory :  10230 MB
        LotsFree:      511 MB

ARC Size:
        Current Size:             7989 MB (arcsize)
        Target Size (Adaptive):   8192 MB (c)
        Min Size (Hard Limit):    1024 MB (zfs_arc_min)
        Max Size (Hard Limit):    8192 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    13%  1087 MB (p)
        Most Frequently Used Cache Size:  86%  7104 MB (c-p)

ARC Efficency:
        Cache Access Total:   3947194710
        Cache Hit Ratio:      99%  3944674329
        Cache Miss Ratio:      0%  2520381
        Data Demand Efficiency:    99%
        Data Prefetch Efficiency:  69%

        CACHE HITS BY CACHE LIST:
          Anon:                        0%  16730069
          Most Frequently Used:       99%  3915830091 (mfu)
          Most Recently Used:          0%  10490502 (mru)
          Most Frequently Used Ghost:  0%  439554 (mfu_ghost)
          Most Recently Used Ghost:    0%  1184113 (mru_ghost)

        CACHE HITS BY DATA TYPE:
          Demand Data:        99%  3914527790
          Prefetch Data:       0%  2447831
          Demand Metadata:     0%  10709326
          Prefetch Metadata:   0%  16989382

        CACHE MISSES BY DATA TYPE:
          Demand Data:        45%  1144679
          Prefetch Data:      42%  1068975
          Demand Metadata:     5%  132649
          Prefetch Metadata:   6%  174078

Feedback and input is welcome, in particular if I'm mischaracterizing data. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARCSTAT Kstat Definitions
It's a starting point anyway. The key is to try and draw useful conclusions from the info to answer the torrent of "why is my ARC 30GB???" questions. There are several things I'm unclear on whether or not I'm properly interpreting, such as:

* As you state, the anon pages. Even the comment in the code is, to me anyway, a little vague. I include them because otherwise you look at the hit counters and wonder where a large chunk of them went.

* Prefetch... I want to use the Prefetch Data hit ratio as a judgment call on the efficiency of prefetch. If the value is very low it might be best to turn it off, but I'd like to hear that from someone else before I go saying it. In high-latency environments, such as ZFS on iSCSI, prefetch can either significantly help or hurt, and determining which is difficult without some type of metric like the one above.

* There are several instances (based on dtracing) in which the ARC is bypassed... for the ZIL I understand; in some other cases I need to spend more time analyzing the DMU (dbuf_*) to see why.

* In answering the "Is having a 30GB ARC good?" question, I want to say that if MFU is >60% of the ARC, and the hits are mostly MFU, then you are deriving significant benefit from your large ARC. But on a system with a 2GB ARC or a 30GB ARC the overall hit ratio tends to be 99%, which is nuts and tends to reinforce a misinterpretation of anon hits.

The only way I'm seeing to _really_ understand ARC's efficiency is to look at the overall number of reads, then how many are intercepted by the ARC, how many actually made it to disk, and why (prefetch or demand). This is tricky to implement via kstats because you have to pick out and monitor the zpool disks themselves. I've spent a lot of time in this code (arc.c) and still have a lot of questions. I really wish there was an "Advanced ZFS Internals" talk coming up; I simply can't keep spending so much time on this. Feedback from PAE or other tuning experts is welcome and appreciated. :) benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
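[Editor's note] One small example of the kind of derived metric discussed above: demand-data hit rate computed separately from prefetch, straight from the arcstats kstat names used in arc.c (a sketch; run in ksh or bash, instance number may differ):

HITS=`kstat -p zfs:0:arcstats:demand_data_hits | awk '{print $2}'`
MISS=`kstat -p zfs:0:arcstats:demand_data_misses | awk '{print $2}'`
echo "demand data hit ratio: $(( HITS * 100 / (HITS + MISS) ))%"

Numbers broken out per access type like this are much harder to misread than the single overall hit ratio.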
Re: [zfs-discuss] ARCSTAT Kstat Definitions
New version is available (v0.2) : * Fixes divide by zero, * includes tuning from /etc/system in output * if prefetch is disabled I explicitly say so. * Accounts for jacked anon count. Still need improvement here. * Added friendly explanations for MRU/MFU & Ghost lists counts. Page and examples are updated: cuddletech.com/arc_summary.pl Still needs work, but hopefully interest in this will stimulate some improved understanding of ARC internals. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Lost Disk Space
I've been struggling to fully understand why disk space seems to vanish. I've dug through bits of code and reviewed all the mails on the subject that I can find, but I still don't have a proper understanding of what's going on. I did a test with a local zpool on snv_97... zfs list, zpool list, and zdb all seem to disagree on how much space is available. In this case it's only a discrepancy of about 20G or so, but I've got Thumpers that have a discrepancy of over 6TB! Can someone give a really detailed explanation about what's going on?

block traversal size 670225837056 != alloc 720394438144 (leaked 50168601088)

bp count:       15182232
bp logical:     672332631040  avg: 44284
bp physical:    669020836352  avg: 44066  compression: 1.00
bp allocated:   670225837056  avg: 44145  compression: 1.00
SPA allocated:  720394438144  used: 96.40%

Blocks  LSIZE  PSIZE  ASIZE  avg  comp  %Total  Type
12  120K  26.5K  79.5K  6.62K  4.53  0.00  deferred free
1  512  512  1.50K  1.50K  1.00  0.00  object directory
3  1.50K  1.50K  4.50K  1.50K  1.00  0.00  object array
1  16K  1.50K  4.50K  4.50K  10.67  0.00  packed nvlist
-  -  -  -  -  -  -  packed nvlist size
72  8.45M  889K  2.60M  37.0K  9.74  0.00  bplist
-  -  -  -  -  -  -  bplist header
-  -  -  -  -  -  -  SPA space map header
974  4.48M  2.65M  7.94M  8.34K  1.70  0.00  SPA space map
-  -  -  -  -  -  -  ZIL intent log
96.7K  1.51G  389M  777M  8.04K  3.98  0.12  DMU dnode
17  17.0K  8.50K  17.5K  1.03K  2.00  0.00  DMU objset
-  -  -  -  -  -  -  DSL directory
13  6.50K  6.50K  19.5K  1.50K  1.00  0.00  DSL directory child map
12  6.00K  6.00K  18.0K  1.50K  1.00  0.00  DSL dataset snap map
14  38.0K  10.0K  30.0K  2.14K  3.80  0.00  DSL props
-  -  -  -  -  -  -  DSL dataset
-  -  -  -  -  -  -  ZFS znode
2  1K  1K  2K  1K  1.00  0.00  ZFS V0 ACL
5.81M  558G  557G  557G  95.8K  1.00  89.27  ZFS plain file
382K  301M  200M  401M  1.05K  1.50  0.06  ZFS directory
9  4.50K  4.50K  9.00K  1K  1.00  0.00  ZFS master node
12  482K  20.0K  40.0K  3.33K  24.10  0.00  ZFS delete queue
8.20M  66.1G  65.4G  65.8G  8.03K  1.01  10.54  zvol object
1  512  512  1K  1K  1.00  0.00  zvol prop
-  -  -  -  -  -  -  other uint8[]
-  -  -  -  -  -  -  other uint64[]
-  -  -  -  -  -  -  other ZAP
-  -  -  -  -  -  -  persistent error log
1  128K  10.5K  31.5K  31.5K  12.19  0.00  SPA history
-  -  -  -  -  -  -  SPA history offsets
-  -  -  -  -  -  -  Pool properties
-  -  -  -  -  -  -  DSL permissions
-  -  -  -  -  -  -  ZFS ACL
-  -  -  -  -  -  -  ZFS SYSACL
-  -  -  -  -  -  -  FUID table
-  -  -  -  -  -  -  FUID table size
5  3.00K  2.50K  7.50K  1.50K  1.20  0.00  DSL dataset next clones
-  -  -  -  -  -  -  scrub work queue
14.5M  626G  623G  624G  43.1K  1.00  100.00  Total

real    21m16.862s
user    0m36.984s
sys     0m5.757s

===

Looking at the data:

[EMAIL PROTECTED] ~$ zfs list backup && zpool list backup
NAME     USED  AVAIL  REFER  MOUNTPOINT
backup   685G   237K    27K  /backup
NAME     SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
backup   696G   671G  25.1G   96%  ONLINE  -

So zdb says 626GB is used, zfs list says 685GB is used, and zpool list says 671GB is used. The pool was filled to 100% capacity via dd (this is confirmed; I can't write data), and yet zpool list says it's only 96%. benr. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Lost Disk Space
No takers? :) benr. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zdb to dump data
Is there some hidden way to coax zdb into not just displaying data based on a given DVA but rather to dump it in raw usable form? I've got a pool with large amounts of corruption. Several directories are toast and I get "I/O Error" when trying to enter or read the directory... however I can read the directory and files using ZDB, if I could just dump it in a raw format I could do recovery that way. To be clear, I've already recovered from the situation, this is purely an academic "can I do it" exercise for the sake of learning. If ZDB can't do it, I'd assume I'd have to write some code to read based on DVA. Maybe I could write a little tool for it. benr. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
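[Editor's note] For what it's worth, later zdb builds do have a raw block-read mode that may cover this. A hedged sketch only: the flag letters vary by build and the DVA below is a placeholder, not a value from this pool:

# zdb -R <pool> <vdev>:<offset>:<size>[:flags]
zdb -R tank 0:36000:200:r > /tmp/block.raw    # 'r' requests the raw on-disk block on builds that support it

Whether the output comes back decompressed or as the literal on-disk bytes depends on which flags that build supports, so treat this as a starting point rather than a recipe.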
Re: [zfs-discuss] [Fwd: Re: [perf-discuss] ZFS performance issue - READ is slow as hell...]
Ya, I agree that we need some additional data and testing. The iostat data in itself doesn't suggest to me that the process (dd) is slow but rather that most of the data is being retrieved elsewhere (ARC). An fsstat would be useful to correlate with the iostat data. One thing that also comes to mind with streaming write performance is the effects of the write throttle... curious if he'd have gotten more on the write side with that disabled. All these things don't strike me particularly as bugs (although there is always improvement) but rather that ZFS is designed for real world environments, not antiquated benchmarks. benr. Jim Mauro wrote: > > Posting this back to zfs-discuss. > > Roland's test case (below) is a single threaded sequential write > followed by a single threaded sequential read. His bandwidth > goes from horrible (~2MB/sec) to expected (~30MB/sec) > when prefetch is disabled. This is with relatively recent nv bits > (nv110). > > Roland - I'm wondering if you were tripping over > CR6732803 ZFS prefetch creates performance issues for streaming > workloads. > It seems possible, but that CR is specific about multiple, concurrent > IO streams, > and your test case was only one. > > I think it's more likely you were tripping over > CR6412053 zfetch needs a whole lotta love. > > For both CR's the workaround is disabling prefetch > (echo "zfs_prefetch_disable/W 1" | mdb -kw) > > Any other theories on this test case? > > Thanks, > /jim > > > Original Message > Subject: Re: [perf-discuss] ZFS performance issue - READ is slow > as hell... > Date: Tue, 31 Mar 2009 02:33:00 -0700 (PDT) > From: roland > To: perf-disc...@opensolaris.org > > > > Hello Jim, > i double checked again - but it`s like i told: > > echo zfs_prefetch_disable/W0t1 | mdb -kw > fixes my problem. > > i did a reboot and only set this single param - which immediately > makes the read troughput go up from ~2 MB/s to ~30 MB/s > >> I don't understand why disabling ZFS prefetch solved this >> problem. The test case was a single threaded sequential write, followed >> by a single threaded sequential read. 
> > i did not even do a single write - after reboot i just did > dd if=/zfs/TESTFILE of=/dev/null > > Solaris Express Community Edition snv_110 X86 > FSC RX300 S2 > 4GB RAM > LSI Logic MegaRaid 320 Onboard SCSI Raid Controller > 1x Raid1 LUN > 1x Raid5 LUN (3 Disks) > (both LUN`s show same behaviour) > > > before: > extended device statistics > r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 21.30.1 2717.60.1 0.7 0.0 31.81.7 2 4 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 16.00.0 2048.40.0 34.9 0.1 2181.84.8 100 3 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 28.00.0 3579.20.0 34.8 0.1 1246.24.9 100 5 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 45.00.0 5760.40.0 34.8 0.2 772.74.5 100 7 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 19.00.0 2431.90.0 34.9 0.1 1837.34.4 100 3 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 58.00.0 7421.10.0 34.6 0.3 597.45.8 100 12 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > > > after: > extended device statistics > r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 218.00.0 27842.30.0 0.0 0.40.11.8 1 40 c0t1d0 > 241.00.0 30848.00.0 0.0 0.40.01.6 0 38 c0t1d0 > 237.00.0 30340.10.0 0.0 0.40.01.6 0 38 c0t1d0 > 230.00.0 29434.70.0 0.0 0.40.01.8 0 40 c0t1d0 > 238.10.0 30471.30.0 0.0 0.40.01.5 0 37 c0t1d0 > 234.90.0
Re: [zfs-discuss] zfs / nfs issue (not performance :-) with courier-imap
Robert Milkowski wrote: CLSNL> but if I click, say E, it has F's contents, F has Gs contents, and no CLSNL> mail has D's contents that I can see. But the list in the mail CLSNL> client list view is correct. I don't belive it's a problem with nfs/zfs server. Please try with simple dtrace script to see (or even truss) what files your imapd actually opens when you click E - I don't belive it opens E and you get F contents, I would bet it opens F.

I completely agree with Robert. I'd personally suggest 'truss' to start, because it's trivial to use, and then move to DTrace to further hone down the problem. In the case of Courier-IMAP the best way to go about it would be to truss the parent (courierlogger, which calls courierlogin and ultimately imapd) using 'truss -f -p '. Then open the mailbox and watch the stat() and open() calls closely. I'll be very interested in your findings. We use Courier on NFS/ZFS heavily and I'm thankful to report having no such problems. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
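[Editor's note] Once truss has narrowed things down, a DTrace one-liner along these lines will show every path imapd opens (a sketch; adjust the execname to match the courier processes in question):

dtrace -n 'syscall::open*:entry /execname == "imapd"/ { printf("%d %s\n", pid, copyinstr(arg0)); }'

Comparing that list against what the client thinks it asked for should settle whether the wrong file is being opened or the right file has the wrong contents.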
[zfs-discuss] Read Only Zpool: ZFS and Replication
I've been playing with replication of a ZFS zpool using the recently released AVS. I'm pleased with things, but just replicating the data is only part of the problem. The big question is: can I have a zpool open in two places? What I really want is a zpool on node1 open and writable (production storage) and replicated to node2, where it's open for read-only access (standby storage). This is an old problem, and I'm not sure it's remotely possible. It's bad enough with UFS, but ZFS maintains a hell of a lot more metadata. How is node2 supposed to know that a snapshot has been created, for instance? With UFS you can at least get around some of these problems using directio, but that's not an option with a zpool. I know this is a fairly remedial issue to bring up... but if I think about what I want Thumper-to-Thumper replication to look like, I want two usable storage systems. As I see it now, the secondary storage (node2) is useless until you break replication and import the pool, do your thing, and then re-sync storage to re-enable replication. Am I missing something? I'm hoping there is an option I'm not aware of. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] snapdir visible recursively throughout a dataset
Is there an existing RFE for, what I'll wrongly call, "recursively visible snapshots"? That is, .zfs in directories other than the dataset root. Frankly, I don't need it available in all directories, although that would be nice, but I do have a need for making it visible one directory down from the dataset root. The problem is that while ZFS and Zones work smoothly together for moving, cloning, sizing, etc., you can't view .zfs/ from within the zone because the zone root is one directory down:

/zones                 <-- Dataset
/zones/myzone01        <-- Dataset, .zfs is located here.
/zones/myzone01/root   <-- Directory, want .zfs here!

The ultimate idea is to make ZFS snapdirs accessible from within the zone. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
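[Editor's note] For reference, the existing snapdir property only controls visibility at the dataset root, which is exactly why it doesn't help here; shown as a sketch:

zfs set snapdir=visible zones/myzone01    # exposes /zones/myzone01/.zfs, but nothing appears under /zones/myzone01/root

Hence the RFE: some way to surface .zfs (or a view of it) below the dataset root so the zone can reach it.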
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Jim Dunham wrote: Robert, Hello Ben, Monday, February 5, 2007, 9:17:01 AM, you wrote: BR> I've been playing with replication of a ZFS Zpool using the BR> recently released AVS. I'm pleased with things, but just BR> replicating the data is only part of the problem. The big BR> question is: can I have a zpool open in 2 places? BR> What I really want is a Zpool on node1 open and writable BR> (production storage) and a replicated to node2 where its open for BR> read-only access (standby storage). BR> This is an old problem. I'm not sure its remotely possible. Its BR> bad enough with UFS, but ZFS maintains a hell of a lot more BR> meta-data. How is node2 supposed to know that a snapshot has been BR> created for instance. With UFS you can at least get by some of BR> these problems using directio, but thats not an option with a zpool. BR> I know this is a fairly remedial issue to bring up... but if I BR> think about what I want Thumper-to-Thumper replication to look BR> like, I want 2 usable storage systems. As I see it now the BR> secondary storage (node2) is useless untill you break replication BR> and import the pool, do your thing, and then re-sync storage to re-enable replication. BR> Am I missing something? I'm hoping there is an option I'm not aware of. You can't mount rw on one node and ro on another (not to mention that zfs doesn't offer you to import RO pools right now). You can mount the same file system like UFS in RO on both nodes but not ZFS (no ro import). One can not just mount a filesystem in RO mode if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, expect of course shared-reader QFS, this will fail given time. Even if one has the means to mount a filesystem with DIRECTIO (no-caching), READ-ONLY (no-writes), it does not prevent a filesystem from looking at the contents of block "A" and then acting on block "B". The reason being is that during replication at time T1 both blocks "A" & "B" could be written and be consistent with each other. Next the file system reads block "A". Now replication at time T2 updates blocks "A" & "B", also consistent with each other. Next the file system reads block "B" and panics due to an inconsistency only it sees between old "A" and new "B". I know this for a fact, since a forced "zpool import -f ", is a common instance of this exact failure, due most likely checksum failures between metadata blocks "A" & "B". Ya, that bit me last night. 
'zpool import' shows the pool fine, but when you force the import you panic: Feb 5 07:14:10 uma ^Mpanic[cpu0]/thread=fe8001072c80: Feb 5 07:14:10 uma genunix: [ID 809409 kern.notice] ZFS: I/O failure (write on off 0: zio fe80c54ed380 [L0 unallocated] 400L/200P DVA[0]=<0:36000:200> DVA[1]=<0:9c0003800:200> DVA[2]=<0:20004e00:200> fletcher4 lzjb LE contiguous birth=57416 fill=0 cksum=de2e56ffd:5591b77b74b:1101a91d58dfc:252efdf22532d0): error 5 Feb 5 07:14:11 uma unix: [ID 10 kern.notice] Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a40 zfs:zio_done+140 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ab0 zfs:zio_wait_for_children+5d () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ad0 zfs:zio_wait_children_done+20 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072af0 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b40 zfs:zio_vdev_io_assess+129 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bb0 zfs:vdev_mirror_io_done+2af () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bd0 zfs:zio_vdev_io_done+26 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c60 genunix:taskq_thread+1a7 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c70 unix:thread_start+8 () Feb 5 07:14:11 uma unix: [ID 10 kern.notice] So without using II, whats the best method of bring up the secondary storage? Is just dropping the primary into logging acceptable? benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snapdir visible recursively throughout a dataset
Robert Milkowski wrote: I haven't tried it but what if you mounted ro via loopback into a zone /zones/myzone01/root/.zfs is loop mounted in RO to /zones/myzone01/.zfs

That is so wrong. ;) Besides just being evil, I doubt it'd work. And if it does, it probably shouldn't. I think I'm the only one that gets a rash when using LOFI. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snapdir visible recursively throughout a dataset
Darren J Moffat wrote: Ben Rockwood wrote: Robert Milkowski wrote: I haven't tried it but what if you mounted ro via loopback into a zone /zones/myzone01/root/.zfs is loop mounted in RO to /zones/myzone01/.zfs That is so wrong. ;) Besides just being evil, I doubt it'd work. And if it does, it probly shouldn't. I think I'm the only one that gets a rash when using LOFI. lofi or lofs ? lofi - Loopback file driver Makes a block device from a file lofs - loopback virtual file system Makes a file system from a file system Yes, I know. I was referring more so to loopback happy people in general. :) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making 'zfs destroy' safer
Peter Schuller wrote: Hello, with the advent of clones and snapshots, one will of course start creating them. Which also means destroying them. Am I the only one who is *extremely* nervous about doing "zfs destroy some/[EMAIL PROTECTED]"? This goes bot manually and automatically in a script. I am very paranoid about this; especially because the @ sign might conceivably be incorrectly interpreted by some layer of scripting, being a non-alphanumeric character and highly atypical for filenames/paths. What about having dedicated commands "destroysnapshot", "destroyclone", or "remove" (less dangerous variant of "destroy") that will never do anything but remove snapshots or clones? Alternatively having something along the lines of "zfs destroy --nofs" or "zfs destroy --safe". I realize this is borderline being in the same territory as special casing "rm -rf /" and similar, which is generally not considered a good idea. But somehow the snapshot situation feels a lot more risky. This isn't the first time this subject has come up. You are definitely NOT alone. The problem is only compounded when doing recursive actions. The general request has been for a confirmation "Are you sure?" which could be over-ridden with a -f. The general response is "if run from a script everyone will use -f and defeat the purpose". The suggestions you've come up with above are very good ones. I think the addition of "destroysnap" or "destroyclone" are particularly good because they could be added without conflicting with or changing the existing interfaces. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
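[Editor's note] As an interim measure, a tiny wrapper gives the proposed "destroysnap" behaviour today. This is a hypothetical script, not anything shipped with ZFS:

#!/bin/sh
# destroysnap: only ever destroy snapshots; refuse anything else
case "$1" in
    *@*) exec zfs destroy "$1" ;;
    *)   echo "refusing to destroy '$1': not a snapshot" >&2; exit 1 ;;
esac

Scripts call the wrapper instead of zfs destroy directly, so a mangled variable can never take out a whole filesystem.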
Re: [zfs-discuss] New zfs pr0n server :)))
Diego Righi wrote: Hi all, I just built a new zfs server for home and, being a long time and avid reader of this forum, I'm going to post my config specs and my benchmarks hoping this could be of some help for others :) http://www.sickness.it/zfspr0nserver.jpg http://www.sickness.it/zfspr0nserver.txt http://www.sickness.it/zfspr0nserver.png http://www.sickness.it/zfspr0nserver.pdf Correct me if I'm wrong: from the benchmark results, I understand that this setup is slow at writing, but fast at reading (and this is perfect for my usage, copying large files once and then accessing only to read them). It also seems that at 128kb it gives the best performances, iirc due to the zfs stripe size (again, correct me if I'm wrong :). I'd happily try any other test, but if you suggest bonnie++ please tell me what's the right version to use, too much of them I really can't understand which to try! tnx :) Classy. +1 for style. ;) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZVol Panic on 62
May 25 23:32:59 summer unix: [ID 836849 kern.notice] May 25 23:32:59 summer ^Mpanic[cpu1]/thread=1bf2e740: May 25 23:32:59 summer genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff00232c3a80 addr=490 occurred in module "unix" due to a NULL pointer dereference May 25 23:32:59 summer unix: [ID 10 kern.notice] May 25 23:32:59 summer unix: [ID 839527 kern.notice] grep: May 25 23:32:59 summer unix: [ID 753105 kern.notice] #pf Page fault May 25 23:32:59 summer unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x490 May 25 23:32:59 summer unix: [ID 243837 kern.notice] pid=18425, pc=0xfb83b6bb, sp=0xff00232c3b78, eflags=0x10246 May 25 23:32:59 summer unix: [ID 211416 kern.notice] cr0: 8005003b cr4: 6f8 May 25 23:32:59 summer unix: [ID 354241 kern.notice] cr2: 490 cr3: 1fce52000 cr8: c May 25 23:32:59 summer unix: [ID 592667 kern.notice]rdi: 490 rsi:0 rdx: 1bf2e740 May 25 23:32:59 summer unix: [ID 592667 kern.notice]rcx:0 r8:d r9: 62ccc700 May 25 23:32:59 summer unix: [ID 592667 kern.notice]rax:0 rbx:0 rbp: ff00232c3bd0 May 25 23:32:59 summer unix: [ID 592667 kern.notice]r10: fc18 r11:0 r12: 490 May 25 23:32:59 summer unix: [ID 592667 kern.notice]r13: 450 r14: 52e3aac0 r15:0 May 25 23:32:59 summer unix: [ID 592667 kern.notice]fsb:0 gsb: fffec3731800 ds: 4b May 25 23:32:59 summer unix: [ID 592667 kern.notice] es: 4b fs:0 gs: 1c3 May 25 23:33:00 summer unix: [ID 592667 kern.notice]trp:e err:2 rip: fb83b6bb May 25 23:33:00 summer unix: [ID 592667 kern.notice] cs: 30 rfl:10246 rsp: ff00232c3b78 May 25 23:33:00 summer unix: [ID 266532 kern.notice] ss: 38 May 25 23:33:00 summer unix: [ID 10 kern.notice] May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3960 unix:die+c8 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3a70 unix:trap+135b () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3a80 unix:cmntrap+e9 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3bd0 unix:mutex_enter+b () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3c20 zfs:zvol_read+51 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3c50 genunix:cdev_read+3c () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3cd0 specfs:spec_read+276 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3d40 genunix:fop_read+3f () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3e90 genunix:read+288 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3ec0 genunix:read32+1e () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3f10 unix:brand_sys_syscall32+1a3 () May 25 23:33:00 summer unix: [ID 10 kern.notice] May 25 23:33:00 summer genunix: [ID 672855 kern.notice] syncing file systems... Does anyone have an idea of what bug this might be? Occurred on X86 B62. I'm not seeing any putbacks into 63 or bugs that seem to match. Any insight is appreciated. Core's are available. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, iSCSI + Mac OS X Tiger (globalSAN iSCSI)
George wrote: > I have set up an iSCSI ZFS target that seems to connect properly from > the Microsoft Windows initiator in that I can see the volume in MMC > Disk Management. > > > When I shift over to Mac OS X Tiger with globalSAN iSCSI, I am able to > set up the Targets with the target name shown by `iscsitadm list > target` and when I actually connect or "Log On" I see that one > connection exists on the Solaris server. I then go on to the Sessions > tab in globalSAN and I see the session details and it appears that > data is being transferred via the PDUs Sent, PDUs Received, Bytes, > etc. HOWEVER the connection then appears to terminate on the Solaris > side if I check it a few minutes later it shows no connections, but > the Mac OS X initiator still shows connected although no more traffic > appears to be flowing in the Session Statistics dialog area. > > > Additionally, when I then disconnect the Mac OS X initiator it seems > to drop fine on the Mac OS X side, even though the Solaris side has > shown it gone for a while, however when I reconnect or Log On again, > it seems to spin infinitely on the "Target Connect..." dialog. > Solaris is, interestingly, showing 1 connection while this apparent > issue (spinning beachball of death) is going on with globalSAN. Even > killing the Mac OS X process doesn't seem to get me full control again > as I have to restart the system to kill all processes (unless I can > hunt them down and `kill -9` them which I've not successfully done > thus far). > > Has anyone dealt with this before and perhaps be able to assist or at > least throw some further information towards me to troubleshoot this?

When I learned of the globalSAN Initiator I was overcome with joy. After about two days of spending way too much time with it, I gave up. Have a look at their forum (http://www.snsforums.com/index.php?s=b0c9031ebe1a89a40cfe4c417e3443f1&showforum=14); there is a wide range of problems. In my case connections to the target (Solaris/ZFS/iscsitgt) look fine and dandy initially, but you can't actually use the connection, on reboot globalSAN goes psycho, and so on. At this point I've given up on the product, at least for now. If I could actually get an accessible disk at least part of the time I'd dig my fingers into it, but it doesn't offer a usable remote disk to begin with, and in a variety of other environments it has identical problems. I consider debugging it to be purely academic at this point. It's a great way to gain insight into the inner workings of iSCSI, but without source code or DTrace on the Mac it's hard to expect any big gains. That's my personal take. If you really want to go hacking on it regardless, bring it up on the Storage list and we can corporately enjoy the academic challenge of finding the problems, but there is nothing to suggest it's an OpenSolaris issue. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Dick Davies wrote: > On 04/10/2007, Nathan Kroenert <[EMAIL PROTECTED]> wrote: > > >> Client A >> - import pool make couple-o-changes >> >> Client B >> - import pool -f (heh) >> > > >> Oct 4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80: >> Oct 4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion >> failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5 >> == 0x0) >> , file: ../../common/fs/zfs/space_map.c, line: 339 >> Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160 >> genunix:assfail3+b9 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200 >> zfs:space_map_load+2ef () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240 >> zfs:metaslab_activate+66 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300 >> zfs:metaslab_group_alloc+24e () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0 >> zfs:metaslab_alloc_dva+192 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470 >> zfs:metaslab_alloc+82 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0 >> zfs:zio_dva_allocate+68 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0 >> zfs:zio_next_stage+b3 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510 >> zfs:zio_checksum_generate+6e () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530 >> zfs:zio_next_stage+b3 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0 >> zfs:zio_write_compress+239 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0 >> zfs:zio_next_stage+b3 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610 >> zfs:zio_wait_for_children+5d () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630 >> zfs:zio_wait_children_ready+20 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650 >> zfs:zio_next_stage_async+bb () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670 >> zfs:zio_nowait+11 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960 >> zfs:dbuf_sync_leaf+1ac () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0 >> zfs:dbuf_sync_list+51 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10 >> zfs:dnode_sync+23b () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50 >> zfs:dmu_objset_sync_dnodes+55 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0 >> zfs:dmu_objset_sync+13d () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40 >> zfs:dsl_pool_sync+199 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0 >> zfs:spa_sync+1c5 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60 >> zfs:txg_sync_thread+19a () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70 >> unix:thread_start+8 () >> Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] >> > > >> Is this a known issue, already fixed in a later build, or should I bug it? >> > > It shouldn't panic the machine, no. I'd raise a bug. > > >> After spending a little time playing with iscsi, I have to say it's >> almost inevitable that someone is going to do this by accident and panic >> a big box for what I see as no good reason. (though I'm happy to be >> educated... 
;) >> > > You use ACLs and TPGT groups to ensure 2 hosts can't simultaneously > access the same LUN by accident. You'd have the same problem with > Fibre Channel SANs. > I ran into similar problems when replicating via AVS. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for OSX - it'll be in there.
Dale Ghent wrote: > ...and eventually in a read-write capacity: > > http://www.macrumors.com/2007/10/04/apple-seeds-zfs-read-write- > developer-preview-1-1-for-leopard/ > > Apple has seeded version 1.1 of ZFS (Zettabyte File System) for Mac > OS X to Developers this week. The preview updates a previous build > released on June 26, 2007. > Y! Finally my USB Thumb Drives will work on my MacBook! :) I wonder if it'll automatically mount the Zpool on my iPod when I sync it. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Quota Oddness
I've run across an odd issue with ZFS quotas. This is an snv_43 system with several zones/zfs datasets, but only one is affected. The dataset shows 10GB used and 12GB referenced, but counting the files finds only 6.7GB of data:

zones/ABC        10.8G  26.2G  12.0G  /zones/ABC
zones/ABC@now    14.7M      -  12.0G  -

[xxx:/zones/ABC/.zfs/snapshot/now] root# gdu --max-depth=1 -h .
43k     ./dev
6.7G    ./root
1.5k    ./lu
6.7G    .

I don't understand what might cause this disparity. This is an older box, snv_43. Any bugs that might apply, fixed or in progress? Thanks. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Panic on Zpool Import (Urgent)
Today, suddenly, without any apparent reason that I can find, I'm getting panics during zpool import. The system paniced earlier today and has been suffering since. This is snv_43 on a Thumper. Here's the stack:

panic[cpu0]/thread=99adbac0: assertion failed: ss != NULL, file: ../../common/fs/zfs/space_map.c, line: 145

fe8000a240a0 genunix:assfail+83 ()
fe8000a24130 zfs:space_map_remove+1d6 ()
fe8000a24180 zfs:space_map_claim+49 ()
fe8000a241e0 zfs:metaslab_claim_dva+130 ()
fe8000a24240 zfs:metaslab_claim+94 ()
fe8000a24270 zfs:zio_dva_claim+27 ()
fe8000a24290 zfs:zio_next_stage+6b ()
fe8000a242b0 zfs:zio_gang_pipeline+33 ()
fe8000a242d0 zfs:zio_next_stage+6b ()
fe8000a24320 zfs:zio_wait_for_children+67 ()
fe8000a24340 zfs:zio_wait_children_ready+22 ()
fe8000a24360 zfs:zio_next_stage_async+c9 ()
fe8000a243a0 zfs:zio_wait+33 ()
fe8000a243f0 zfs:zil_claim_log_block+69 ()
fe8000a24520 zfs:zil_parse+ec ()
fe8000a24570 zfs:zil_claim+9a ()
fe8000a24750 zfs:dmu_objset_find+2cc ()
fe8000a24930 zfs:dmu_objset_find+fc ()
fe8000a24b10 zfs:dmu_objset_find+fc ()
fe8000a24bb0 zfs:spa_load+67b ()
fe8000a24c20 zfs:spa_import+a0 ()
fe8000a24c60 zfs:zfs_ioc_pool_import+79 ()
fe8000a24ce0 zfs:zfsdev_ioctl+135 ()
fe8000a24d20 genunix:cdev_ioctl+55 ()
fe8000a24d60 specfs:spec_ioctl+99 ()
fe8000a24dc0 genunix:fop_ioctl+3b ()
fe8000a24ec0 genunix:ioctl+180 ()
fe8000a24f10 unix:sys_syscall32+101 ()

syncing file systems... done

This is almost identical to a post to this list over a year ago titled "ZFS Panic". There was follow-up on it, but the results didn't make it back to the list. I spent time doing a full sweep for any hardware failures, pulled 2 drives that I suspected as problematic but weren't flagged as such, etc, etc, etc. Nothing helps. Bill suggested a 'zpool import -o ro' on the other post, but that's not working either. I _can_ use 'zpool import' to see the pool, but I have to force the import. A simple 'zpool import' returns output in about a minute. 'zpool import -f poolname' takes almost exactly 10 minutes every single time, like it hits some timeout and then panics. I did notice that while the 'zpool import' is running 'iostat' is useless, it just hangs. I still want to believe this is some device misbehaving but I have no evidence to support that theory. Any and all suggestions are greatly appreciated. I've put around 8 hours into this so far and I'm getting absolutely nowhere. Thanks benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Removing An Errant Drive From Zpool
I made a really stupid mistake. While having trouble removing a hot spare marked as failed, I was trying several ways to put it back in a good state. One means I tried was 'zpool add pool c5t3d0'... but I forgot to use the proper syntax, "zpool add pool spare c5t3d0". Now I'm in a bind. I've got 4 large raidz2's and now this puny 500GB drive in the config:

...
  raidz2    ONLINE 0 0 0
    c5t7d0  ONLINE 0 0 0
    c5t2d0  ONLINE 0 0 0
    c7t7d0  ONLINE 0 0 0
    c6t7d0  ONLINE 0 0 0
    c1t7d0  ONLINE 0 0 0
    c0t7d0  ONLINE 0 0 0
    c4t3d0  ONLINE 0 0 0
    c7t3d0  ONLINE 0 0 0
    c6t3d0  ONLINE 0 0 0
    c1t3d0  ONLINE 0 0 0
    c0t3d0  ONLINE 0 0 0
  c5t3d0    ONLINE 0 0 0
spares
  c5t3d0    FAULTED corrupted data
  c4t7d0    AVAIL
...

Detach and Remove won't work. Does anyone know of a way to get that c5t3d0 out of the data configuration and back to hot-spare duty where it belongs? If I understand the layout properly, this should not have an adverse impact on my existing configuration, but if I can't dump it, what happens when that disk fills up? I can't believe I made such a bone-headed mistake. This is one of those times when an "Are you sure you...?" prompt would be helpful. :( benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Removing An Errant Drive From Zpool
Eric Schrock wrote: > There's really no way to recover from this, since we don't have device > removal. However, I'm surprised that no warning was given. There are at > least two things that should have happened: > > 1. zpool(1M) should have warned you that the redundancy level you were > attempting did not match that of your existing pool. This doesn't > apply if you already have a mixed level of redundancy. > > 2. zpool(1M) should have warned you that the device was in use as an > active spare and not let you continue. > > What bits were you running? > snv_78, however the pool was created on snv_43 and hasn't yet been upgraded. Though, programmatically, I can't see why there would be a difference in the way 'zpool' would handle the check. The big question is, if I'm stuck like this permanently, what's the potential risk? Could I potentially just fail that drive and leave it in a failed state? benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Removing An Errant Drive From Zpool
Robert Milkowski wrote: > If you can't re-create the pool (+ backup & restore your data) I would > recommend waiting for device removal in ZFS, and in the meantime I would > attach another drive to it so you've got a mirrored configuration, then > remove them once device removal is there. Since you're already > working on Nevada you could probably adopt new bits quickly. > > The only question is when device removal is going to be integrated - > last time someone mentioned it here it was supposed to be by the end > of last year... > Ya, I'm afraid you're right. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Panic on Zpool Import (Urgent)
The solution here was to upgrade to snv_78. By "upgrade" I mean re-jumpstart the system. I tested snv_67 via net-boot but the pool paniced just as below. I also attempted using zfs_recover without success. I then tested snv_78 via net-boot, used both "aok=1" and "zfs:zfs_recover=1", and was able to (slowly) import the pool. Following that test I exported and then did a full re-install of the box.

A very important note to anyone upgrading a Thumper! Don't forget about the NCQ bug. After upgrading to a release more recent than snv_60, add the following to /etc/system:

set sata:sata_max_queue_depth = 0x1

If you don't, life will be highly unpleasant and you'll believe that disks are failing everywhere when in fact they are not. benr.

Ben Rockwood wrote:
> Today, suddenly, without any apparent reason that I can find, I'm
> getting panics during zpool import. The system paniced earlier today
> and has been suffering since. This is snv_43 on a thumper. Here's the
> stack:
>
> panic[cpu0]/thread=99adbac0: assertion failed: ss != NULL, file:
> ../../common/fs/zfs/space_map.c, line: 145
>
> fe8000a240a0 genunix:assfail+83 ()
> fe8000a24130 zfs:space_map_remove+1d6 ()
> fe8000a24180 zfs:space_map_claim+49 ()
> fe8000a241e0 zfs:metaslab_claim_dva+130 ()
> fe8000a24240 zfs:metaslab_claim+94 ()
> fe8000a24270 zfs:zio_dva_claim+27 ()
> fe8000a24290 zfs:zio_next_stage+6b ()
> fe8000a242b0 zfs:zio_gang_pipeline+33 ()
> fe8000a242d0 zfs:zio_next_stage+6b ()
> fe8000a24320 zfs:zio_wait_for_children+67 ()
> fe8000a24340 zfs:zio_wait_children_ready+22 ()
> fe8000a24360 zfs:zio_next_stage_async+c9 ()
> fe8000a243a0 zfs:zio_wait+33 ()
> fe8000a243f0 zfs:zil_claim_log_block+69 ()
> fe8000a24520 zfs:zil_parse+ec ()
> fe8000a24570 zfs:zil_claim+9a ()
> fe8000a24750 zfs:dmu_objset_find+2cc ()
> fe8000a24930 zfs:dmu_objset_find+fc ()
> fe8000a24b10 zfs:dmu_objset_find+fc ()
> fe8000a24bb0 zfs:spa_load+67b ()
> fe8000a24c20 zfs:spa_import+a0 ()
> fe8000a24c60 zfs:zfs_ioc_pool_import+79 ()
> fe8000a24ce0 zfs:zfsdev_ioctl+135 ()
> fe8000a24d20 genunix:cdev_ioctl+55 ()
> fe8000a24d60 specfs:spec_ioctl+99 ()
> fe8000a24dc0 genunix:fop_ioctl+3b ()
> fe8000a24ec0 genunix:ioctl+180 ()
> fe8000a24f10 unix:sys_syscall32+101 ()
>
> syncing file systems... done
>
> This is almost identical to a post to this list over a year ago titled
> "ZFS Panic". There was follow-up on it but the results didn't make it
> back to the list.
>
> I spent time doing a full sweep for any hardware failures, pulled 2
> drives that I suspected as problematic but weren't flagged as such, etc,
> etc, etc. Nothing helps.
>
> Bill suggested a 'zpool import -o ro' on the other post, but that's not
> working either.
>
> I _can_ use 'zpool import' to see the pool, but I have to force the
> import. A simple 'zpool import' returns output in about a minute.
> 'zpool import -f poolname' takes almost exactly 10 minutes every single
> time, like it hits some timeout and then panics.
>
> I did notice that while the 'zpool import' is running 'iostat' is
> useless, just hangs. I still want to believe this is some device
> misbehaving but I have no evidence to support that theory.
>
> Any and all suggestions are greatly appreciated. I've put around 8
> hours into this so far and I'm getting absolutely nowhere.
>
> Thanks
>
> benr.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
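Pulling the tunables mentioned in this thread together, the /etc/system entries would look roughly like the following. This is a sketch based only on the settings named above, not a general recommendation; aok and zfs_recover are recovery aids, and the SATA line applies to Thumpers on builds newer than snv_60:

    * /etc/system
    * continue past failed kernel ASSERTs (recovery use only)
    set aok = 1
    * let ZFS attempt to import a damaged pool (recovery use only)
    set zfs:zfs_recover = 1
    * Thumper NCQ workaround
    set sata:sata_max_queue_depth = 0x1

After rebooting, the SATA setting can be confirmed from the running kernel with mdb:

    echo "sata_max_queue_depth/X" | mdb -k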
[zfs-discuss] ZFS and ACL's over NFSv3
Can someone please clarify the ability to utilize ACLs over NFSv3 from a ZFS share? I can "getfacl" but I can't "setfacl". I can't find any documentation in this regard. My suspicion is that ZFS shares must be NFSv4 in order to utilize ACLs, but I'm hoping this isn't the case. Can anyone definitively speak to this? The closest related bug I can find is 6340720, which simply says "See comments." benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
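A quick way to compare the two protocol versions from a Solaris client is to mount the same share once with each version and try the ACL operations on both. A sketch only; the server name, share, mount points, and user are placeholders, and the POSIX-draft setfacl call is exactly the operation that appears not to work against ZFS over v3:

    # client side
    mount -F nfs -o vers=3 server:/export/share /mnt/v3
    mount -F nfs -o vers=4 server:/export/share /mnt/v4

    # NFSv3 clients only speak the POSIX-draft ACL interfaces
    setfacl -m user:fred:rwx /mnt/v3/somefile
    getfacl /mnt/v3/somefile

    # NFSv4 clients carry the richer NFSv4/ZFS ACL model
    chmod A+user:fred:read_data/write_data:allow /mnt/v4/somefile
    ls -v /mnt/v4/somefile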
[zfs-discuss] 40min ls in empty directory
I've run into an odd problem which I lovingly refer to as a "black hole directory". On a Thumper used for mail stores we've found finds take an exceptionally long time to run. There are directories that have as many as 400,000 files, which I immediately considered the culprit. However, under investigation, they aren't the problem at all. The problem is seen here in this truss output (first column is delta time):

0.0001 lstat64("tmp", 0x08046A20) = 0
0. openat(AT_FDCWD, "tmp", O_RDONLY|O_NDELAY|O_LARGEFILE) = 8
0.0001 fcntl(8, F_SETFD, 0x0001) = 0
0. fstat64(8, 0x08046920) = 0
0. fstat64(8, 0x08046AB0) = 0
0. fchdir(8) = 0
1321.3133 getdents64(8, 0xFEE48000, 8192) = 48
1255.8416 getdents64(8, 0xFEE48000, 8192) = 0
0.0001 fchdir(7) = 0
0.0001 close(8) = 0

These two getdents64 syscalls take approx 20 mins each. Notice that the directory structure is 48 bytes; the directory is empty:

drwx-- 2 102 1022 Feb 21 02:24 tmp

My assumption is that the directory is corrupt, but I'd like to prove that. I have a scrub running on the pool, but it's got about 16 hours to go before it completes. 20% complete thus far and nothing is reported. No errors are logged when I stimulate this problem. Does anyone have suggestions on how to get additional data on this issue? I've used DTrace flows to examine it; however, what I really want to see is the ZIOs issued as a result of the getdents, and I can't see how to do so. Ideally I'd quiet the system and watch all ZIOs occurring while I stimulate it, but this is production and not possible. If anyone knows how to watch DMU/ZIO activity that _only_ pertains to a certain PID please let me know. ;) Suggestions on how to pro-actively catch these sorts of instances are welcome, as are alternative explanations. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
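Not a complete answer to the per-PID question, but an fbt-based sketch of the kind of tracing that can attribute synchronously issued ZFS work to the triggering process. The function name is taken from the OpenSolaris ZFS source of that era and may differ by build, and I/O issued asynchronously by other kernel threads will not show up under the target PID:

    # attach to the ls/find process and aggregate kernel stacks leading to zio_create
    dtrace -p <pid> -n '
    fbt:zfs:zio_create:entry
    /pid == $target/
    {
            @[stack(8)] = count();
    }
    tick-30s { exit(0); }'

Running the same aggregation keyed on probefunc across fbt:zfs::entry (with the same pid predicate) is a heavier but broader way to see which DMU/ZIO routines the getdents is driving.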
[zfs-discuss] zvol Performance
Hello, I'm curious if anyone would mind sharing their experiences with zvols. I recently started using a zvol as an iSCSI backend and was surprised by the performance I was getting. Further testing revealed that it wasn't an iSCSI performance issue but a zvol issue. Testing on a SATA disk locally, I get these numbers (sequential write):

UFS: 38MB/s
ZFS: 38MB/s
Zvol UFS: 6MB/s
Zvol Raw: ~6MB/s

ZFS is nice and fast but zvol performance just drops off a cliff. Suggestions or observations by others using zvols would be extremely helpful. My current testing is being done using a debug build of B44 (NV 6/10/06). benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
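The tests above were plain sequential writes; something along these lines reproduces the comparison. The pool, dataset, and volume names are placeholders, and the block size and count are only an example, not the exact commands used in the post:

    # ZFS filesystem
    dd if=/dev/zero of=/testpool/fs/ddtest bs=128k count=8192

    # zvol, raw device
    zfs create -V 10g testpool/testvol
    dd if=/dev/zero of=/dev/zvol/rdsk/testpool/testvol bs=128k count=8192

    # zvol with UFS on top: newfs the zvol, mount it, and repeat the dd there
    newfs /dev/zvol/rdsk/testpool/testvol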
[zfs-discuss] Re: NFS Performance and Tar
I was really hoping for some option other than ZIL_DISABLE, but finally gave up the fight. Some people suggested NFSv4 helping over NFSv3, but it didn't... at least not enough to matter. ZIL_DISABLE was the solution, sadly. I'm running B43/X86 and hoping to get up to 48 or so soonish (I BFU'd it straight to B48 last night and brick'ed it). Here are the times. This is an untar (gtar xfj) of SIDEkick (http://www.cuddletech.com/blog/pivot/entry.php?id=491) on NFSv4 on a 20TB RAIDZ2 ZFS pool:

ZIL Enabled:  real 1m26.941s
ZIL Disabled: real 0m5.789s

I'll update this post again when I finally get B48 or newer on the system and try it. Thanks to everyone for their suggestions. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
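For reference, on builds of this era zil_disable is a kernel variable rather than a pool or dataset property, so it is normally set in /etc/system before the pool is imported. A sketch of the usual procedure; disabling the ZIL gives up synchronous-write guarantees, so it is a diagnostic/benchmark lever rather than a fix:

    * /etc/system
    set zfs:zil_disable = 1

    # check the current value on a live system
    echo "zil_disable/D" | mdb -k

    # flip it live (historically only takes effect for datasets mounted afterwards)
    echo "zil_disable/W 1" | mdb -kw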
[zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP shared via ZFS sharenfs on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed "dns" was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF. Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006

10:38:05  %usr  %sys  %wio  %idle
10:38:06     0    27     0     73
10:38:07     0    27     0     73
10:38:09     0    27     0     73
10:38:10     1    26     0     73
10:38:11     0    26     0     74
10:38:12     0    26     0     74
10:38:13     0    24     0     76
10:38:14     0     6     0     94
10:38:15     0     7     0     93
10:38:22     0    99     0      1  <--
10:38:23     0    94     0      6  <--
10:38:24     0    28     0     72
10:38:25     0    27     0     73
10:38:26     0    27     0     73
10:38:27     0    27     0     73
10:38:28     0    27     0     73
10:38:29     1    30     0     69
10:38:30     0    27     0     73

And so we consider whether or not there is a pattern to the frequency.
The following is sar output from any lines in which sys is above 90%:

10:40:04  %usr  %sys  %wio  %idle   Delta
10:40:11     0    97     0      3
10:40:45     0    98     0      2   34 seconds
10:41:02     0    94     0      6   17 seconds
10:41:26     0   100     0      0   24 seconds
10:42:00     0   100     0      0   34 seconds
10:42:25  (end of sample)            >25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100

10:45:43  runq-sz  %runocc  swpq-sz  %swpocc
10:45:51     27.0       85      0.0        0
10:45:57      1.0       20      0.0        0
10:46:02      2.0       60      0.0        0
10:46:13     19.8       99      0.0        0
10:46:23     17.7       99      0.0        0
10:46:34     24.4       99      0.0        0
10:46:41     22.1       97      0.0        0
10:46:48     13.0       96      0.0        0
10:46:55     25.3      102      0.0        0

Looking at the per-CPU breakdown:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
 00 00 324 224000 1540 00 100 0 0
 10 00 1140 2260 10 130860 1 0 99
 20 00 162 138 1490540 00 1 0 99
 30 00556 460430 00 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
 00 00 310 210 340 17 1717 50 100 0 0
 10 00 1521 2000 17 265591 65 0 34
 20 00 271 197 1751 13 202 00 66 0 34
 30 00 12
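When a box is pinned in %sys like this, sampling kernel stacks with the DTrace profile provider is usually the fastest way to see where the time is going. A minimal, generic sketch; nothing here is specific to this Thumper:

    # sample on-CPU kernel stacks at 997Hz for 30 seconds
    dtrace -n '
    profile-997
    /arg0/
    {
            @[stack()] = count();
    }
    tick-30s { exit(0); }'

The predicate on arg0 (the kernel program counter) restricts samples to time spent in the kernel, and the most frequent stacks printed at exit point at the code responsible for the sys-time waves.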
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
eric kustarz wrote: So i'm guessing there's lots of files being created over NFS in one particular dataset? We should figure out how many creates/second you are doing over NFS (i should have put a timeout on the script). Here's a real simple one (from your snoop it looked like you're only doing NFSv3, so i'm not tracking NFSv4):

"
#!/usr/sbin/dtrace -s

rfs3_create:entry,
zfs_create:entry
{
        @creates[probefunc] = count();
}

tick-60s { exit(0); }
"

Eric, I love you. Running this bit of DTrace revealed more than 4,000 files being created in almost any given 60-second window. And I've only got one system that would fit that sort of mass file creation: our Joyent Connector product's Courier IMAP server, which uses Maildir. As a test I simply shut down Courier and unmounted the mail NFS share for good measure, and sure enough the problem vanished and could not be reproduced. 10 minutes later I re-enabled Courier and our problem came back. Clearly ZFS file creation is just amazingly heavy even with the ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores, we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected. You asked for the zpool and zfs info, which I don't want to share because it's confidential (if you want it privately I'll do so, but not on a public list), but I will say that it's a single massive zpool in which we're using less than 2% of the capacity. But in thinking about this problem, even if we used 2 or more pools, the CPU consumption still would have choked the system, right? This leaves me really nervous about what we'll do when it's not an internal mail server that's creating all those files but a customer. Oddly enough, this might be a very good reason to use iSCSI instead of NFS on the Thumper. Eric, I owe you a couple cases of beer for sure. I can't tell you how much I appreciate your help. Thanks to everyone else who chimed in with ideas and suggestions, all of you guys are the best! benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Spencer Shepler wrote: Good to hear that you have figured out what is happening, Ben. For future reference, there are two commands that you may want to make use of in observing the behavior of the NFS server and individual filesystems. There is the trusty nfsstat command. In this case, you would have been able to do something like:

nfsstat -s -v3 60

This will provide all of the server-side NFSv3 statistics on 60-second intervals. Then there is a new command, fsstat, that will provide vnode-level activity on a per-filesystem basis. Therefore, if the NFS server has multiple filesystems active and you want to look at just one, something like this can be helpful:

fsstat /export/foo 60

fsstat has a 'full' option that will list all of the vnode operations or just certain types. It also will watch a filesystem type (e.g. zfs, nfs). Very useful. nfsstat I've been using, but fsstat I was unaware of. Wish I'd used it rather than duplicating most of its functionality with a D script. :) Thanks for the tip. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Bill Moore wrote: On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote: Clearly ZFS file creation is just amazingly heavy even with ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected. For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved. Next step is to figure out why NFS kills us. Agreed. If mass file creation were a problem locally I'd think we'd have people beating down the doors with complaints. One thought I had as a workaround was to move all my mail on NFS to an iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work fine. Hopefully I can try it. To round out the discussion, the root cause of this whole mess was Courier IMAP locking. After isolating the problem last night and writing a little D script to find out what files were being created, it was obviously lock files; turning off locking dropped file creations to a reasonable level and our problem vanished. If I can help at all with testing or analysis please let me know. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
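The kind of one-liner that identifies which names are being created looks roughly like this. It leans on zfs_create()'s second argument being the new file's name in the ZFS source of this era, so treat it as a sketch of the approach rather than the exact script that was used:

    # count file creations by name for one minute (run on the NFS server)
    dtrace -n '
    fbt:zfs:zfs_create:entry
    {
            @names[stringof(arg1)] = count();
    }
    tick-60s { exit(0); }'

Sorting the resulting aggregation makes any lock-file churn, or other pathological create pattern, stand out immediately.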
Re: [nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Robert Milkowski wrote: Hello eric, Saturday, December 9, 2006, 7:07:49 PM, you wrote: ek> Jim Mauro wrote: Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper? It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS. ek> Actually i forgot he had 'zil_disable' turned on, so it won't matter in ek> this case. Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported? Yes, absolutely. Set the variable in /etc/system, reboot, system comes up. That happened almost 2 months ago, long before this lock insanity problem popped up. To be clear, the ZIL issue was a problem for the creation of a handful of files of any size. Untar'ing a file was a massive performance drain. This issue, on the other hand, deals with thousands of little files being created all the time (IMAP locks). These are separate issues from my point of view. With ZIL slowness NFS performance was just slow, but we didn't see massive CPU usage; with this issue, on the other hand, we were seeing waves in 10-second-ish cycles where the run queue would go sky high with 0 idle. Please see the earlier mails for examples of the symptoms. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS works in waves
Stuart Glenn wrote: A little back story: I have a Norco DS-1220, a 12-bay SATA box; it is connected via eSATA (SiI3124) on PCI-X. Two drives are straight connections, then the other two ports go to 5x multipliers within the box. My needs/hopes for this were to use 12 500GB drives and ZFS to make a very large & simple data dump spot on my network for other servers to rsync to daily, use zfs snapshots for some quick backup, and if things worked out start saving up towards getting a thumper someday. The trouble is it is too slow to really be useable. At times it is fast enough to be useable, ~13MB/s write. However, this lasts for only a few minutes. It then just stalls doing nothing. iostat shows 100% blocking for one of the drives in the pool. I can, however, use dd to read or write directly to/from the disks all at the same time with good speed (~30MB/s according to dd). The test pools I have had are either 2 raidz of 6 drives or 3 raidz of 4 drives. The system is using an Athlon 64 3500+ & 1GB of RAM. Any suggestions on what I could do to make this useable? More RAM? Too many drives for ZFS? Any tests to find the real slowdown? I would really like to use ZFS & Solaris for this. Linux was able to use the same hardware using some beta kernel modules for the SATA multipliers & its software RAID at an acceptable speed, but I would like to finally rid my network of linux boxen. I have similar issues on my home workstation. They started happening when I put Seagate SATA-II drives with NCQ on a SI3124. I do not believe this to be an issue with ZFS. I've largely dismissed the issue as hardware caused, although I may be wrong. This system has had several problems with SATA-II drives, which hardware forums suggest are issues with the nForce4 chipset and SATA-II. Anyway, you're not alone, but it's not a ZFS issue. It's possible a tunable parameter in the SATA drivers would help. If I find an answer I'll let you know. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS
Andrew Summers wrote: > So, I've read the wikipedia, and have done a lot of research on google about > it, but it just doesn't make sense to me. Correct me if I'm wrong, but you > can take a simple 5/10/20 GB drive or whatever size, and turn it into > exabytes of storage space? > > If that is not true, please explain the importance of this other than the > self heal and those other features. > I'm probably to blame for the image of endless storage. With ZFS Sparse Volumes (aka: Thin Provisioning) you can make a 1G drive _look_ like a 500TB drive, but of course it isn't. See my entry on the topic here: http://www.cuddletech.com/blog/pivot/entry.php?id=729 With ZFS compression you can, however, potentially store 10GB of data on a 5GB drive. It really depends on what type of data you're storing and how compressible it is, but I've seen almost 2:1 compression in some cases by simply turning compression on. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
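To make the distinction concrete, here is roughly what the two features look like from the command line (pool and dataset names are placeholders):

    # a sparse ("thin provisioned") volume: the 500TB is a promise, not real space
    zfs create -s -V 500T tank/bigvol

    # compression: actually fits more data into the pool, depending on the data
    zfs set compression=on tank/data
    zfs get compressratio tank/data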
Re: [zfs-discuss] ZFS over NFS extra slow?
Brad Plecs wrote: I had a user report extreme slowness on a ZFS filesystem mounted over NFS over the weekend. After some extensive testing, the extreme slowness appears to only occur when a ZFS filesystem is mounted over NFS. One example is doing a 'gtar xzvf php-5.2.0.tar.gz'... over NFS onto a ZFS filesystem. This takes:

real    5m12.423s
user    0m0.936s
sys     0m4.760s

Locally on the server (to the same ZFS filesystem) it takes:

real    0m4.415s
user    0m1.884s
sys     0m3.395s

The same job over NFS to a UFS filesystem takes:

real    1m22.725s
user    0m0.901s
sys     0m4.479s

Same job locally on the server to the same UFS filesystem:

real    0m10.150s
user    0m2.121s
sys     0m4.953s

This is easily reproducible even with single large files, but the multiple small files seem to illustrate some awful sync latency between each file. Any idea why ZFS over NFS is so bad? I saw the threads that talk about an fsync penalty, but they don't seem relevant since the local ZFS performance is quite good. Known issue, discussed here: http://www.opensolaris.org/jive/thread.jspa?threadID=14696&tstart=15 benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
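One quick way to see the synchronous-commit difference for yourself is to count ZIL commits on the server during the NFS run versus the local run. A sketch using fbt; the function name is from the OpenSolaris ZFS source and may vary slightly by build:

    # run once during the NFS untar and once during the local untar, then compare
    dtrace -n '
    fbt:zfs:zil_commit:entry
    {
            @commits = count();
    }
    tick-60s { exit(0); }'

Over NFS, creates and commits have to be stable on disk before the server can reply, so the commit count (and the per-file latency it implies) should come out dramatically higher than for the same untar run locally.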