Re: [zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 4:33 PM, wrote:
>
>> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
>> stress-testing comparison, I found that with the same number of client
>> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> Which version of Solaris is this?

This is Solaris 11.

>> The apache document root is apparently in a zfs dataset. It looks like each
>> file system operation runs into zfs root mutex contention.
>>
>> Is this expected behavior? If so, is there any zfs tunable to
>> address this issue?
>
> The zfs_root() function is the function which is used when a
> mountpoint is traversed. Where is the apache document root
> directory and under which mountpoints?

I have an external storage server and I created a zfs pool on it.

# zpool list
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
export1   800G   277G   523G  34%  1.00x  ONLINE  -
rpool     117G   109G  8.44G  92%  1.00x  ONLINE  -

This pool is mounted under /export, and all of the apache documents
are in this pool at /export/webserver/apache2/htdocs. The php temporary
folder is set to /tmp, which is tmpfs.

Thanks,
-Aubrey

> Casper
Re: [zfs-discuss] volblocksize for VMware VMFS-5
2012-03-26 7:58, Yuri Vorobyev wrote:
> Hello.
>
> What are the best practices for choosing the ZFS volume volblocksize setting
> for VMware VMFS-5? The VMFS-5 block size is 1MB. Not sure how it corresponds
> with ZFS.
>
> Setup details follow:
> - 11 pairs of mirrors;
> - 600GB 15k SAS disks;
> - SSDs for L2ARC and ZIL;
> - COMSTAR FC target;
> - about 30 virtual machines, mostly Windows (so the underlying file system is
>   NTFS with a 4k block);
> - 3 ESXi hosts.
>
> Also, I will be glad to hear volume layout suggestions.
> I see several options:
> - one big zvol with a size equal to the size of the pool;
> - one big zvol with a size of the pool minus 20% (to avoid fragmentation);
> - several zvols (size?).
>
> Thanks for your attention.

You will still see fragmentation, because that's the way ZFS works: it never
overwrites recently-live data. It will try to coalesce new updates to the pool
(a transaction group, TXG) into a few large writes if contiguous stretches of
free space permit. And yes, reserving some space as unused should help against
fragmentation. I asked on the list, but got no response, whether space reserved
as an unused zvol can be used for such an anti-fragmentation reservation (to
forbid other datasets' writes into the bytes otherwise unallocated).

Regarding the question on "many zvols": this can be useful. For example, when
using dedicated datasets (either zvols or filesystem datasets) for each VM, you
can easily clone golden images into preconfigured VM guests on the ZFS side
with near-zero overhead (like Sun VDI does). You can also easily expand a
(cloned) zvol and then resize the guest FS with its own mechanisms, but
shrinking is tough (if at all possible) if you ever need it.

Also, you could store VMs or their parts (i.e. their system disks vs. data
disks, or critical VMs vs. testing VMs) in differently configured datasets
(perhaps hosted on different disks in the future, like 15k RPM vs. 7k RPM) -
then it would make sense to use different zvols (and pools). Or, if you make a
backup store for Veeam Backup or any other solution and emulate a tape drive
for bulk storage, you might want a zvol with the maximum volblocksize, while
you might use something else for live VM images...

Also note that if you ever plan to use ZFS snapshots, then in the case of zvols
the system will reserve another zvol's worth of space when making even the
first snapshot (e.g. if you have a 1GB swap zvol and make a snapshot even while
it is empty, the reservation becomes 2GB, with 1GB available to the user as the
block device). This allows you to completely rewrite the zvol block device
contents with a guarantee that you'll have enough space to keep both the
snapshot and the new data. So if snapshots are planned, zvols shouldn't exceed
half of the pool size. There is no such problem with filesystem snapshots -
those only use what has been allocated and is still referred to (not deleted)
by some snapshot.

Also beware that if you use really small volblocksizes, the pool's metadata
needed to address the volume's blocks adds considerable overhead compared to
your userdata (I'd expect roughly 1md:1ud with the minimal blocksize). I had
such problems (and numbers) a year ago and wrote about them on the Sun forums;
parts of my woes may have made it into this list's archive. Again, this should
not be a big problem with files, because those use variable-length blocks and
tend to use large ones when there are enough pending writes, so the metadata
portion is smaller.

So... my counter-question to you and the list: are there substantial benefits
to using ZFS as an iSCSI/zvol/VMFS5 provider instead of publishing an NFS
service and storing VM images as files?
Both resources can be shared with several clients (ESX hosts). I think that for
a number of reasons the NFS/files variant is more flexible. What are its
drawbacks? I see that you plan to make a COMSTAR FC target, so that networking
nuance is one reason for iSCSI vs. "IP over FC to make NFS"... but in general,
over jumbo-frame ethernet - which tool suits the task better? :)

I heard that VMware has some smallish limit on the number of NFS connections,
but 30 should be bearable...

HTH,
//Jim Klimov
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
2012-03-26 14:27, Aubrey Li wrote:
> The php temporary folder is set to /tmp, which is tmpfs.

By the way, how much RAM does the box have available? "tmpfs" in Solaris is
backed by "virtual memory". It behaves like a RAM disk (although perhaps
slower than the ramdisk FS seen in the livecd) as long as there is enough free
RAM, but tmpfs can get swapped out to disk.

If your installation followed the default wizard, your swap (or part of it)
could be in rpool/swap and backed by ZFS - leading to both many tmpfs accesses
and many ZFS accesses. (No idea about the mutex-spinning part, though -
whether it is normal or not.)

Also, I'm not sure whether tmpfs gets the benefits of caching, and the ZFS ARC
cache can consume lots of RAM and thus push tmpfs out to swap.

As a random guess, try pointing the PHP tmp directory to /var/tmp (backed by
ZFS) and see if any behavior changes?

Good luck,
//Jim
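A quick way to confirm which file system actually backs a given temporary
directory is a statvfs() check like the small sketch below. This is only an
illustrative example (not from the thread), and it relies on the
Solaris-specific f_basetype field, so it is not portable:

#include <sys/statvfs.h>
#include <stdio.h>

/*
 * Print the file system type backing each path given on the command line,
 * e.g. "tmpfs" for /tmp or "zfs" for /var/tmp on a default Solaris 11
 * install. Solaris-specific: f_basetype is not a portable field.
 */
int
main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        struct statvfs vfs;

        if (statvfs(argv[i], &vfs) != 0)
            perror(argv[i]);
        else
            printf("%s: %s\n", argv[i], vfs.f_basetype);
    }
    return 0;
}

Running it against /tmp and /var/tmp shows directly whether the PHP temp
directory ends up on tmpfs or on ZFS.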
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
You care about #2 and #3 because you are fixated on a ZFS root lock contention
problem, and not open to a broader discussion about what your real problem
actually is. I am not saying there is not lock contention, and I am not saying
there is - I'll look at the data carefully later when I have more time.

Your problem statement, which took 20 emails to glean, is that the Solaris
system consumes more CPU than Linux on the same hardware, doing roughly the
same amount of work, and delivering roughly the same level of performance - is
that correct?

Please consider that, in Linux, you have no observability into kernel lock
statistics (at least, none that I know of). Linux uses kernel locks also, and
for this workload it seems likely to me that, if you could observe those
statistics, you would see numbers that would lead you to conclude you have
lock contention in Linux.

Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr; Solaris is 30% sys,
70% usr, running the same workload, doing the same amount of work, delivering
the same level of performance. Please validate that problem statement.

On Mar 25, 2012, at 9:51 PM, Aubrey Li wrote:

> On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro wrote:
>> If you're chasing CPU utilization, specifically %sys (time in the kernel),
>> I would start with a time-based kernel profile.
>>
>> # dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { trunc(@, 20); printa(@); }'
>>
>> I would be curious to see where the CPU cycles are being consumed first,
>> before going down the lock path…
>>
>> This assumes that most or all of the CPU utilization is %sys. If it's %usr,
>> we take a different approach.
>>
>
> Here is the output; I changed to "tick-5sec" and "trunc(@, 5)".
> No. 2 and No. 3 are what I care about.
>
> Thanks,
> -Aubrey
>
> 21  80536  :tick-5sec
>
> === 1 ===
> genunix`avl_walk+0x6a
> genunix`as_gap_aligned+0x2b7
> unix`map_addr_proc+0x179
> unix`map_addr+0x8e
> genunix`choose_addr+0x9e
> zfs`zfs_map+0x161
> genunix`fop_map+0xc5
> genunix`smmap_common+0x268
> genunix`smmaplf32+0xa2
> genunix`syscall_ap+0x92
> unix`_sys_sysenter_post_swapgs+0x149
> 1427
>
> === 2 ===
> unix`mutex_delay_default+0x7
> unix`mutex_vector_enter+0x2ae
> zfs`zfs_zget+0x46
> zfs`zfs_root+0x55
> genunix`fsop_root+0x2d
> genunix`traverse+0x65
> genunix`lookuppnvp+0x446
> genunix`lookuppnatcred+0x119
> genunix`lookupnameatcred+0x97
> genunix`lookupnameat+0x6b
> genunix`vn_openat+0x147
> genunix`copen+0x493
> genunix`openat64+0x2d
> unix`_sys_sysenter_post_swapgs+0x149
> 2645
>
> === 3 ===
> unix`mutex_delay_default+0x7
> unix`mutex_vector_enter+0x2ae
> zfs`zfs_zget+0x46
> zfs`zfs_root+0x55
> genunix`fsop_root+0x2d
> genunix`traverse+0x65
> genunix`lookuppnvp+0x446
> genunix`lookuppnatcred+0x119
> genunix`lookupnameatcred+0x97
> genunix`lookupnameat+0x6b
> genunix`cstatat_getvp+0x11e
> genunix`cstatat64_32+0x5d
> genunix`fstatat64_32+0x4c
> unix`_sys_sysenter_post_swapgs+0x149
> 3201
>
> === 4 ===
> unix`i86_mwait+0xd
> unix`cpu_idle_mwait+0x154
> unix`idle+0x116
> unix`thread_start+0x8
> 3559
>
> === 5 ===
> tmpfs`tmp_readdir+0x138
> genunix`fop_readdir+0xe8
> genunix`getdents64+0xd5
> unix`_sys_sysenter_post_swapgs+0x149
> 4589
>
> === 6 ===
> unix`strlen+0x3
> genunix`fop_readdir+0xe8
> genunix`getdents64+0xd5
> unix`_sys_sysenter_post_swapgs+0x149
> 5005
>
> === 7 ===
> tmpfs`tmp_readdir+0xc7
> genunix`fop_readdir+0xe8
> genunix`getdents64+0xd5
> unix`_sys_sysenter_post_swapgs+0x149
> 9548
>
> === 8 ===
> unix`strlen+0x8
> genunix`fop_readdir+0xe8
> genunix`getdents64+0xd5
> unix`_sys_sysenter_post_swapgs+0x149
> 11166
>
> === 9 ===
> unix`strlen+0xe
> genunix`fop_readdir+0xe8
> genunix`getdents64+0xd5
> unix`_sys_sysenter_post_swapgs+0x149
> 14491
>
> === 10 ===
> tmpfs`tmp_readdir+0xbe
>
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 7:28 PM, Jim Klimov wrote:
> 2012-03-26 14:27, Aubrey Li wrote:
>>
>> The php temporary folder is set to /tmp, which is tmpfs.
>>
>
> By the way, how much RAM does the box have available? "tmpfs" in Solaris
> is backed by "virtual memory". It behaves like a RAM disk (although perhaps
> slower than the ramdisk FS seen in the livecd) as long as there is enough
> free RAM, but tmpfs can get swapped out to disk.
>
> If your installation followed the default wizard, your swap (or part of it)
> could be in rpool/swap and backed by ZFS - leading to both many tmpfs
> accesses and many ZFS accesses. (No idea about the mutex-spinning part,
> though - whether it is normal or not.)
>
> Also, I'm not sure whether tmpfs gets the benefits of caching, and the ZFS
> ARC cache can consume lots of RAM and thus push tmpfs out to swap.
>
> As a random guess, try pointing the PHP tmp directory to /var/tmp (backed
> by ZFS) and see if any behavior changes?
>
> Good luck,
> //Jim
>

Thanks for your suggestions. Actually, the default PHP tmp directory
was /var/tmp, and I changed "/var/tmp" to "/tmp". This reduced zfs
root lock contention significantly. However, I still see a bunch of lock
contention. So I'm here to ask for help.

Thanks,
-Aubrey
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 8:24 PM, Jim Mauro wrote:
>
> You care about #2 and #3 because you are fixated on a ZFS root
> lock contention problem, and not open to a broader discussion
> about what your real problem actually is. I am not saying there is
> not lock contention, and I am not saying there is - I'll look at the
> data carefully later when I have more time.
>
> Your problem statement, which took 20 emails to glean, is that the
> Solaris system consumes more CPU than Linux on the same
> hardware, doing roughly the same amount of work, and delivering
> roughly the same level of performance - is that correct?
>
> Please consider that, in Linux, you have no observability into
> kernel lock statistics (at least, none that I know of). Linux uses kernel
> locks also, and for this workload it seems likely to me that, if you could
> observe those statistics, you would see numbers that would
> lead you to conclude you have lock contention in Linux.
>
> Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr;
> Solaris is 30% sys, 70% usr, running the same workload,
> doing the same amount of work, delivering the same level
> of performance. Please validate that problem statement.
>

You're definitely right. I'm running the same workload, doing the same
amount of work, delivering the same level of performance on the same
hardware platform; even the root disk type is exactly the same. So the
userland software stack is exactly the same. The difference is:

Linux is 15% sys, 55% usr.
Solaris is 30% sys, 70% usr.

Basically I agree with Fajar: this is probably not a fair comparison.
A robust system should not only deliver the highest performance; we should
also consider reliability, availability and serviceability, energy
efficiency, and other server-related features. No doubt ZFS is the most
excellent file system on the planet.

As Richard pointed out, if we look at the mpstat output:

SET minf mjf xcal  intr  ithr   csw  icsw  migr   smtx  srw  syscl usr sys wt idl sze
  0 35140  0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35  0   0  32

it includes smtx 256336 (spins on mutex). I also need to look at
icsw 30906 (involuntary context switches), migr 32919 (thread migrations),
and syscl 967806 (system calls).

And given that Solaris consumes 70% CPU in userland, I probably need to
break down how long it spends in apache, libphp, libc, etc. (BTW, what's
your approach for the usr% you mentioned above?)

I am not open to a broader discussion because this is the zfs mailing list.
The zfs root lock contention is what I have observed so far that I can post
on this forum. I can take care of other aspects and may ask for help
somewhere else.

I admit I didn't dig into Linux, and I agree there could be lock contention
there as well. But since Solaris consumes more CPU% - and hence more system
power - than Linux, I think I have to look at Solaris first, to see if there
is any tuning work that needs to be done. Do you agree this is the right way
to go ahead?

Thanks,
-Aubrey
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
I see nothing unusual in the lockstat data. I think you're barking up the
wrong tree.
 -- richard

On Mar 25, 2012, at 10:51 PM, Aubrey Li wrote:

> On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling wrote:
>> Apologies to the ZFSers, this thread really belongs elsewhere.
>>
>> Let me explain below:
>>
>> Root documentation path of apache is in zfs, you see
>> it at No.3 at the above dtrace report.
>>
>>
>> The sort is in reverse order. The large number you see below the
>> stack trace is the number of times that stack was seen. By far the
>> most frequently seen is tmpfs`tmp_readdir
>
> It's true, but I didn't see any potential issues there.
>
>>
>> tmpfs(/tmp) is the place where PHP place the temporary
>> folders and files.
>>
>>
>> bingo
>>
>
> I have to paste the lock investigation here again.
>
> Firstly, you can see which lock is spinning:
> ===
> # lockstat -D 10 sleep 2
>
> Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)
>
> Count indv cuml rcnt     nsec Lock                       Caller
> ------------------------------------------------------------------------------
> 829064  45%  45% 0.00    33280 0xff117624a5d0             rrw_enter_read+0x1b
> 705001  38%  82% 0.00    30983 0xff117624a5d0             rrw_exit+0x1d
> 140678   8%  90% 0.00    10546 0xff117624a6e0             zfs_zget+0x46
>  37208   2%  92% 0.00     5403 0xff114b136840             vn_rele+0x1e
>  33926   2%  94% 0.00     5417 0xff114b136840             lookuppnatcred+0xc5
>  27188   1%  95% 0.00     1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
>  11073   1%  96% 0.00     1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
>   9321   1%  96% 0.00     1961 0xff114b82a680             dnlc_lookup+0x83
>   6929   0%  97% 0.00     1590 0xff11573b8f28             zfs_fastaccesschk_execute+0x6a
>   5746   0%  97% 0.00     5935 0xff114b136840             lookuppnvp+0x566
> ------------------------------------------------------------------------------
>
> Then if you look at the caller of lock(0xff117624a5d0), you'll see
> it's ZFS, not tmpfs.
>
> Count indv cuml rcnt     nsec Lock                       Caller
> 48494   6%  17% 0.00   145263 0xff117624a5d0             rrw_enter_read+0x1b
>
>   nsec ------ Time Distribution ------ count     Stack
>    256 |                               17        rrw_enter+0x2c
>    512 |                               1120      zfs_root+0x3b
>   1024 |@                              1718      fsop_root+0x2d
>   2048 |@@                             4834      traverse+0x65
>   4096 |@@@                            18569     lookuppnvp+0x446
>   8192 |                               6620      lookuppnatcred+0x119
>  16384 |@                              2929      lookupnameatcred+0x97
>  32768 |@                              1635      lookupnameat+0x6b
>  65536 |                               894       cstatat_getvp+0x11e
> 131072 |                               1249      cstatat64_32+0x5d
> 262144 |@                              1620      fstatat64_32+0x4c
> 524288 |@                              2474      _sys_sysenter_post_swapgs+0x149
>
> That's why I posted this subject here. Hope it's clear this time.
>
> Thanks,
> -Aubrey

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
Re: [zfs-discuss] Good tower server for around 1,250 USD?
In message, Bob Friesenhahn writes:
> Almost all of the systems listed on the HCL are defunct and no longer
> purchasable except for on the used market. Obtaining an "approved"
> system seems very difficult. In spite of this, Solaris runs very well
> on many non-approved modern systems.

http://www.oracle.com/webfolder/technetwork/hcl/data/s11ga/systems/views/nonoracle_systems_all_results.mfg.page1.html

> I don't know what that means as far as the ability to purchase Solaris
> "support".

I believe it must pass the HCTS before Oracle will support Solaris running
on third-party hardware.
http://www.oracle.com/webfolder/technetwork/hcl/hcts/index.html

John
groenv...@acm.org
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
> > As a random guess, try pointing the PHP tmp directory to
> > /var/tmp (backed by ZFS) and see if any behavior changes?
> >
> > Good luck,
> > //Jim
> >
>
> Thanks for your suggestions. Actually, the default PHP tmp directory
> was /var/tmp, and I changed "/var/tmp" to "/tmp". This reduced zfs
> root lock contention significantly. However, I still see a bunch
> of lock contention. So I'm here to ask for help.

Well, as a further attempt down this road, is it possible for you to rule out
ZFS from swapping - i.e. if RAM amounts permit, disable swap altogether
(swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of the
same, or better yet separate, disks?

If you do have lots of swapping activity going on in a zvol (that can be seen
in "vmstat 1" as the si/so columns), you're likely to get much fragmentation
in the pool, and searching for contiguous stretches of space can become
tricky (and time-consuming), or larger writes can get broken down into
many smaller random writes and/or "gang blocks", which is also slower.
At least such waiting on disks could explain the overall large kernel times.

You can also see the disk wait-time ratio in the "iostat -xzn 1" column "%w"
and the disk busy-time ratio in "%b" (second and third from the right).
I don't remember you posting that.

If these are counting in the tens of percent, or even close or equal to 100%,
then your disks are the actual bottleneck. Speeding up that subsystem,
including the addition of cache (ARC RAM, L2ARC SSD, maybe a ZIL
SSD/DDRDrive), and combatting fragmentation by moving swap and
other scratch spaces to dedicated pools or raw slices, might help.

HTH,
//Jim
Re: [zfs-discuss] volblocksize for VMware VMFS-5
On Mar 25, 2012, at 8:58 PM, Yuri Vorobyev wrote:

> Hello.
>
> What are the best practices for choosing the ZFS volume volblocksize setting
> for VMware VMFS-5?
> The VMFS-5 block size is 1MB. Not sure how it corresponds with ZFS.

Zero correlation. What I see on the wire from VMFS is 16KB random reads
followed by 4KB random writes. VMFS reads the same set of 16KB data again and
again, so it tends to get nicely cached in the MFU portion of the ARC.

> Setup details follow:
> - 11 pairs of mirrors;
> - 600GB 15k SAS disks;
> - SSDs for L2ARC and ZIL;
> - COMSTAR FC target;
> - about 30 virtual machines, mostly Windows (so the underlying file system is
>   NTFS with a 4k block);
> - 3 ESXi hosts.
>
> Also, I will be glad to hear volume layout suggestions.
> I see several options:
> - one big zvol with a size equal to the size of the pool;
> - one big zvol with a size of the pool minus 20% (to avoid fragmentation);

A zvol with a reservation is one way of implementing the reservations often
seen in other file systems.

> - several zvols (size?).

In general, for COMSTAR, more LUs is better.
 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
[zfs-discuss] test for holes in a file?
How can I test if a file on ZFS has holes, i.e. is a sparse file, using
the C API?

Olga
--
Olga Kryzhanovska
olga.kryzhanov...@gmail.com
http://twitter.com/fleyta
Solaris/BSD//C/C++ programmer
Re: [zfs-discuss] test for holes in a file?
2012/3/26 ольга крыжановская :
> How can I test if a file on ZFS has holes, i.e. is a sparse file,
> using the C API?

See SEEK_HOLE in lseek(2).

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
Mike, I was hoping that someone had a complete example of a

    bool has_file_one_or_more_holes(const char *path)

function.

Olga

2012/3/26 Mike Gerdts :
> 2012/3/26 ольга крыжановская :
>> How can I test if a file on ZFS has holes, i.e. is a sparse file,
>> using the C API?
>
> See SEEK_HOLE in lseek(2).
>
> --
> Mike Gerdts
> http://mgerdts.blogspot.com/

--
Olga Kryzhanovska
olga.kryzhanov...@gmail.com
http://twitter.com/fleyta
Solaris/BSD//C/C++ programmer
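A minimal sketch of such a helper, built on the SEEK_HOLE whence value for
lseek(2) that Mike pointed to. The function name is simply the one requested
above; the sketch assumes a kernel and file system that report holes (on file
systems that do not, SEEK_HOLE just returns the end-of-file offset, so the
file is reported as not sparse):

#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdbool.h>

/*
 * Sketch only: returns true if the file has at least one hole
 * before its end-of-file offset.
 */
bool
has_file_one_or_more_holes(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return false;                      /* cannot open: report "no holes" */

    off_t eof  = lseek(fd, 0, SEEK_END);   /* file length */
    off_t hole = lseek(fd, 0, SEEK_HOLE);  /* first hole at or after offset 0 */

    close(fd);

    /*
     * A hole strictly before EOF means the file is sparse. With no holes
     * (or no hole reporting by the file system), hole == eof; lseek()
     * errors show up as -1 and also fail this test.
     */
    return (hole >= 0 && eof > 0 && hole < eof);
}

The same logic appears in expanded form in Andrew Gabriel's program later in
the thread, which also reports empty and unreadable files separately.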
Re: [zfs-discuss] test for holes in a file?
ольга крыжановская wrote:
> How can I test if a file on ZFS has holes, i.e. is a sparse file,
> using the C API?

See star:

ftp://ftp.berlios.de/pub/star/

or

http://hg.berlios.de/repos/schillix-on/file/e3829115a7a4/usr/src/cmd/star/hole.c

The interface was defined for star in September 2004; star added support in
May 2005, after the interface was implemented.

Jörg
--
EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       j...@cs.tu-berlin.de (uni)
       joerg.schill...@fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] test for holes in a file?
I just played and knocked this up (note the stunning lack of comments, missing
optarg processing, etc)... Give it a list of files to check...

#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        int fd;

        fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
        } else {
            off_t eof;
            off_t hole;

            /* find the end-of-file offset, then rewind */
            if (((eof = lseek(fd, 0, SEEK_END)) < 0) ||
                lseek(fd, 0, SEEK_SET) < 0) {
                perror(argv[i]);
            } else if (eof == 0) {
                printf("%s: empty\n", argv[i]);
            } else {
                /* first hole at or after offset 0; equals eof if none */
                hole = lseek(fd, 0, SEEK_HOLE);
                if (hole < 0) {
                    perror(argv[i]);
                } else if (hole < eof) {
                    printf("%s: sparse\n", argv[i]);
                } else {
                    printf("%s: not sparse\n", argv[i]);
                }
            }
            close(fd);
        }
    }
    return 0;
}

On 03/26/12 10:06 PM, ольга крыжановская wrote:
> Mike, I was hoping that someone had a complete example of a
> bool has_file_one_or_more_holes(const char *path) function.
>
> Olga
>
> 2012/3/26 Mike Gerdts:
>> 2012/3/26 ольга крыжановская:
>>> How can I test if a file on ZFS has holes, i.e. is a sparse file,
>>> using the C API?
>>
>> See SEEK_HOLE in lseek(2).
>>
>> --
>> Mike Gerdts
>> http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
On Mon, 26 Mar 2012, Andrew Gabriel wrote:

> I just played and knocked this up (note the stunning lack of comments,
> missing optarg processing, etc)... Give it a list of files to check...

This is a cool program, but programmers were asking (and answering) this same
question 20+ years ago, before there was anything like SEEK_HOLE.

If the file's space usage is less than the file size reported in the directory
listing, then it must contain a hole. Even for compressed files, I am pretty
sure that Solaris reports the uncompressed space usage.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] test for holes in a file?
On Mon, Mar 26, 2012 at 6:18 PM, Bob Friesenhahn wrote:
> On Mon, 26 Mar 2012, Andrew Gabriel wrote:
>
>> I just played and knocked this up (note the stunning lack of comments,
>> missing optarg processing, etc)... Give it a list of files to check...
>
> This is a cool program, but programmers were asking (and answering) this same
> question 20+ years ago, before there was anything like SEEK_HOLE.
>
> If the file's space usage is less than the file size reported in the
> directory listing, then it must contain a hole. Even for compressed files,
> I am pretty sure that Solaris reports the uncompressed space usage.

That's not the case.

# zfs create -o compression=on rpool/junk
# perl -e 'print "foo" x 10' > /rpool/junk/foo
# ls -ld /rpool/junk/foo
-rw-r--r--   1 root     root          30 Mar 26 18:25 /rpool/junk/foo
# du -h /rpool/junk/foo
  16K   /rpool/junk/foo
# truss -t stat -v stat du /rpool/junk/foo
...
lstat64("foo", 0x08047C40)                      = 0
    d=0x02B90028 i=8     m=0100644 l=1  u=0     g=0     sz=30
        at = Mar 26 18:25:25 CDT 2012  [ 1332804325.742827733 ]
        mt = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
        ct = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
    bsz=131072 blks=32    fs=zfs

Notice that it says it has 32 512-byte blocks. The mechanism you suggest does
work for every other file system that I've tried it on.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
On Mar 26, 2012, at 4:18 PM, Bob Friesenhahn wrote:

> On Mon, 26 Mar 2012, Andrew Gabriel wrote:
>
>> I just played and knocked this up (note the stunning lack of comments,
>> missing optarg processing, etc)... Give it a list of files to check...
>
> This is a cool program, but programmers were asking (and answering) this same
> question 20+ years ago, before there was anything like SEEK_HOLE.
>
> If the file's space usage is less than the file size reported in the
> directory listing, then it must contain a hole. Even for compressed files,
> I am pretty sure that Solaris reports the uncompressed space usage.

+1

Also, prior to ZFS, you could look at the length of the file (ls -l, or the
stat struct st_size) and compare it to the size (ls -ls, or the stat struct
st_blocks). If length > size (unit adjusted, rounded up), then there are holes.

In ZFS, this can be more difficult, because the size can be larger than the
length (!) due to copies. Also, if you have compression enabled, the size can
be < length.
 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
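For reference, the pre-SEEK_HOLE heuristic described above boils down to a
stat(2) comparison like the sketch below (the helper name is hypothetical).
As the ZFS caveats in this thread show, compression can shrink st_blocks and
copies > 1 can inflate it, so on ZFS this is a hint rather than a proof:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdbool.h>

/*
 * Classic heuristic: if fewer bytes are allocated (st_blocks counts
 * 512-byte units) than the file length claims, something must be a hole.
 * On ZFS this is only a hint: compression can make st_blocks small
 * without any holes, and copies > 1 can make it larger than the length.
 */
bool
probably_has_holes(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;                 /* cannot stat: report "no holes" */

    return ((off_t)st.st_blocks * 512 < st.st_size);
}

Combining this with the SEEK_HOLE check earlier in the thread gives a
reasonable answer on both hole-aware and older file systems.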
Re: [zfs-discuss] webserver zfs root lock contention under heavy load
On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov wrote:
> Well, as a further attempt down this road, is it possible for you to rule
> out ZFS from swapping - i.e. if RAM amounts permit, disable swap altogether
> (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of the
> same, or better yet separate, disks?
>

Thanks Jim for your suggestion!

> If you do have lots of swapping activity going on in a zvol (that can be
> seen in "vmstat 1" as the si/so columns), you're likely to get much
> fragmentation in the pool, and searching for contiguous stretches of space
> can become tricky (and time-consuming), or larger writes can get broken
> down into many smaller random writes and/or "gang blocks", which is also
> slower. At least such waiting on disks could explain the overall large
> kernel times.

I took swapping activity into account; even when the CPU% is 100%, "si"
(swap-ins) and "so" (swap-outs) are always ZERO.

> You can also see the disk wait-time ratio in the "iostat -xzn 1" column
> "%w" and the disk busy-time ratio in "%b" (second and third from the
> right). I don't remember you posting that.
>
> If these are counting in the tens of percent, or even close or equal to
> 100%, then your disks are the actual bottleneck. Speeding up that
> subsystem, including the addition of cache (ARC RAM, L2ARC SSD, maybe a
> ZIL SSD/DDRDrive), and combatting fragmentation by moving swap and other
> scratch spaces to dedicated pools or raw slices, might help.

My storage system is not very busy, and there are only read operations.

# iostat -xnz 3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0

Thanks,
-Aubrey