On 3/5/2013 3:27 AM, Jeremy Chadwick wrote:
> On Tue, Mar 05, 2013 at 09:12:47AM -0000, Steven Hartland wrote:
>> ----- Original Message ----- From: "Jeremy Chadwick" <j...@koitsu.org>
>> To: "Ben Morrow" <b...@morrow.me.uk>
>> Cc: <freebsd-stable@freebsd.org>
>> Sent: Tuesday, March 05, 2013 5:32 AM
>> Subject: Re: ZFS "stalls" -- and maybe we should be talking about
>> defaults?
>>
>>> On Tue, Mar 05, 2013 at 05:05:47AM +0000, Ben Morrow wrote:
>>>> Quoth Karl Denninger <k...@denninger.net>:
>>>>> Note that the machine is not booting from ZFS -- it is booting from
>>>>> and has its swap on a UFS 2-drive mirror (handled by the disk
>>>>> adapter; looks like a single "da0" drive to the OS) and that drive
>>>>> stalls as well when it freezes.  It's definitely a kernel thing when
>>>>> it happens, as the OS would otherwise not have locked (just I/O to
>>>>> the user partitions) -- but it does.
>>>>
>>>> Is it still the case that mixing UFS and ZFS can cause problems, or
>>>> were they all fixed?  I remember a while ago (before the ARC usage
>>>> monitoring code was added) there were a number of reports of serious
>>>> problems running an rsync from UFS to ZFS.
>>>
>>> This problem still exists on stable/9.  The behaviour manifests itself
>>> as fairly bad performance (I cannot remember if stalling or if just
>>> throughput rates were awful).  I can only speculate as to what the
>>> root cause is, but my guess is that it has something to do with the
>>> two caching systems (UFS vs. ZFS ARC) fighting over large sums of
>>> memory.
>>
>> In our case we have no UFS, so this isn't the cause of the stalls.
>> Spec here is:
>>
>> * 64GB RAM
>> * LSI 2008
>> * 8.3-RELEASE
>> * Pure ZFS
>> * Trigger: MySQL doing a DB import, nothing else running
>> * 4K disk alignment
>
> 1. Is compression enabled?  Has it ever been enabled (on any fs) in the
> past (barring the pool being destroyed + recreated)?
>
> 2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
> (barring the pool being destroyed + recreated)?
>
> I can speculate day and night about what could cause this kind of
> issue, honestly.  The possibilities are quite literally infinite, and
> all of them require folks deeply familiar with both FreeBSD's ZFS as
> well as very key/major parts of the kernel (ranging from the VM to
> interrupt handlers to the I/O subsystem).  (This next comment isn't
> for you, Steve, you already know this :-) )  The way different pieces
> of the kernel interact with one another is fairly complex; the kernel
> is not simple.
>
> Things I think might prove useful:
>
> * Describing the stall symptoms; what all does it impact?  Can you
>   switch VTYs on the console when it's happening?  Network I/O (e.g.
>   SSH'd into the same box and just holding down a letter) showing
>   stalls then catching up?  Things of this nature.

When it happens on my system, anything that is CPU-bound continues to
execute.  I can switch consoles, and network I/O also works.  If I have
an iostat running at the time, all I/O counters go to and remain at
zero while the stall is occurring, but the process producing the iostat
continues to run and emit characters, whether it is an ssh session or
on the physical console.

The CPUs are running and processing, but all threads block if they
attempt to access the disk I/O subsystem, irrespective of which portion
of it they touch (e.g. UFS, swap or ZFS).  I therefore cannot start any
new process that requires image activation.
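If anyone wants to catch these in a log, something like the following
is enough (a rough sketch, not my exact command; pick your own interval
and devices):

    #!/bin/sh
    # Prefix every iostat sample with a wall-clock timestamp so a run
    # of all-zero counters can be measured after the fact.  Redirect
    # the output somewhere that is NOT on the stalling pool, or the
    # logger will block along with everything else.
    iostat -x -w 1 | while read -r line; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
    done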
> * How long the stall is in duration (ex. if there's some way to
>   roughly calculate this using "date" in a shell script)

They're variable.  Some last fractions of a second and are not really
all that noticeable unless you happen to be paying CLOSE attention.
Some last a few (5 or so) seconds.  The really bad ones last long
enough that the kernel throws the message "swap_pager: indefinite wait
buffer".

The machine in the general sense never pages.  It contains 12GB of RAM
and historically (prior to ZFS being put into service) always showed
"0" for a "pstat -s", although it does have a 20GB raw swap partition
(on /dev/da0s1b, not on a zpool) allocated.  During the stalls I cannot
run a pstat (I tried; it stalls), but when it unlocks I find that there
is swap allocated, albeit not a ridiculous amount -- some 20,000 pages
have made it to the swap partition.  This is not behavior I had seen on
this machine prior to the stall problem, and with the two tuning tweaks
discussed here I'm now up to 48 hours without any allocation to swap
(or any stalls).

> * Contents of /etc/sysctl.conf and /boot/loader.conf (re: "tweaking"
>   of the system)

/boot/loader.conf:

kern.ipc.semmni=256
kern.ipc.semmns=512
kern.ipc.semmnu=256
geom_eli_load="YES"
sound_load="YES"
#
# Limit to physical CPU count for threads
#
kern.geom.eli.threads=8
#
# ZFS prefetch does help, although you'd think it would not due to the
# adapter doing it already.  Wrong guess; it's good for 2x the
# performance.  We limit the ARC to 2GB of RAM and the TXG write limit
# to 1GB.
#
#vfs.zfs.prefetch_disable="1"
vfs.zfs.arc_max=2000000000
vfs.zfs.write_limit_override=1024000000

--------------------------------

The first three are required for Postgres.  The geli thread limit has
been found to provide better performance under heavy load, as the
system will otherwise start 16 threads per geli-attached provider since
the CPUs support hyperthreading.  The two ZFS-related entries at the
end, if present, stop the stalls.

Geli is not used on the boot pack; da0 is an old-style MBR disk that is
physically comprised of two 300GB drives in a mirror managed by the
adapter.  Swap resides on the traditional "b" slice of that pack; it is
a reasonably standard "old-style" setup in that regard, with separate
root, /home, /var and /usr slices.

sysctl.conf contains:

# $FreeBSD: src/etc/sysctl.conf,v 1.8 2003/03/13 18:43:50 mux Exp $
#
#  This file is read when going to multi-user and its contents piped
#  thru ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for
#  details.
#
# Uncomment this to prevent users from seeing information about
# processes that are being run under another UID.
#security.bsd.see_other_uids=0
#
# Tuning for PostgreSQL
#
kern.ipc.shm_use_phys=1
kern.ipc.shmmax=4096000000
kern.ipc.shmall=1000000
kern.ipc.semmsl=512
kern.ipc.semmap=256
#
# IP performance
#
kern.ipc.somaxconn=4096
kern.ipc.nmbclusters=32768
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
net.inet.tcp.inflight.enable=1
#
# Tune for asshole (DDOS) resistance
#
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.icmp.icmplim=10
net.inet.tcp.icmp_may_rst=0
net.inet.tcp.drop_synfin=1
net.inet.tcp.msl=7500
#
# Maxfiles
#
kern.maxfiles=65535
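Incidentally, a quick sanity check that the two ZFS caps in loader.conf
actually took effect at boot (sysctl names as I recall them on my box;
verify against your own system):

    # The configured ceilings...
    sysctl vfs.zfs.arc_max vfs.zfs.write_limit_override
    # ...and what the ARC is actually holding right now.
    sysctl kstat.zfs.misc.arcstats.size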
I suspect (but can't yet prove) that wiring shared memory is involved
in this.  That makes a BIG difference in Postgres performance, but I
can certainly see where a misbehaving ARC cache could "think" that the
(rather large) shared segment Postgres has (it currently allocates
1.5GB of shared memory and wires it) might "get out of the way."  It
most certainly won't with kern.ipc.shm_use_phys set.

In normal operation that Postgres server is a hot-spare replication
machine that connects to Asheville; in the event of a catastrophic
failure there it would be promoted and the load would shift here.

> * "sysctl -a | grep zfs" before and after a stall -- do not bother
>   with those "ARC summaries" scripts please, at least not for this
> * "vmstat -z" before and after a stall
> * "vmstat -m" before and after a stall
> * "vmstat -s" before and after a stall
> * "vmstat -i" before, after, AND during a stall
>
> Basically, every person who experiences this problem needs to treat
> every situation uniquely -- no "me too" -- and try to find reliable
> 100% test cases for it.  That's the only way bugs of this nature
> (i.e. of a complex nature) get fixed.

I am fortunate enough to have an identical machine that's "cold" in the
rack and will make the effort to spin that up today; I'm going to
attach another pack to the backup and allow it to resilver, then use
that "in anger" to restore the spare box.  I'm quite sure I can
reproduce the workload that causes the stalls; populating the backup
pack as a separate ZFS pool (with zfs send | zfs recv) was what led to
it happening here originally.

With that said, I've got more than 24 hours on the box that exhibited
the problem with the two tunables in /boot/loader.conf and a sentinel
process doing a "zpool iostat 5" looking for more than one "all zeros"
I/O line in a row (roughly the script sketched below, after my
signature).  It hasn't happened since I stuck those two lines in there,
and at this point two nightly backup runs have gone to completion,
along with some fairly heavy user I/O last evening, which was plenty of
load to provoke the misbehavior previously.

-- 
Karl Denninger
The Market Ticker ® <http://market-ticker.org>
Cuda Systems LLC
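P.S.  For anyone who wants to replicate the sentinel, this is
approximately what it does (pool name, log location and the diagnostics
grabbed are illustrative, not my exact script):

    #!/bin/sh
    # Watch "zpool iostat 5" for more than one consecutive all-zero
    # sample and note when it happens.
    POOL=pool
    LOG=/var/log/zfs-stall.log
    ZEROS=0

    zpool iostat "$POOL" 5 |
    while read -r name alloc free rops wops rbw wbw; do
        # Data lines begin with the pool name; skip the header lines.
        # (The very first sample is a since-boot summary; harmless.)
        [ "$name" = "$POOL" ] || continue
        if [ "$rops" = "0" ] && [ "$wops" = "0" ] && \
           [ "$rbw" = "0" ] && [ "$wbw" = "0" ]; then
            ZEROS=$((ZEROS + 1))
        else
            ZEROS=0
        fi
        if [ "$ZEROS" -gt 1 ]; then
            # The timestamp is taken at detection; the writes below may
            # block until the stall clears, which still yields an
            # "after" snapshot per Jeremy's list.
            echo "$(date '+%Y-%m-%d %H:%M:%S') possible stall" >> "$LOG"
            vmstat -z >> "$LOG" 2>&1
            vmstat -i >> "$LOG" 2>&1
        fi
    done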