I am not entirely sure what is causing the significant slowdown and the cluster outage, but from your description and the information you provided, you appear to be hitting fragmentation-related issues. What ocfs2 and kernel versions are you running, and what are the cluster size and block size of these volumes?
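For reference, something like the following should gather all of that (a minimal sketch; debugfs.ocfs2 ships with ocfs2-tools, and /dev/sde1 is just an example device taken from your mount output below):

  uname -r                              # kernel version
  modinfo ocfs2 | grep -i '^version'    # ocfs2 module version
  # superblock stats include "Block Size Bits" and "Cluster Size Bits";
  # exact labels may vary slightly between ocfs2-tools releases
  debugfs.ocfs2 -R "stats" /dev/sde1 | grep -i bits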
On 07/15/2013 01:33 PM, Gavin Jones wrote:
> Hello,
>
> We have a 16 node OCFS2 cluster used for web serving duties. Each
> node mounts (the same) 6 OCFS2 volumes. Shared data includes client
> files, application files for our webapp, log files, and configuration
> files. Storage is provided by 2x EqualLogic PS400E iSCSI SANs, each
> having 12 drives in a RAID50; the units are in a 'Group'.
>
> The problem we are having is that periodically, maybe once a week or
> so, several Apache processes on a handful of nodes get stuck in D
> state and are unable to recover. This greatly increases server load
> and causes more Apache processes to back up; OCFS2 starts complaining
> about unresponsive nodes, and before you know it, the cluster is down.
>
> This seems to occur most often when we are doing writes + reads; if it
> is just reads, the cluster hums along. However, when we need to update
> many files or remove lots of files (think temporary images) in
> addition to normal read activity, we have the above-mentioned problem.
>
> We have done some searching and found
> http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg05525.html
> which describes a similar problem with write activity. In that case,
> the problem was allocating contiguous space on a fragmented filesystem
> and the solution was to adjust the mount option 'localalloc'. We are
> wondering if we are in a similar position.
>
> Below is the output from the stat_sysdir_analyze.sh script mentioned
> in the link above, which analyzes stat_sysdir.sh output; I've included
> the two volumes that seem to be our 'problem' volumes.
>
> Volume 1:
> bash stat_sysdir_analyze.sh sde1-client-20130715.txt
>
> Number of | Contiguous
> clusters  | cluster size
> --------------------------
>      4549 | 510 and smaller
>      1825 | 511
>
> Volume 2:
> bash stat_sysdir_analyze.sh sdd1-data-20130715.txt
>
> Number of | Contiguous
> clusters  | cluster size
> --------------------------
>       175 | 510 and smaller
>        23 | 511
>
> Any evidence here of excessive fragmentation that tuning localalloc
> would help with?
>
> Also regarding localalloc, I notice it is different for the above two
> volumes on many of the nodes; I find this interesting, as the cluster
> is supposed to make an educated guess at this value.
> For instance:
>
> /dev/sda1 on /u/client type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
> /dev/sde1 on /u/data type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
>
> /dev/sdd1 on /u/client type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=9,coherency=full,user_xattr,noacl)
> /dev/sdb1 on /u/data type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
>
> /dev/sda1 on /u/client type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=11,coherency=full,user_xattr,noacl)
> /dev/sdc1 on /u/data type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
>
> /dev/sda1 on /u/client type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
> /dev/sdc1 on /u/data type ocfs2
> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=7,coherency=full,user_xattr,noacl)
>
> I'm not sure why the cluster would be picking different values
> depending on the node.
>
> Anyway, any opinions, advice, or tuning suggestions are greatly
> appreciated. This business of the cluster hanging is turning into
> quite a problem.
>
> I'll provide any other information upon request.
>
> Thanks,
>
> Gavin W. Jones
> Where 2 Get It, Inc.
>
> --
> "There has grown up in the minds of certain groups in this country the
> notion that because a man or corporation has made a profit out of the
> public for a number of years, the government and the courts are
> charged with the duty of guaranteeing such profit in the future, even
> in the face of changing circumstances and contrary to public interest.
> This strange doctrine is not supported by statute nor common law."
>
> ~Robert Heinlein
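In the meantime, two things worth checking. If your ocfs2-tools is recent enough (1.8 or later; an assumption, since you haven't told us the tools version), o2info can report free-space fragmentation directly, which is more direct than parsing the system directory:

  # bucket free space by contiguous-extent size; 1024 is an example
  # chunk size in KB
  o2info --freefrag 1024 /dev/sde1

And if fragmentation is confirmed, you can pin the local allocation window yourself instead of letting each node guess. localalloc is a mount-time option, so this needs an unmount/mount cycle on every node; the 16 MB below is only a placeholder to be tuned once we know your cluster size:

  umount /u/data
  mount -t ocfs2 -o _netdev,localalloc=16 /dev/sde1 /u/data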