Hello,

We have a 16-node OCFS2 cluster used for web serving duties. Each node mounts the same 6 OCFS2 volumes. Shared data includes client files, application files for our webapp, log files, and configuration files. Storage is provided by 2x EqualLogic PS400E iSCSI SANs, each with 12 drives in RAID 50; the units are in a 'Group'.

The problem we are having is that periodically, maybe once a week or so, several Apache processes on a handful of nodes get stuck in D state and are unable to recover. This greatly increases server load and causes more Apache processes to back up; OCFS2 then starts complaining about unresponsive nodes, and before you know it the cluster is down.

This seems to occur most often when we are doing writes + reads; if it is just reads, the cluster hums along. However, when we need to update many files or remove lots of files (think temporary images) in addition to normal read activity, we have the above-mentioned problem.

We have done some searching and found http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg05525.html which describes a similar problem with write activity. In that case, the problem was allocating contiguous space on a fragmented filesystem, and the solution was to adjust the mount option 'localalloc'. We are wondering if we are in a similar position.

Below is the output from the stat_sysdir_analyze.sh script mentioned in the link above, which analyzes stat_sysdir.sh output; I've included the two volumes that seem to be our 'problem' volumes.

Volume 1:

    bash stat_sysdir_analyze.sh sde1-client-20130715.txt

    Number |
    of     |
    clust. | Contiguous cluster size
    --------------------------------
    4549     510 and smaller
    1825     511

Volume 2:

    bash stat_sysdir_analyze.sh sdd1-data-20130715.txt

    Number |
    of     |
    clust. | Contiguous cluster size
    --------------------------------
    175      510 and smaller
    23       511

Is there any evidence here of excessive fragmentation that tuning localalloc would help with?

Also regarding localalloc, I notice that its value for the above two volumes differs between many of the nodes; I find this interesting, as the cluster is supposed to make an educated guess at this value.
For instance:

    /dev/sda1 on /u/client type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
    /dev/sde1 on /u/data type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
    /dev/sdd1 on /u/client type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=9,coherency=full,user_xattr,noacl)
    /dev/sdb1 on /u/data type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
    /dev/sda1 on /u/client type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=11,coherency=full,user_xattr,noacl)
    /dev/sdc1 on /u/data type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
    /dev/sda1 on /u/client type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
    /dev/sdc1 on /u/data type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=7,coherency=full,user_xattr,noacl)

I'm not sure why the cluster would pick different values depending on the node.

Anyway, any opinions, advice, or tuning suggestions are greatly appreciated. This business of the cluster hanging is turning into quite a problem. I can provide any other information on request.

Thanks,

Gavin W. Jones
Where 2 Get It, Inc.
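
P.S. In case it helps anyone compare nodes, here is a rough Python sketch (an illustration, not something we run in production) that groups the localalloc values found in mount output by mount point. The function name localalloc_by_mountpoint and the SAMPLE lines are made up for this example; the sample is abbreviated from the mount output above, and the regular expression assumes the usual "device on mountpoint type ocfs2 (options)" layout.

```python
import re
from collections import defaultdict

# Assumed layout of one line of `mount` output:
#   /dev/sda1 on /u/client type ocfs2 (rw,...,localalloc=6,...)
# The closing parenthesis is optional so a truncated paste still parses.
MOUNT_RE = re.compile(r"^(\S+) on (\S+) type ocfs2 \(([^)]*)\)?\s*$")


def localalloc_by_mountpoint(lines):
    """Return {mountpoint: [localalloc values seen]} for OCFS2 mount lines."""
    result = defaultdict(list)
    for line in lines:
        match = MOUNT_RE.match(line.strip())
        if not match:
            continue  # not an OCFS2 mount line
        _device, mountpoint, options = match.groups()
        for option in options.split(","):
            key, _, value = option.partition("=")
            if key == "localalloc" and value.isdigit():
                result[mountpoint].append(int(value))
    return dict(result)


# Abbreviated sample lines (hypothetical subset of the real option lists).
SAMPLE = [
    "/dev/sda1 on /u/client type ocfs2 (rw,relatime,localalloc=6,coherency=full)",
    "/dev/sde1 on /u/data type ocfs2 (rw,relatime,localalloc=5,coherency=full)",
    "/dev/sdd1 on /u/client type ocfs2 (rw,relatime,localalloc=9,coherency=full)",
    "/dev/sdb1 on /u/data type ocfs2 (rw,relatime,localalloc=5,coherency=full)",
    "tmpfs on /run type tmpfs (rw)",
]

for mountpoint, values in sorted(localalloc_by_mountpoint(SAMPLE).items()):
    print(mountpoint, "localalloc values:", values, "distinct:", sorted(set(values)))
```

Feeding it the concatenated mount output from every node should make it easy to see which volumes get a consistent value and which do not.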

--
"There has grown up in the minds of certain groups in this country the notion that because a man or corporation has made a profit out of the public for a number of years, the government and the courts are charged with the duty of guaranteeing such profit in the future, even in the face of changing circumstances and contrary to public interest. This strange doctrine is not supported by statute nor common law." ~Robert Heinlein