Hi Gavin,

On 07/16/2013 01:17 PM, Gavin Jones wrote:
> Hello,
>
> Apologies for my earlier reply; I did not see the request for "Cluster
> size" as well as block size.
>
> According to o2info, cluster size is 65536.
>
> Thanks,
>
> Gavin W. Jones
> Where 2 Get It, Inc.
>
> On Tue, Jul 16, 2013 at 9:58 AM, Gavin Jones <gjo...@where2getit.com> wrote:
>> Hello,
>>
>> Block size: 4kB
>>
>> Kernel version: 3.4.6-2.10-default
>>
>> OCFS2: 1.5.0
>>
>> Distribution is openSUSE 12.2.
>>
>> Thanks,
>>
>> Gavin W. Jones
>> Where 2 Get It, Inc.
>>
>> On Mon, Jul 15, 2013 at 7:32 PM, Srinivas Eeda <srinivas.e...@oracle.com>
>> wrote:
>>> I am not entirely sure about significant slowdown and cluster outage.
>>> But from your description and information you provided, you are seeing
>>> fragmentation-related issues. What is the ocfs2/kernel version and what
>>> is the cluster size/block size of these volumes?
>>>
>>>
>>> On 07/15/2013 01:33 PM, Gavin Jones wrote:
>>>> Hello,
>>>>
>>>> We have a 16-node OCFS2 cluster used for web serving duties. Each
>>>> node mounts (the same) 6 OCFS2 volumes. Shared data includes client
>>>> files, application files for our webapp, log files, configuration
>>>> files. Storage is provided by 2x EqualLogic PS400E iSCSI SANs, each
>>>> having 12 drives in a RAID50; units are in a 'Group'.
>>>>
>>>> The problem we are having is that periodically, maybe once a week or
>>>> so, we get several Apache processes on a handful of nodes that get
>>>> stuck in D state and are unable to recover. This greatly increases
>>>> server load, causes more Apache processes to back up, OCFS2 starts
>>>> complaining about unresponsive nodes and before you know it, the
>>>> cluster is down.
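
To see where the stuck Apache processes described above are blocked, something along these lines could be run as root on an affected node. It is only a minimal sketch, assuming Python 3 is available and that the kernel exposes /proc/<pid>/stack; the function names (parse_stat, d_state_pids, read_stack, snapshot) are made up for illustration and are not part of any OCFS2 tool.

```python
import os
import time


def parse_stat(text):
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return comm, state


def d_state_pids(proc="/proc"):
    """Yield (pid, comm) for every process currently in state D."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            yield int(entry), comm


def read_stack(pid, proc="/proc"):
    """Return the kernel stack of pid, or a note if it cannot be read."""
    try:
        with open(os.path.join(proc, str(pid), "stack")) as f:
            return f.read()
    except OSError as exc:
        return f"<unreadable: {exc}>\n"


def snapshot(rounds=3, interval=10.0):
    """Print D-state stacks several times so changes over time show up."""
    for i in range(rounds):
        print("===", time.strftime("%Y-%m-%d %H:%M:%S"), "===")
        for pid, comm in d_state_pids():
            print(f"--- pid {pid} ({comm})")
            print(read_stack(pid))
        if i + 1 < rounds:
            time.sleep(interval)
```

Calling snapshot() while the hang is in progress and redirecting the output to a file would give a few timestamped samples per process, which makes it easy to tell whether a stack is changing or staying put.
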

This seems like a DLM issue. Could you provide the /proc/<pid>/stack of
a stuck process the next time the issue happens? Does it change over
time? If the process is indeed stuck waiting on a DLM lock, the debug
logs of DLM* might help (debugfs.ocfs2 -l).

>>>>
>>>> This seems to occur most often when we are doing writes + reads; if it
>>>> is just reads, the cluster hums along. However, when we need to update
>>>> many files or remove lots of files (think temporary images) in
>>>> addition to normal read activity, we have the above-mentioned problem.
>>>>
>>>> We have done some searching and found
>>>> http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg05525.html
>>>> which describes a similar problem with write activity. In that case,
>>>> the problem was allocating contiguous space on a fragmented filesystem
>>>> and the solution was to adjust the mount option 'localalloc'. We are
>>>> wondering if we are in a similar position.
>>>>
>>>> Below is the output from the stat_sysdir_analyze.sh script mentioned
>>>> in the link above, which analyzes stat_sysdir.sh output; I've included
>>>> the two volumes that seem to be our 'problem' volumes.
>>>>
>>>> Volume 1:
>>>> bash stat_sysdir_analyze.sh sde1-client-20130715.txt
>>>> Number |
>>>> of     |
>>>> clust. | Contiguous cluster size
>>>> --------------------------------
>>>>   4549   510 and smaller
>>>>   1825   511
>>>>
>>>> Volume 2:
>>>> bash stat_sysdir_analyze.sh sdd1-data-20130715.txt
>>>> Number |
>>>> of     |
>>>> clust. | Contiguous cluster size
>>>> --------------------------------
>>>>    175   510 and smaller
>>>>     23   511
>>>>
>>>> Any evidence here of excessive fragmentation that tuning localalloc
>>>> would help with?
>>>>
>>>> Also regarding localalloc, I notice it is different for the above two
>>>> volumes on many of the nodes; I find this interesting as the cluster
>>>> is supposed to make an educated guess on this value.
>>>> For instance:
>>>>
>>>> /dev/sda1 on /u/client type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
>>>> /dev/sde1 on /u/data type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
>>>>
>>>>
>>>> /dev/sdd1 on /u/client type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=9,coherency=full,user_xattr,noacl)
>>>> /dev/sdb1 on /u/data type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
>>>>
>>>>
>>>> /dev/sda1 on /u/client type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=11,coherency=full,user_xattr,noacl)
>>>> /dev/sdc1 on /u/data type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=5,coherency=full,user_xattr,noacl)
>>>>
>>>>
>>>> /dev/sda1 on /u/client type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=6,coherency=full,user_xattr,noacl)
>>>> /dev/sdc1 on /u/data type ocfs2
>>>> (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,localalloc=7,coherency=full,user_xattr,noacl)
>>>>
>>>> I'm not sure why the cluster would be picking different values
>>>> depending on the node.
>>>>
>>>> Anyway, any opinions, advice, tuning suggestions greatly appreciated.
>>>> This business of the cluster hanging is turning into quite a problem.
>>>>
>>>> I'll provide any other information upon request.
>>>>
>>>> Thanks,
>>>>
>>>> Gavin W. Jones
>>>> Where 2 Get It, Inc.
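
To compare the localalloc values across all 16 nodes without reading mount output by eye, a small script could tabulate them. This is a rough sketch under stated assumptions: each node's `mount` output has been collected as text with one mount per line (as `mount` prints it, not wrapped as in the mail above), and the node names used as dictionary keys are hypothetical labels. The helper names are made up for illustration.

```python
import re

# Matches lines such as:
#   /dev/sda1 on /u/client type ocfs2 (rw,relatime,...,localalloc=6,...)
# The closing parenthesis is optional so a truncated paste still parses.
MOUNT_RE = re.compile(r"^(\S+) on (\S+) type ocfs2 \(([^)]*)\)?\s*$")


def parse_ocfs2_mounts(text):
    """Return a list of (device, mountpoint, localalloc or None) tuples."""
    rows = []
    for line in text.splitlines():
        m = MOUNT_RE.match(line.strip())
        if not m:
            continue  # not an ocfs2 mount line
        device, mountpoint, opts = m.groups()
        localalloc = None
        for opt in opts.split(","):
            key, _, value = opt.partition("=")
            if key == "localalloc" and value.isdigit():
                localalloc = int(value)
        rows.append((device, mountpoint, localalloc))
    return rows


def localalloc_by_mountpoint(outputs):
    """Map mountpoint -> {node: localalloc}.

    outputs is a dict of node label -> that node's `mount` output text.
    """
    table = {}
    for node, text in outputs.items():
        for _device, mountpoint, localalloc in parse_ocfs2_mounts(text):
            table.setdefault(mountpoint, {})[node] = localalloc
    return table


def differing(table):
    """Keep only the mountpoints whose localalloc varies between nodes."""
    return {m: v for m, v in table.items() if len(set(v.values())) > 1}
```

Feeding it the first two nodes quoted above would report /u/client as differing (6 versus 9) and leave out /u/data (5 on both). Grouping by mount point rather than device name is deliberate, since the same volume shows up under different /dev names on different nodes.
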
>>>>
>>>> --
>>>> "There has grown up in the minds of certain groups in this country the
>>>> notion that because a man or corporation has made a profit out of the
>>>> public for a number of years, the government and the courts are
>>>> charged with the duty of guaranteeing such profit in the future, even
>>>> in the face of changing circumstances and contrary to public interest.
>>>> This strange doctrine is not supported by statute nor common law."
>>>>
>>>> ~Robert Heinlein
>>>>
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users@oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users

--
Goldwyn