Is this on SLES by any chance? SUSE are about the only ones with knowledge in this area, I'm afraid.
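
If it happens again, it might help to quantify how fast that checkpoint message is really being logged, since syslog folds the repeats into "last message repeated N times". Below is a minimal Python sketch, not a tested tool. It assumes the classic syslog layout seen in the quoted snippet (15-character timestamp, one message per line once unwrapped). The names count_per_second and SAMPLE are made up for illustration; SAMPLE just reuses three lines from the quoted log.

```python
import re
from collections import Counter

# syslog folds identical consecutive messages into this summary line
REPEAT = re.compile(r"last message repeated (\d+) times")
NEEDLE = "Unable to open checkpoint"

# Three lines taken from the quoted log, unwrapped to one message per line
SAMPLE = [
    'May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist',
    "May 14 15:22:14 ocfs2_controld: last message repeated 199 times",
    "May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk",
]


def count_per_second(lines, needle=NEEDLE):
    """Map each 'Mon DD HH:MM:SS' stamp to the number of needle messages logged."""
    counts = Counter()
    prev_matched = False  # did the last real (non-summary) message match?
    for line in lines:
        stamp = line[:15]  # assumption: classic 15-character syslog timestamp
        repeat = REPEAT.search(line)
        if repeat:
            # A summary line only counts if the message it repeats matched
            if prev_matched:
                counts[stamp] += int(repeat.group(1))
            continue
        prev_matched = needle in line
        if prev_matched:
            counts[stamp] += 1
    return counts


if __name__ == "__main__":
    for stamp, n in sorted(count_per_second(SAMPLE).items()):
        print(stamp, n)
```

To run it against a real log, pass an open file instead of SAMPLE, for example count_per_second(open(path)) with whatever path your syslog writes to.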

On Tue, May 15, 2012 at 6:01 AM, Matthew O'Connor <m...@ecsorl.com> wrote:
> Hi!
>
> I ran into the issue of ocfs2_controld.pcmk consuming vast CPU again -
> twice, actually. The most recent happenstance was after a multi-node
> failure. One node stayed alive, two nodes had to be rebooted. After
> the reboots, one of the two came back without issue, and was able to
> mount the OCFS2 stores. The second node exhibited high-cpu usage on the
> ocfs2_controld.pcmk process, and could not mount the OCFS2 stores. The
> logs were being voraciously filled with the following message:
>
>     ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>
> This message was being output so frequently that syslogd was starting to
> rate-limit it. I suspect this accounts for the high CPU usage. After
> restarting the troubled node several times, I found the solution was to
> order the OCFS2/DLM resource group to stop, cluster-wide, and then
> restart it. Normal behavior followed. (In a prior post to the list, I
> referenced hard-killing the ocfs2_controld.pcmk process. This was a
> more graceful shutdown.)
>
> Attached are two strace outputs. I'm sorry I'm not very familiar with
> strace, so the value of these files may be questionable. If there is
> anything else I can provide the next time this happens, I'd be happy to
> do so! The log-f.txt file was generated with the -f option, and the
> log-fc.txt file was generated with -f -c.
>
> Here also is a snippet from the syslog, during the cluster-wide shutdown
> of the OCFS2/DLM group:
>
> May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:14 ocfs2_controld: last message repeated 199 times
> May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
> May 14 15:22:16 gw05 dlm_controld.pcmk: [3411]: notice: terminate_ais_connection: Disconnecting from AIS
> May 14 15:22:16 gw05 lrmd: [2993]: info: RA output: (p_dlm:2:stop:stderr) dlm_controld.pcmk: no process found
> May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:20 ocfs2_controld: last message repeated 199 times
> May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:26 ocfs2_controld: last message repeated 199 times
> May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:32 ocfs2_controld: last message repeated 199 times
> May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:38 ocfs2_controld: last message repeated 199 times
>
> One other interesting bit of log (well, to me), was this bit that
> occurred when I tried to manually mount the OCFS2 store on the afflicted
> server:
>
>     mount.ocfs2: Unable to access cluster service while trying to join the group
>
> One other note - I discovered I had not specified a monitor for either
> the pacemaker:o2cb or the pacemaker:controld RA. Could that have
> possibly triggered this issue?
>
> --
>
> Sincerely,
> Matthew O'Connor
>
> -----------------------------------------------------------------
> Sr. Software Engineer
> PGP/GPG Key: 0x55F981C4
> Fingerprint: E5DC A0F8 5A40 E4DA 2CE6 B5A2 014C 2CBF 55F9 81C4
>
> Engineering and Computer Simulations, Inc.
> 11825 High Tech Ave Suite 250
> Orlando, FL 32817
>
> Tel: 407-823-9991 x315
> Fax: 407-823-8299
> Email: m...@ecsorl.com
> Web: www.ecsorl.com
> -----------------------------------------------------------------
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org