Andrew Beekhof <[email protected]> writes:

> On 10/05/2013, at 11:37 PM, Ferenc Wagner <[email protected]> wrote:
>
>> An hour ago one node (n02) of our 4-node cluster started to shutdown.
>
> Someone, probably the init script, sent SIGTERM to pacemakerd.

Hi Andrew,

thanks for the reply!  Here I actually meant a full system shutdown, not
just a Pacemaker shutdown.  To quote the preceding part of our logs (the
first three messages below are typical of normal operation):

May 10 13:59:15 n02 lrmd: [10857]: debug: rsc:dlm:2 monitor[12] (pid 524)
May 10 13:59:15 n02 lrmd: [10857]: info: operation monitor[12] on dlm:2 for 
client 10860: pid 524 exited with return code 0
May 10 13:59:18 n02 corosync[10779]:   [QUORUM] got quorate request on 0x187de10
May 10 13:59:41 n02 shutdown[657]: shutting down for system halt
May 10 13:59:41 n02 init: Switching to runlevel: 0
May 10 13:59:41 n02 shutdown[674]: shutting down for system halt
May 10 13:59:42 n02 logd: [786]: debug: Stopping ha_logd with pid 6449
May 10 13:59:42 n02 logd: [786]: info: Waiting for pid=6449 to exit
May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: received SIGTERM
May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: waiting for 0 
messages to be read by write process
May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: sending SIGTERM to 
write process
May 10 13:59:42 n02 logd: [6484]: info: logd_term_write_action: received SIGTERM
May 10 13:59:42 n02 logd: [6484]: debug: Writing out 0 messages then quitting
May 10 13:59:42 n02 logd: [6484]: info: Exiting write process
May 10 13:59:42 n02 stunnel: LOG5[32668:139795269060352]: Terminated
May 10 13:59:42 n02 nrpe[6497]: Caught SIGTERM - shutting down...
May 10 13:59:42 n02 nrpe[6497]: Daemon shutdown
May 10 13:59:42 n02 rsyslogd-2177: imuxsock lost 4228 messages from pid 11083 
due to rate-limiting
May 10 13:59:42 n02 lvm[11083]: Got new connection on fd 5
May 10 13:59:42 n02 corosync[10779]:   [QUORUM] got quorate request on 0x187de10

>> No idea why.

The "shutting down for system halt" message appears twice, which makes
me suspect some BMC/ACPI malfunction, but I can find nothing in the BMC
event log, so this will be hard to confirm...

>> But during shutdown, it asked another node (n01) to shut down as
>> well:
>
> No it didn't.  It asked n01 to perform an orderly shutdown (resources
> first, then the pacemaker daemons) of n02.

This would make perfect sense, and n01 certainly started to migrate the
resources from n02:

May 10 13:59:42 n01 pengine: [15536]: notice: stage6: Scheduling Node n02 for 
shutdown
May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Stop    dlm:2#011(n02)
[...]
May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Migrate 
vm-alder#011(Started n02 -> n04)

But then suddenly:

May 10 13:59:45 n01 shutdown[31186]: shutting down for system halt
May 10 13:59:45 n01 init: Switching to runlevel: 0
[...]

If I understand you right, this can't be the result of any Pacemaker
action, which would be reassuring, as it makes no sense to me.  So it's
again those pesky goblins to blame, who spent the lunch break throwing
the power switches of our Fujitsu CX400 cluster box in the server
room. :-/

I wonder what should happen in such a situation, anyway.  Since n01 was
running only clone resources, it shut down quickly, while n02 got stuck
trying to migrate some resources (big virtual machines) away.  But I
couldn't log in to it anymore; even the serial console gave no login
prompt, possibly because most of the system had already shut down.  Yet
the cluster was still quorate, missing only n01...  I'll dig deeper into
the logs, but what should I expect?

>> May 10 13:59:42 n02 pacemakerd: [10851]: info: crm_signal_dispatch: Invoking 
>> handler for signal 15: Terminated
>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: pcmk_shutdown_worker: 
>> Shuting down Pacemaker
>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: stop_child: Stopping crmd: 
>> Sent -15 to process 10860
>> May 10 13:59:42 n02 crmd: [10860]: info: crm_signal_dispatch: Invoking 
>> handler for signal 15: Terminated
>> May 10 13:59:42 n02 crmd: [10860]: notice: crm_shutdown: Requesting 
>> shutdown, upper limit is 1200000ms
>> May 10 13:59:42 n02 crmd: [10860]: debug: crm_timer_start: Started Shutdown 
>> Escalation (I_STOP:1200000ms), src=50
>> May 10 13:59:42 n02 crmd: [10860]: debug: s_crmd_fsa: Processing I_SHUTDOWN: 
>> [ state=S_NOT_DC cause=C_SHUTDOWN origin=crm_shutdown ]
>> May 10 13:59:42 n02 crmd: [10860]: debug: do_fsa_action: actions:trace: 
>> #011// A_SHUTDOWN_REQ
>> May 10 13:59:42 n02 crmd: [10860]: info: do_shutdown_req: Sending shutdown 
>> request to n01
>> 
>> Then hell broke loose, and I'm still pondering over the logs,
>
> Look for resource stop actions that failed.

I probably disrupted the last migration operations by resetting n02, but
otherwise the cluster recovered really nicely once I managed to start
all nodes again.  So no stop actions failed, only migrate_froms.
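For the record, such failed migrate_from entries can be cleared so they
stop showing up in crm_mon; a sketch with crmsh (the resource name is
illustrative, substitute your own):

```shell
# Clear the failed operation history of the resource on all nodes;
# Pacemaker then re-probes and recomputes the resource's state.
crm resource cleanup vm-alder

# Equivalent lower-level invocation:
crm_resource --resource vm-alder --cleanup
```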

Another question surfaced, though, prompted by the apparent lack of
monitoring: how does Pacemaker use the action timeouts and intervals
specified in the resource agent metadata?  During configuration I got
some warnings about unspecified timeout values being below the
recommended ones; is that all?  Are no default timeouts taken from the
RA metadata, with some Pacemaker default used instead?  Is no monitor
action run at all unless I specify one explicitly?  It looks that way,
but then I wonder what that metadata is used for, exactly.
-- 
Thanks,
Feri.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems