----- Original Message -----
> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
> To: "pm" <pacemaker@oss.clusterlabs.org>
> Sent: Tuesday, December 17, 2013 5:43:53 AM
> Subject: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
>
> Hi,
>
> When repeated 'node standby' and 'node online', lrmd crashed with
> SIGSEGV because "op->id" in cancel_recurring_action() was NULL.

That's a really weird one... I don't see how it is possible for op->id to be NULL there. You might need to give valgrind a shot to detect whatever is really going on here.

-- Vossel

>
> Dec 17 19:01:21 vm3 crmd[2433]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Dec 17 19:01:21 vm3 crmd[2433]: info: do_te_invoke: Processing
> graph 437 (ref=pe_calc-dc-1387274481-5672) derived from
> /var/lib/pacemaker/pengine/pe-input-437.bz2
> Dec 17 19:01:21 vm3 crmd[2433]: notice: te_rsc_command: Initiating
> action 17: stop prmStonith4_stop_0 on vm3 (local)
> Dec 17 19:01:21 vm3 crmd[2433]: info: do_lrm_rsc_op: Performing
> key=17:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555
> op=prmStonith4_stop_0
> Dec 17 19:01:21 vm3 lrmd[2430]: info: log_execute: executing -
> rsc:prmStonith4 action:stop call_id:3487
> Dec 17 19:01:21 vm3 stonith-ng[2429]: info: stonith_command:
> Processed st_device_remove from lrmd.2430: OK (0)
> Dec 17 19:01:21 vm3 lrmd[2430]: info: log_finished: finished -
> rsc:prmStonith4 action:stop call_id:3487 exit-code:0 exec-time:0ms
> queue-time:0ms
> Dec 17 19:01:21 vm3 pengine[2432]: notice: process_pe_message:
> Calculated Transition 437: /var/lib/pacemaker/pengine/pe-input-437.bz2
> Dec 17 19:01:21 vm3 crmd[2433]: notice: te_rsc_command: Initiating
> action 33: stop prmPg_stop_0 on vm3 (local)
> Dec 17 19:01:21 vm3 lrmd[2430]: info: cancel_recurring_action:
> Cancelling operation prmPg_monitor_10000
> Dec 17 19:01:21 vm3 crmd[2433]: info: do_lrm_rsc_op: Performing
> key=33:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555 op=prmPg_stop_0
> Dec 17 19:01:21 vm3 lrmd[2430]: info: log_execute: executing -
> rsc:prmPg action:stop call_id:3489
> Dec 17 19:01:21 vm3 crmd[2433]: info: process_lrm_event: LRM
> operation prmStonith4_monitor_3600000 (call=3473, status=1,
> cib-update=0, confirmed=true) Cancelled
> Dec 17 19:01:21 vm3 crmd[2433]: notice: process_lrm_event: LRM
> operation prmStonith4_stop_0 (call=3487, rc=0, cib-update=3090,
> confirmed=true) ok
> Dec 17 19:01:21 vm3 crmd[2433]: info: process_lrm_event: LRM
> operation prmPg_monitor_10000 (call=3485, status=1, cib-update=0,
> confirmed=true) Cancelled
> Dec 17 19:01:21 vm3 crmd[2433]: info: match_graph_event: Action
> prmStonith4_stop_0 (17) confirmed on vm3 (rc=0)
> Dec 17 19:01:21 vm3 crmd[2433]: notice: te_rsc_command: Initiating
> action 40: stop prmPing_stop_0 on vm3 (local)
> Dec 17 19:01:21 vm3 cib[2428]: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=local/crmd/3090, version=0.440.2)
> Dec 17 19:01:21 vm3 stonith-ng[2429]: info: crm_client_destroy:
> Destroying 0 events
> Dec 17 19:01:21 vm3 pacemakerd[2424]: error: child_death_dispatch:
> Managed process 2430 (lrmd) dumped core
> Dec 17 19:01:21 vm3 pacemakerd[2424]: notice: pcmk_child_exit: Child
> process lrmd terminated with signal 11 (pid=2430, core=1)
> Dec 17 19:01:21 vm3 pacemakerd[2424]: notice: pcmk_process_exit:
> Respawning failed child process: lrmd
> Dec 17 19:01:21 vm3 pacemakerd[2424]: error: pcmk_process_exit:
> Rebooting system
> Dec 17 19:10:40 vm3 root: Mark:pcmk:1387275040
>
> $ gdb /usr/libexec/pacemaker/lrmd core.2430
> (gdb) bt
> #0 0x000000323f8480ac in vfprintf () from /lib64/libc.so.6
> #1 0x000000323f86f9d2 in vsnprintf () from /lib64/libc.so.6
> #2 0x0000003fcb81726d in qb_log_real_va_ (cs=0x3fcf208658,
> ap=0x7ffff6f5fc80) at log.c:230
> #3 0x0000003fcb8173ea in qb_log_real_ (cs=0x3fcf208658) at log.c:255
> #4 0x0000003fcf003a9c in cancel_recurring_action (op=0xb9fae0) at
> services.c:356
> #5 0x0000003fcf003bc6 in services_action_cancel (name=0xb9f350
> "prmPing", action=0xb9ee90 "monitor", interval=10000) at
> services.c:381
> #6 0x0000000000406595 in cancel_op (rsc_id=0xb9f350 "prmPing",
> action=0xb9ee90 "monitor", interval=10000) at lrmd.c:1197
> #7 0x00000000004067aa in process_lrmd_rsc_cancel (client=0xb926c0,
> id=7030, request=0xb95ad0) at lrmd.c:1261
> #8 0x0000000000406a51 in process_lrmd_message (client=0xb926c0,
> id=7030, request=0xb95ad0) at lrmd.c:1300
> #9 0x0000000000402a06 in lrmd_ipc_dispatch (c=0xb91af0,
> data=0x7f9f30acbc08, size=362) at main.c:141
> #10 0x0000003fcb8126f8 in _process_request_ (c=0xb91af0,
> ms_timeout=10) at ipcs.c:698
> #11 0x0000003fcb812ad5 in qb_ipcs_dispatch_connection_request (fd=5,
> revents=1, data=0xb91af0) at ipcs.c:801
> #12 0x0000003fcc0327b1 in gio_read_socket (gio=0xb92880,
> condition=G_IO_IN, data=0xb91138) at mainloop.c:437
> #13 0x0000003fc9c3feb2 in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> #14 0x0000003fc9c43d68 in ?? () from /lib64/libglib-2.0.so.0
> #15 0x0000003fc9c44275 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #16 0x00000000004030cc in main (argc=1, argv=0x7ffff6f606c8) at main.c:314
>
> Although I'm investigating the cause, I have not discovered yet...
>
> Because size was big, I put crm_report here.
> https://drive.google.com/file/d/0B9eNn1AWfKD4WGY5bllMQW1BbDA/edit?usp=sharing
>
> Best Regards,
> Kazunori INOUE
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
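
A side note on the backtrace above, as a minimal sketch rather than the actual Pacemaker code: frames #0 to #4 show the crash happening while cancel_recurring_action() formats its "Cancelling operation ..." log message. Passing a NULL pointer for %s is undefined behavior in C, so a log call can be guarded along the lines below. The names struct demo_action, printable() and format_cancel_msg() are hypothetical and exist only for this illustration. A guard like this would only hide the symptom; it does not explain why op->id would be NULL or otherwise invalid, which is what valgrind is meant to find.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a service action; only the field
 * discussed in this thread is modelled. */
struct demo_action {
    const char *id;   /* e.g. "prmPing_monitor_10000"; may be NULL */
};

/* Return a printable string even when s is NULL. */
const char *printable(const char *s)
{
    return s != NULL ? s : "<null>";
}

/* Format the "Cancelling operation ..." message into buf.
 * Returns whatever snprintf() returns: the length the full
 * message needs, excluding the terminating NUL byte. */
int format_cancel_msg(char *buf, size_t len, const struct demo_action *op)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "Cancelling operation %s",
                    printable(op != NULL ? op->id : NULL));
}
```

With this guard a missing id is logged as "<null>" instead of being handed to the formatter, and a NULL op is handled the same way.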