On Tue, May 25, 2010 at 3:39 PM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
> Hi,
> [snip]
> > So I presume the problem could be caused by the fact that the second part is
> > a clone and not a resource? or a bug?
> > I can eventually send the whole config.
>
> Looks like a bug to me. Clone or resource, constraints should be
> observed. Perhaps it's a duplicate of this one:
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2422

I don't know if it is the same bug... In the sense that in the bug you linked the problem is related to:
- the stop operation not enforcing the order constraint
- the constraint being between a clone of a group and a clone of a resource

In my case the problem arises with:
- the start operation not enforcing the order constraint
- the constraint being between a group and a clone of a group

I verified the same behavior also when:
- the constraint is between a resource and a clone of a group

In both cases the clone of a group is the second part of the constraint; I have not tested the reverse order. So a common root cause for the problems, if one exists, could probably be related to the clone part...

In the meantime I put this workaround in place: I set my nfsclient resource with the mount options "bg,intr", the important part here being "bg" (with "bg", if the first mount attempt fails, mount keeps retrying in the background and still returns success to the caller). I think the "intr" option could also be useful in general, in case the cluster needs to perform some operations on a stuck mount. There is no "soft" option, so in my case "hard" should be the default, avoiding possible data corruption, correct?

So now it is:

primitive nfsclient ocf:heartbeat:Filesystem \
        params device="nfssrv:/nfsdata/web" directory="/usr/local/data" fstype="nfs" options="intr,bg" \
        op monitor interval="30" timeout="40" depth="0" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
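Just to make explicit what the Filesystem agent ends up doing with these options, the mount should be roughly equivalent to the following (illustrative only; device and mount point taken from the primitive above, with "hard" spelled out even though it is the default):

        # "bg": if the first mount attempt fails or times out, mount(8) forks a child that
        #        keeps retrying in the background, and the parent returns 0 to the caller
        # "intr": allows signals to interrupt hung NFS operations
        # "hard": requests are retried indefinitely instead of eventually erroring out ("soft")
        mount -t nfs -o bg,intr,hard nfssrv:/nfsdata/web /usr/local/data

The zero exit code from the backgrounded mount is what lets the cluster start operation "succeed" even though the NFS server resources are not up yet.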
This way the start of nfsclient succeeds (in terms of return codes ;-), then the first monitor of nfsclient fails, and the consequent stop/start of nfsclient succeeds, so that I can successfully start all the resources... See below the related log lines that show the behavior.

At the end, after I start corosync on one node (the other node being powered on but with corosync stopped), my cluster status is now correctly:

r...@node1 ~]# crm_mon -fr1
============
Last updated: Thu May 27 17:56:52 2010
Stack: openais
Current DC: node1. - partition WITHOUT quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ node1. ]
OFFLINE: [ node2. ]

Full list of resources:

 Master/Slave Set: NfsData
     Masters: [ node1. ]
     Stopped: [ nfsdrbd:1 ]
 Resource Group: nfs-group
     lv_nfsdata_drbd    (ocf::heartbeat:LVM):           Started node1.
     NfsFS              (ocf::heartbeat:Filesystem):    Started node1.
     VIPlbtest          (ocf::heartbeat:IPaddr2):       Started node1.
     nfssrv             (ocf::heartbeat:nfsserver):     Started node1.
 Clone Set: cl-pinggw
     Started: [ node1. ]
     Stopped: [ pinggw:1 ]
 Clone Set: apache_clone
     Started: [ node1. ]
     Stopped: [ apache_group:1 ]

Migration summary:
* Node node1.:
   pingd=200
   nfsclient:0: migration-threshold=1000000 fail-count=1

The related log excerpt:

May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Promote nfsdrbd:0 (Slave -> Master node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfsdrbd:1 (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource lv_nfsdata_drbd (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource NfsFS (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource VIPlbtest (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfssrv (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource pinggw:0 (Started node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource pinggw:1 (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start nfsclient:0 (node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start apachepoli:0 (node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfsclient:1 (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource apachepoli:1 (Stopped)
...
May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsdrbd:0:14: notify
...
May 27 17:26:30 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=43:2:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsclient:0:15: start
May 27 17:26:30 node1 Filesystem[32729]: INFO: Running start for viplbtest.:/nfsdata/web on /usr/local/data
May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsdrbd:0_notify_0 (call=14, rc=0, cib-update=41, confirmed=true) ok
May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action nfsdrbd:0_pre_notify_promote_0 (85) confirmed on node1. (rc=0)
May 27 17:26:30 node1 crmd: [32374]: info: te_pseudo_action: Pseudo action 23 fired and confirmed
...
May 27 17:26:30 node1 kernel: block drbd0: role( Secondary -> Primary )
May 27 17:26:30 node1 lrmd: [32371]: info: RA output: (nfsdrbd:0:promote:stdout)
May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsdrbd:0_promote_0 (call=16, rc=0, cib-update=42, confirmed=true) ok
May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action nfsdrbd:0_promote_0 (7) confirmed on node1. (rc=0)
...
May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_start_0 (call=15, rc=0, cib-update=45, confirmed=true) ok
May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_start_0 (43) confirmed on node1. (rc=0)
...
May 27 17:26:43 node1 pengine: [32373]: info: determine_online_status: Node node1. is online
May 27 17:26:43 node1 pengine: [32373]: notice: clone_print: Master/Slave Set: NfsData
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:     Masters: [ node1. ]
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:     Stopped: [ nfsdrbd:1 ]
May 27 17:26:43 node1 pengine: [32373]: notice: group_print: Resource Group: nfs-group
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:     lv_nfsdata_drbd (ocf::heartbeat:LVM): Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:     NfsFS (ocf::heartbeat:Filesystem): Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:     VIPlbtest (ocf::heartbeat:IPaddr2): Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:     nfssrv (ocf::heartbeat:nfsserver): Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: clone_print: Clone Set: cl-pinggw
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:     Started: [ node1. ]
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:     Stopped: [ pinggw:1 ]
May 27 17:26:43 node1 pengine: [32373]: notice: clone_print: Clone Set: apache_clone
May 27 17:26:43 node1 pengine: [32373]: notice: group_print:     Resource Group: apache_group:0
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:         nfsclient:0 (ocf::heartbeat:Filesystem): Started node1.
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:         apachepoli:0 (ocf::poli:apache2): Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:     Stopped: [ apache_group:1 ]
...
May 27 17:26:43 node1 pengine: [32373]: info: master_color: Promoting nfsdrbd:0 (Master node1.)
May 27 17:26:43 node1 pengine: [32373]: info: master_color: NfsData: Promoted 1 instances of a possible 1 to master
...
May 27 17:26:43 node1 lrmd: [32371]: info: rsc:lv_nfsdata_drbd:18: start
May 27 17:26:43 node1 crmd: [32374]: info: te_rsc_command: Initiating action 51: monitor nfsclient:0_monitor_30000 on node1. (local)
May 27 17:26:43 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=51:3:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_monitor_30000 )
May 27 17:26:43 node1 lrmd: [32371]: info: rsc:nfsclient:0:19: monitor
May 27 17:26:43 node1 LVM[411]: INFO: Activating volume group vg_nfsdata_drbd
May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_monitor_30000 (call=19, rc=7, cib-update=47, confirmed=false) not running
May 27 17:26:43 node1 crmd: [32374]: WARN: status_from_rc: Action 51 (nfsclient:0_monitor_30000) on node1. failed (target: 0 vs. rc: 7): Error
May 27 17:26:43 node1 crmd: [32374]: WARN: update_failcount: Updating failcount for nfsclient:0 on node1. after failed monitor: rc=7 (update=value++, time=1274974003)
...
May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_monitor_30000 (51) confirmed on node1. (rc=4)
May 27 17:26:43 node1 attrd: [32372]: info: find_hash_entry: Creating hash entry for fail-count-nfsclient:0
May 27 17:26:43 node1 attrd: [32372]: info: attrd_local_callback: Expanded fail-count-nfsclient:0=value++ to 1
May 27 17:26:43 node1 attrd: [32372]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nfsclient:0 (1)
May 27 17:26:43 node1 attrd: [32372]: info: attrd_perform_update: Sent update 17: fail-count-nfsclient:0=1
...
May 27 17:26:43 node1 LVM[411]: INFO: Reading all physical volumes. This may take a while... Found volume group "VolGroup00" using metadata type lvm2 Found volume group "vg_nfsdata_drbd" using metadata type lvm2
May 27 17:26:44 node1 LVM[411]: INFO: 1 logical volume(s) in volume group "vg_nfsdata_drbd" now active
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation lv_nfsdata_drbd_start_0 (call=18, rc=0, cib-update=48, confirmed=true) ok
May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action lv_nfsdata_drbd_start_0 (33) confirmed on node1. (rc=0)
...
May 27 17:26:44 node1 pengine: [32373]: info: determine_online_status: Node node1. is online
May 27 17:26:44 node1 pengine: [32373]: WARN: unpack_rsc_op: Processing failed op nfsclient:0_monitor_30000 on node1.: not running (7)
May 27 17:26:44 node1 pengine: [32373]: notice: clone_print: Master/Slave Set: NfsData
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:     Masters: [ node1. ]
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:     Stopped: [ nfsdrbd:1 ]
May 27 17:26:44 node1 pengine: [32373]: notice: group_print: Resource Group: nfs-group
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:     lv_nfsdata_drbd (ocf::heartbeat:LVM): Started node1.
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:     NfsFS (ocf::heartbeat:Filesystem): Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:     VIPlbtest (ocf::heartbeat:IPaddr2): Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:     nfssrv (ocf::heartbeat:nfsserver): Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: clone_print: Clone Set: cl-pinggw
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:     Started: [ node1. ]
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:     Stopped: [ pinggw:1 ]
May 27 17:26:44 node1 pengine: [32373]: notice: clone_print: Clone Set: apache_clone
May 27 17:26:44 node1 pengine: [32373]: notice: group_print:     Resource Group: apache_group:0
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:         nfsclient:0 (ocf::heartbeat:Filesystem): Started node1. FAILED
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:         apachepoli:0 (ocf::poli:apache2): Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:     Stopped: [ apache_group:1 ]
May 27 17:26:44 node1 pengine: [32373]: info: get_failcount: apache_clone has failed 1 times on node1.
May 27 17:26:44 node1 pengine: [32373]: notice: common_apply_stickiness: apache_clone can fail 999999 more times on node1. before being forced off
...
May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Recover resource nfsclient:0 (Started node1.)
May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Start apachepoli:0 (node1.)
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 38: start NfsFS_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=38:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=NfsFS_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:NfsFS:21: start
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 3: stop nfsclient:0_stop_0 on node1. (local)
...
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=3:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_stop_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:22: stop
...
May 27 17:26:44 node1 Filesystem[493]: INFO: Running start for /dev/vg_nfsdata_drbd/lv_nfsdata_drbd on /nfsdata
May 27 17:26:44 node1 Filesystem[502]: INFO: Running stop for viplbtest.:/nfsdata/web on /usr/local/data
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_stop_0 (call=22, rc=0, cib-update=50, confirmed=true) ok
May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_stop_0 (3) confirmed on node1. (rc=0)
...
May 27 17:26:44 node1 kernel: kjournald starting. Commit interval 5 seconds
May 27 17:26:44 node1 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
May 27 17:26:44 node1 kernel: EXT3 FS on dm-0, internal journal
May 27 17:26:44 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation NfsFS_start_0 (call=21, rc=0, cib-update=51, confirmed=true) ok
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 39: start VIPlbtest_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=39:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=VIPlbtest_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:VIPlbtest:23: start
...
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation VIPlbtest_start_0 (call=23, rc=0, cib-update=53, confirmed=true) ok
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 41: start nfssrv_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=41:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfssrv_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfssrv:25: start
...
May 27 17:26:44 node1 nfsserver[661]: INFO: Starting NFS server ...
May 27 17:26:44 node1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 27 17:26:44 node1 kernel: NFSD: starting 90-second grace period
May 27 17:26:44 node1 lrmd: [32371]: info: RA output: (nfssrv:start:stdout) rpc.statd version 1.0.9
May 27 17:26:44 node1 nfsserver[661]: INFO: NFS server started
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfssrv_start_0 (call=25, rc=0, cib-update=55, confirmed=true) ok
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 52: start nfsclient:0_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=52:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:26: start
May 27 17:26:45 node1 Filesystem[771]: INFO: Running start for viplbtest.:/nfsdata/web on /usr/local/data
May 27 17:26:46 node1 mountd[746]: authenticated mount request from 10.4.4.140:921 for /nfsdata/web (/nfsdata/web)
May 27 17:26:46 node1 mountd[746]: authenticated mount request from 10.4.4.140:999 for /nfsdata/web (/nfsdata/web)
May 27 17:26:46 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_start_0 (call=26, rc=0, cib-update=56, confirmed=true) ok
May 27 17:26:46 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_start_0 (52) confirmed on node1. (rc=0)
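One side effect of the workaround is that every cold start like this leaves fail-count=1 on nfsclient:0, as visible in the migration summary above. If I want to clear it once everything is up, something along these lines should do it (crm shell, cleaning up the whole clone so all instances are covered):

        # clears the failed-operation history and fail-count for the clone's instances
        crm resource cleanup apache_clone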