On Thu, May 27, 2010 at 6:41 PM, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:
> On Tue, May 25, 2010 at 3:39 PM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
>>
>> Hi,
>> [snip]
>> > So I presume the problem could be caused by the fact that the second
>> > part is a clone and not a resource? Or a bug?
>> > I can eventually send the whole config.
>>
>> Looks like a bug to me. Clone or resource, constraints should be
>> observed. Perhaps it's a duplicate of this one:
>> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2422
>>
>
> I don't know if it is the same bug...
> In your bug the problem is that:
> - the stop operation does not enforce the order constraint
> - the constraint is between a clone of a group and a clone of a resource
> In my case the problem is that:
> - the start operation does not enforce the order constraint
> - the constraint is between a group and a clone of a group
> I verified the same behaviour also when
> - the constraint is between a resource and a clone of a group
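For the record, the kind of constraint we're talking about here would, in crm shell syntax, look roughly like this (nfs-group and apache_clone are the names from your configuration below; the constraint id is made up, and this is only a sketch of the shape, not your exact statements):

    order web_after_nfs inf: nfs-group apache_clone

i.e. the clone of the apache group should only start once the NFS group is up, yet in your logs nfsclient:0 (a member of apache_group:0) is started before nfs-group has even begun starting.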
Can you submit a bug report and include an hb_report for this scenario please?

> Clone of a group is the second part of the two. Not tested in reverse.
> So if there is a common root cause for these problems, it is probably
> related to the clone part...
> In the meantime, in my case, I put this workaround:
> I set my nfsclient resource with the mount options "bg,intr",
> the important part here being the bg one.
> But I think the intr option could also be useful in general, in case the
> cluster needs to carry out some operations (there is no soft option, so
> hard should be the default in my case, avoiding possible data corruption,
> correct?)
> So now it is:
> primitive nfsclient ocf:heartbeat:Filesystem \
>         params device="nfssrv:/nfsdata/web" directory="/usr/local/data" fstype="nfs" options="intr,bg" \
>         op monitor interval="30" timeout="40" depth="0" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60"
> This way the start of nfsclient succeeds (in terms of return code ;-),
> then the first monitor for nfsclient fails, and the subsequent stop/start
> of nfsclient succeeds, so that all the resources end up started...
> See below the related lines that show the behaviour.
> In the end my cluster status, after I start corosync on one node (the
> other node being powered on but with corosync stopped), is now correct:
>
> [r...@node1 ~]# crm_mon -fr1
> ============
> Last updated: Thu May 27 17:56:52 2010
> Stack: openais
> Current DC: node1. - partition WITHOUT quorum
> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ node1. ]
> OFFLINE: [ node2. ]
>
> Full list of resources:
>
> Master/Slave Set: NfsData
>     Masters: [ node1. ]
>     Stopped: [ nfsdrbd:1 ]
> Resource Group: nfs-group
>     lv_nfsdata_drbd    (ocf::heartbeat:LVM):           Started node1.
>     NfsFS              (ocf::heartbeat:Filesystem):    Started node1.
>     VIPlbtest          (ocf::heartbeat:IPaddr2):       Started node1.
>     nfssrv             (ocf::heartbeat:nfsserver):     Started node1.
> Clone Set: cl-pinggw
>     Started: [ node1. ]
>     Stopped: [ pinggw:1 ]
> Clone Set: apache_clone
>     Started: [ node1. ]
>     Stopped: [ apache_group:1 ]
>
> Migration summary:
> * Node node1.: pingd=200
>    nfsclient:0: migration-threshold=1000000 fail-count=1
>
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Promote nfsdrbd:0 (Slave -> Master node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfsdrbd:1 (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource lv_nfsdata_drbd (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource NfsFS (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource VIPlbtest (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfssrv (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource pinggw:0 (Started node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource pinggw:1 (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start nfsclient:0 (node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start apachepoli:0 (node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfsclient:1 (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource apachepoli:1 (Stopped)
> ...
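A side note on why the bg workaround changes the behaviour: with bg, if the NFS server is not reachable on the first attempt, mount(8) backgrounds the retry and returns success immediately, so the Filesystem agent's start also reports success even though nothing is mounted yet; the first monitor then finds the directory unmounted (rc=7) and the cluster recovers the resource. What the agent effectively runs is something along these lines (a sketch using the device and directory from the quoted primitive, not the agent's literal code):

    mount -t nfs -o intr,bg nfssrv:/nfsdata/web /usr/local/data

So the "successful" start is really only the success of handing the mount off to a background retry.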
> May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsdrbd:0:14: notify
> ...
> May 27 17:26:30 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=43:2:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
> May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsclient:0:15: start
> May 27 17:26:30 node1 Filesystem[32729]: INFO: Running start for viplbtest.:/nfsdata/web on /usr/local/data
> May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsdrbd:0_notify_0 (call=14, rc=0, cib-update=41, confirmed=true) ok
> May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action nfsdrbd:0_pre_notify_promote_0 (85) confirmed on node1. (rc=0)
> May 27 17:26:30 node1 crmd: [32374]: info: te_pseudo_action: Pseudo action 23 fired and confirmed
> ...
> May 27 17:26:30 node1 kernel: block drbd0: role( Secondary -> Primary )
> May 27 17:26:30 node1 lrmd: [32371]: info: RA output: (nfsdrbd:0:promote:stdout)
> May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsdrbd:0_promote_0 (call=16, rc=0, cib-update=42, confirmed=true) ok
> May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action nfsdrbd:0_promote_0 (7) confirmed on node1. (rc=0)
> ...
> May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_start_0 (call=15, rc=0, cib-update=45, confirmed=true) ok
> May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_start_0 (43) confirmed on node1. (rc=0)
> ...
> May 27 17:26:43 node1 pengine: [32373]: info: determine_online_status: Node node1. is online
> May 27 17:26:43 node1 pengine: [32373]: notice: clone_print: Master/Slave Set: NfsData
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print: Masters: [ node1. ]
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print: Stopped: [ nfsdrbd:1 ]
> May 27 17:26:43 node1 pengine: [32373]: notice: group_print: Resource Group: nfs-group
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print: lv_nfsdata_drbd (ocf::heartbeat:LVM): Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print: NfsFS (ocf::heartbeat:Filesystem): Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print: VIPlbtest (ocf::heartbeat:IPaddr2): Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print: nfssrv (ocf::heartbeat:nfsserver): Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: clone_print: Clone Set: cl-pinggw
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print: Started: [ node1. ]
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print: Stopped: [ pinggw:1 ]
> May 27 17:26:43 node1 pengine: [32373]: notice: clone_print: Clone Set: apache_clone
> May 27 17:26:43 node1 pengine: [32373]: notice: group_print: Resource Group: apache_group:0
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print: nfsclient:0 (ocf::heartbeat:Filesystem): Started node1.
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print: apachepoli:0 (ocf::poli:apache2): Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print: Stopped: [ apache_group:1 ]
> ...
> May 27 17:26:43 node1 pengine: [32373]: info: master_color: Promoting nfsdrbd:0 (Master node1.)
> May 27 17:26:43 node1 pengine: [32373]: info: master_color: NfsData: Promoted 1 instances of a possible 1 to master
> ...
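If you want to double-check what the PE actually decided in that transition, you can replay the pe-input file it wrote around 17:26:30 with ptest, along these lines (option names from memory, so check ptest --help; the pengine directory and the file number are placeholders that depend on your build):

    ptest -VVV --xml-file /var/lib/pengine/pe-input-NN.bz2

That should show whether a start ordering edge between nfs-group and nfsclient:0 exists in the graph at all.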
> May 27 17:26:43 node1 lrmd: [32371]: info: rsc:lv_nfsdata_drbd:18: start
> May 27 17:26:43 node1 crmd: [32374]: info: te_rsc_command: Initiating action 51: monitor nfsclient:0_monitor_30000 on node1. (local)
> May 27 17:26:43 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=51:3:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_monitor_30000 )
> May 27 17:26:43 node1 lrmd: [32371]: info: rsc:nfsclient:0:19: monitor
> May 27 17:26:43 node1 LVM[411]: INFO: Activating volume group vg_nfsdata_drbd
> May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_monitor_30000 (call=19, rc=7, cib-update=47, confirmed=false) not running
> May 27 17:26:43 node1 crmd: [32374]: WARN: status_from_rc: Action 51 (nfsclient:0_monitor_30000) on node1. failed (target: 0 vs. rc: 7): Error
> May 27 17:26:43 node1 crmd: [32374]: WARN: update_failcount: Updating failcount for nfsclient:0 on node1. after failed monitor: rc=7 (update=value++, time=1274974003)
> ...
> May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_monitor_30000 (51) confirmed on node1. (rc=4)
> May 27 17:26:43 node1 attrd: [32372]: info: find_hash_entry: Creating hash entry for fail-count-nfsclient:0
> May 27 17:26:43 node1 attrd: [32372]: info: attrd_local_callback: Expanded fail-count-nfsclient:0=value++ to 1
> May 27 17:26:43 node1 attrd: [32372]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nfsclient:0 (1)
> May 27 17:26:43 node1 attrd: [32372]: info: attrd_perform_update: Sent update 17: fail-count-nfsclient:0=1
> ...
> May 27 17:26:43 node1 LVM[411]: INFO: Reading all physical volumes. This may take a while... Found volume group "VolGroup00" using metadata type lvm2 Found volume group "vg_nfsdata_drbd" using metadata type lvm2
> May 27 17:26:44 node1 LVM[411]: INFO: 1 logical volume(s) in volume group "vg_nfsdata_drbd" now active
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation lv_nfsdata_drbd_start_0 (call=18, rc=0, cib-update=48, confirmed=true) ok
> May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action lv_nfsdata_drbd_start_0 (33) confirmed on node1. (rc=0)
> ...
> May 27 17:26:44 node1 pengine: [32373]: info: determine_online_status: Node node1. is online
> May 27 17:26:44 node1 pengine: [32373]: WARN: unpack_rsc_op: Processing failed op nfsclient:0_monitor_30000 on node1.: not running (7)
> May 27 17:26:44 node1 pengine: [32373]: notice: clone_print: Master/Slave Set: NfsData
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print: Masters: [ node1. ]
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print: Stopped: [ nfsdrbd:1 ]
> May 27 17:26:44 node1 pengine: [32373]: notice: group_print: Resource Group: nfs-group
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print: lv_nfsdata_drbd (ocf::heartbeat:LVM): Started node1.
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print: NfsFS (ocf::heartbeat:Filesystem): Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print: VIPlbtest (ocf::heartbeat:IPaddr2): Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print: nfssrv (ocf::heartbeat:nfsserver): Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: clone_print: Clone Set: cl-pinggw
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print: Started: [ node1. ]
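For completeness, once the workaround has kicked in you can inspect and clear that failcount from the crm shell, roughly like this (use the full node name exactly as crm_mon reports it):

    crm resource failcount nfsclient:0 show node1
    crm resource cleanup nfsclient:0

Clearing it is mostly cosmetic here, since migration-threshold is 1000000, but it keeps the migration summary in crm_mon clean.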
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print: Stopped: [ pinggw:1 ]
> May 27 17:26:44 node1 pengine: [32373]: notice: clone_print: Clone Set: apache_clone
> May 27 17:26:44 node1 pengine: [32373]: notice: group_print: Resource Group: apache_group:0
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print: nfsclient:0 (ocf::heartbeat:Filesystem): Started node1. FAILED
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print: apachepoli:0 (ocf::poli:apache2): Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print: Stopped: [ apache_group:1 ]
> May 27 17:26:44 node1 pengine: [32373]: info: get_failcount: apache_clone has failed 1 times on node1.
> May 27 17:26:44 node1 pengine: [32373]: notice: common_apply_stickiness: apache_clone can fail 999999 more times on node1. before being forced off
> ...
> May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Recover resource nfsclient:0 (Started node1.)
> May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Start apachepoli:0 (node1.)
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 38: start NfsFS_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=38:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=NfsFS_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:NfsFS:21: start
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 3: stop nfsclient:0_stop_0 on node1. (local)
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=3:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_stop_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:22: stop
> ...
> May 27 17:26:44 node1 Filesystem[493]: INFO: Running start for /dev/vg_nfsdata_drbd/lv_nfsdata_drbd on /nfsdata
> May 27 17:26:44 node1 Filesystem[502]: INFO: Running stop for viplbtest.:/nfsdata/web on /usr/local/data
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_stop_0 (call=22, rc=0, cib-update=50, confirmed=true) ok
> May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_stop_0 (3) confirmed on node1. (rc=0)
> ...
> May 27 17:26:44 node1 kernel: kjournald starting. Commit interval 5 seconds
> May 27 17:26:44 node1 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
> May 27 17:26:44 node1 kernel: EXT3 FS on dm-0, internal journal
> May 27 17:26:44 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation NfsFS_start_0 (call=21, rc=0, cib-update=51, confirmed=true) ok
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 39: start VIPlbtest_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=39:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=VIPlbtest_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:VIPlbtest:23: start
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation VIPlbtest_start_0 (call=23, rc=0, cib-update=53, confirmed=true) ok
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 41: start nfssrv_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=41:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfssrv_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfssrv:25: start
> ...
> May 27 17:26:44 node1 nfsserver[661]: INFO: Starting NFS server ...
> May 27 17:26:44 node1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> May 27 17:26:44 node1 kernel: NFSD: starting 90-second grace period
> May 27 17:26:44 node1 lrmd: [32371]: info: RA output: (nfssrv:start:stdout) rpc.statd version 1.0.9
> May 27 17:26:44 node1 nfsserver[661]: INFO: NFS server started
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfssrv_start_0 (call=25, rc=0, cib-update=55, confirmed=true) ok
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 52: start nfsclient:0_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=52:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:26: start
> May 27 17:26:45 node1 Filesystem[771]: INFO: Running start for viplbtest.:/nfsdata/web on /usr/local/data
> May 27 17:26:46 node1 mountd[746]: authenticated mount request from 10.4.4.140:921 for /nfsdata/web (/nfsdata/web)
> May 27 17:26:46 node1 mountd[746]: authenticated mount request from 10.4.4.140:999 for /nfsdata/web (/nfsdata/web)
> May 27 17:26:46 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_start_0 (call=26, rc=0, cib-update=56, confirmed=true) ok
> May 27 17:26:46 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_start_0 (52) confirmed on node1. (rc=0)
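As for the hb_report itself, something along these lines should cover the window shown in the logs (the destination path is just an example; adjust the time window to match your transition):

    hb_report -f "2010/05/27 17:25" -t "2010/05/27 17:30" /tmp/start-order-bug

and attach the resulting tarball to the bugzilla entry together with the CIB.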