On Fri, Jul 20, 2012 at 6:39 PM, Виталий Давудов <vitaliy.davu...@vts24.ru> wrote:
> Hi, David!
>
> Yes, you are right, I'm trying to do active call failover. I hope to achieve
> 3 secs of silence during the call (now it's 5 secs). If there is any kind of
> directive in corosync to monitor the node more aggressively (every 1 sec),
> I'll be very happy.
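For rough intuition about where those seconds come from: the detection window is governed mainly by the totem token and consensus timeouts. A back-of-the-envelope sketch (my own approximation, not an exact model of the totem protocol; it assumes the corosync 1.x documented default of consensus = 1.2 * token):

```python
def detection_estimate_ms(token_ms, consensus_ms=None):
    """Rough worst-case milliseconds before the surviving node
    declares the peer failed and forms a new configuration:
    one token timeout to notice the lost token, plus up to one
    consensus timeout to settle the new membership."""
    if consensus_ms is None:
        # corosync.conf(5) documents consensus defaulting to 1.2 * token
        consensus_ms = round(1.2 * token_ms)
    return token_ms + consensus_ms

print(detection_estimate_ms(1000))  # stock token of 1000 ms -> 2200
print(detection_estimate_ms(500))   # token: 500 as in the quoted config -> 1100
```

So shrinking token is the main lever, at the cost of a higher risk of false failovers on a congested network.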
man corosync.conf has a few. I'm guessing you need to further tune one or
more of: token, token_retransmit, hold, token_retransmits_before_loss_const,
join, send_join, consensus, merge, downcheck, fail_recv_const.

> 19.07.2012 18:43, David Vossel wrote:
>
>> ----- Original Message -----
>>>
>>> From: "Виталий Давудов" <vitaliy.davu...@vts24.ru>
>>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>> Sent: Thursday, July 19, 2012 8:08:12 AM
>>> Subject: Re: [Pacemaker] Pengine behavior
>>>
>>> Hi!
>>>
>>> I have moved my cluster from heartbeat to corosync.
>>> Here is the corosync.conf content:
>>>
>>> compatibility: whitetank
>>>
>>> totem {
>>>         version: 2
>>>         token: 500
>>>         downcheck: 500
>>>         secauth: off
>>>         threads: 0
>>>         interface {
>>>                 ringnumber: 0
>>>                 bindnetaddr: 10.10.1.0
>>>                 mcastaddr: 226.94.1.1
>>>                 mcastport: 5405
>>>         }
>>> }
>>>
>>> logging {
>>>         fileline: off
>>>         to_stderr: no
>>>         to_logfile: yes
>>>         to_syslog: yes
>>>         logfile: /var/log/corosync.log
>>>         debug: on
>>>         timestamp: on
>>>         logger_subsys {
>>>                 subsys: AMF
>>>                 debug: off
>>>         }
>>> }
>>>
>>> amf {
>>>         mode: disabled
>>> }
>>>
>>> quorum {
>>>         provider: corosync_votequorum
>>>         expected_votes: 1
>>> }
>>>
>>> The Pacemaker configuration is not changed.
>>>
>>> After the first node crashed, in corosync.log I can see that monitoring
>>> stopped at 15:15:24 (i.e.
>>> the node crashed at 15:15:24):
>>>
>>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: monitor
>>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output: (fs:monitor:stdout) OK
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: monitor
>>> Jul 19 15:53:24 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: monitor
>>> Jul 19 15:55:00 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
>>> Jul 19 15:55:00 corosync [MAIN ] Corosync built-in features: nss rdma
>>>
>>> On the second node, in corosync.log:
>>>
>>> Jul 19 15:53:27 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
>>> Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new configuration.
>>> Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes).
>>> Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes).
>>> Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
>>> Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.
>>>
>>> I.e. the second node detected the crash after 3 secs.
>>>
>>> Is there any way to reduce this amount of time?
>>>
>> Are you trying to do active call failover or something? How quickly do
>> you need this failure detected? Are you hoping the failover will just be a
>> blip in the audio? There may be a way to monitor the node more aggressively
>> with some sort of ping... but less than 3 seconds is very aggressive.
>>
>> I haven't dealt with trying to optimize this to the point you are probably
>> needing.
>> Hopefully someone else has some ideas. I'm sure you have more potential
>> for optimization using the corosync stack, though.
>>
>> -- Vossel
>>
>>> Thanks in advance for all your hints.
>>>
>>> 12.07.2012 10:47, Виталий Давудов wrote:
>>>
>>> David, thanks for your answer!
>>>
>>> I'll try to migrate to corosync.
>>>
>>> 11.07.2012 22:40, David Vossel wrote:
>>>
>>> ----- Original Message -----
>>>
>>> From: "Виталий Давудов" <vitaliy.davu...@vts24.ru>
>>> To: pacemaker@oss.clusterlabs.org
>>> Sent: Wednesday, July 11, 2012 7:34:08 AM
>>> Subject: [Pacemaker] Pengine behavior
>>>
>>> Hi, list!
>>>
>>> I have configured a cluster for a VoIP application.
>>> Here is my configuration:
>>>
>>> # crm configure show
>>> node $id="552f91eb-e70a-40a5-ac43-cb16e063fdba" freeswitch1 \
>>>         attributes standby="off"
>>>
>>> Ah... right here is your problem. You are using freeswitch instead of
>>> Asterisk :P
>>>
>>> node $id="c86ab64d-26c4-4595-aa32-bf9d18f714e7" freeswitch2 \
>>>         attributes standby="off"
>>> primitive FailoverIP1 ocf:heartbeat:IPaddr2 \
>>>         params iflabel="FoIP1" ip="91.211.219.142" cidr_netmask="30" nic="eth1.50" \
>>>         op monitor interval="1s"
>>> primitive FailoverIP2 ocf:heartbeat:IPaddr2 \
>>>         params iflabel="FoIP2" ip="172.30.0.1" cidr_netmask="16" nic="eth1.554" \
>>>         op monitor interval="1s"
>>> primitive FailoverIP3 ocf:heartbeat:IPaddr2 \
>>>         params iflabel="FoIP3" ip="10.18.1.1" cidr_netmask="24" nic="eth1.552" \
>>>         op monitor interval="1s"
>>> primitive fs lsb:FSSofia \
>>>         op monitor interval="1s" enabled="false" timeout="2s" on-fail="standby" \
>>>         meta target-role="Started"
>>> group HAServices FailoverIP1 FailoverIP2 FailoverIP3 \
>>>         meta target-role="Started"
>>> order FS-after-IP inf: HAServices fs
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.0.12-unknown" \
>>>         cluster-infrastructure="Heartbeat" \
>>>         stonith-enabled="false" \
>>>         expected-quorum-votes="1" \
>>>         no-quorum-policy="ignore" \
>>>         last-lrm-refresh="1299964019"
>>> rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="100"
>>>
>>> When the 1st node crashed, the 2nd node became active. During this
>>> process I found these lines in the ha-debug file:
>>>
>>> ...
>>> Jul 06 17:16:42 freeswitch1 crmd: [3385]: info: start_subsystem: Starting sub-system "pengine"
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: Invoked: /usr/lib64/heartbeat/pengine
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: main: Starting pengine
>>> Jul 06 17:16:46 freeswitch1 crmd: [3385]: info: do_dc_takeover: Taking over DC status for this partition
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_readwrite: We are now in R/W mode
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/11, version=0.391.20): ok (rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/12, version=0.391.20): ok (rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.391.20): ok (rc=0)
>>> ...
>>>
>>> After "Starting pengine", the next action occurred only 4 seconds later.
>>> What happens during this time? Is it possible to reduce it?
>>>
>>> I seem to remember seeing something related to this in the code at
>>> one point. I believe it is limited only to the use of heartbeat as
>>> the messaging layer. After starting the pengine, the crmd sleeps,
>>> waiting for the pengine to start before contacting it. The sleep is
>>> just a guess at how long it will take before the pengine will be up
>>> and ready to accept a connection, though. That's why it is so long...
>>> so the gap will hopefully be large enough that no one will ever run
>>> into any problems with it (I am not a big fan of this type of logic
>>> at all). I'd recommend moving to corosync and seeing if this delay
>>> goes away.
>>>
>>> -- Vossel
>>>
>>> Thanks in advance.
>>>
>>> --
>>> Best regards,
>>> Vitaly
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>
> --
> Best regards,
> Vitaly
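For readers tuning this later: pulling together the directives suggested above, a more aggressive totem section might look like the sketch below. The values are illustrative guesses on my part, not tested recommendations; check corosync.conf(5) for your version, and note that timeouts this tight increase the risk of spurious failovers on a loaded or lossy network.

```
totem {
        version: 2
        # Illustrative, untested values aimed at ~1 s failure detection:
        token: 500            # token loss timeout in ms (documented default: 1000)
        consensus: 600        # keep at least 1.2 * token (the documented default ratio)
        token_retransmits_before_loss_const: 4
        join: 50
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.1.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}
```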