On 20 Feb 2014, at 6:06 pm, yusuke iida <yusk.i...@gmail.com> wrote:
> Hi, Andrew
>
> I tested in the following environments.
>
> KVM virtual 16 machines
> CPU: 1
> memory: 2048MB
> OS: RHEL6.4
> Pacemaker-1.1.11 (709b36b)
> corosync-2.3.2
> libqb-0.16.0
>
> It looks like performance is much better on the whole.
>
> However, during the 16-node test the event queue overflowed on some nodes.
> It happened on vm01 and vm09.
>
> On vm01 the queue overflowed between the cib and crm_mon:
> Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:506 ) trace: crm_ipcs_flush_events: Sent 40 events (729 remaining) for 0x1cd1850[16243]: Resource temporarily unavailable (-11)
> Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:515 ) error: crm_ipcs_flush_events: Evicting slow client 0x1cd1850[16243]: event queue reached 729 entries

Who was pid 16243? Doesn't look like a pacemaker daemon.

> On vm09 the queue overflowed between the cib and stonithd:
> Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:506 ) trace: crm_ipcs_flush_events: Sent 36 events (530 remaining) for 0x105ec10[15520]: Resource temporarily unavailable (-11)
> Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:515 ) error: crm_ipcs_flush_events: Evicting slow client 0x105ec10[15520]: event queue reached 530 entries
>
> I looked at the code around that part, but I could not work out how to solve it.
>
> Is sending at most 100 messages at a time too few?
> Is there a problem in how the wait time after sending a message is calculated?
> Is the threshold of 500 too low?

Being 500 behind is really quite a long way.

> I attach a crm_report from when the problem occurred.
> https://drive.google.com/file/d/0BwMFJItoO-fVeGZuWkFnZTFWTDQ/edit?usp=sharing
>
> Regards,
> Yusuke
>
> 2014-02-18 19:53 GMT+09:00 yusuke iida <yusk.i...@gmail.com>:
>> Hi, Andrew and Digimer
>>
>> Thank you for the comments.
>>
>> I solved this problem with reference to another mailing-list thread:
>> https://bugzilla.redhat.com/show_bug.cgi?id=880035
>>
>> In short, the kernel in my environment was old.
>> It is now updated to the newest kernel:
>> kernel-2.6.32-431.5.1.el6.x86_64.rpm
>>
>> The following parameters are now set on the bridge that carries the
>> corosync traffic.
>> As a result, "Retransmit List" messages hardly occur any more.
>> # echo 1 > /sys/class/net/<bridge>/bridge/multicast_querier
>> # echo 0 > /sys/class/net/<bridge>/bridge/multicast_snooping
>>
>> 2014-02-18 9:49 GMT+09:00 Andrew Beekhof <and...@beekhof.net>:
>>>
>>> On 31 Jan 2014, at 6:20 pm, yusuke iida <yusk.i...@gmail.com> wrote:
>>>
>>>> Hi, all
>>>>
>>>> I am measuring the performance of Pacemaker with the following combination:
>>>> Pacemaker-1.1.11.rc1
>>>> libqb-0.16.0
>>>> corosync-2.3.2
>>>>
>>>> All nodes are KVM virtual machines.
>>>>
>>>> After starting 14 nodes, I stopped the vm01 node forcibly from the host.
>>>> "virsh destroy vm01" was used for the stop.
>>>> Then, in addition to the forcibly stopped node, other nodes were
>>>> separated from the cluster.
>>>>
>>>> corosync then outputs "Retransmit List:" messages in large quantities.
>>>
>>> Probably best to poke the corosync guys about this.
>>>
>>> However, <= .11 is known to cause significant CPU usage with that many nodes.
>>> I can easily imagine this starving corosync of resources and causing breakage.
>>>
>>> I would _highly_ recommend retesting with the current git master of pacemaker.
>>> I merged the new cib code last week, which is faster by _two_ orders of
>>> magnitude and uses significantly less CPU.
>>>
>>> I'd be interested to hear your feedback.
>> Since I am very interested in this, I would like to test it, even though the
>> "Retransmit List" problem is solved.
>> Please wait a little for the results.
>>
>> Thanks,
>> Yusuke
>>
>>>>
>>>> Why is a node on which no failure occurred reported as "lost"?
>>>>
>>>> Please advise if there is a problem somewhere in the setup.
>>>>
>>>> I attached the report from when the problem occurred.
>>>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>>>
>>>> Regards,
>>>> Yusuke
>>>> --
>>>> ----------------------------------------
>>>> METRO SYSTEMS CO., LTD
>>>>
>>>> Yusuke Iida
>>>> Mail: yusk.i...@gmail.com
>>>> ----------------------------------------
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
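[Editor's note on the eviction threshold discussed above: the backlog grows whenever the cib enqueues events faster than the client drains them, so a persistently slow client will cross any fixed threshold eventually; raising 500 only delays the eviction. A toy model, with all rates hypothetical and not taken from the pacemaker source:]

```shell
#!/bin/sh
# Toy model of the "Evicting slow client" behaviour seen in the log:
# the server enqueues PRODUCE events per tick, the slow client drains
# only DRAIN per tick, and the client is evicted once the backlog
# exceeds THRESHOLD (500, as in the crm_ipcs_flush_events messages).
# PRODUCE and DRAIN are invented numbers for illustration only.
PRODUCE=60
DRAIN=40
THRESHOLD=500

backlog=0
tick=0
while [ "$backlog" -le "$THRESHOLD" ]; do
    tick=$((tick + 1))
    backlog=$((backlog + PRODUCE - DRAIN))
done
echo "evicting slow client after $tick ticks: event queue reached $backlog entries"
```

[With these rates the backlog grows by 20 per tick and crosses 500 after 26 ticks; any steady production/drain imbalance produces the same linear growth.]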
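[Editor's note on the bridge settings above: since the two `echo` writes go straight into /sys, a small wrapper that prints the writes by default and only applies them when asked makes them easier to review. This is only a sketch; the bridge name "br0" and the APPLY/BRIDGE switches are placeholders invented for this example, not part of the original commands:]

```shell
#!/bin/sh
# Sketch: print (default) or apply the two bridge multicast settings
# from the thread.  BRIDGE defaults to the placeholder "br0"; set
# APPLY=1 to actually write to /sys instead of printing.
BRIDGE=${BRIDGE:-br0}

set_bridge_opt() {
    opt=$1
    val=$2
    path="/sys/class/net/$BRIDGE/bridge/$opt"
    if [ "${APPLY:-0}" = "1" ]; then
        echo "$val" > "$path"      # really write the setting
    else
        echo "echo $val > $path"   # dry run: show what would be done
    fi
}

set_bridge_opt multicast_querier  1
set_bridge_opt multicast_snooping 0
```

[Run as `BRIDGE=<bridge> APPLY=1 sh script.sh` once the dry-run output looks right. Note that values written to /sys do not survive a reboot, so they need to be reapplied at boot by whatever mechanism the distribution provides.]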