I have filed a bug in oslo.messaging to track the issue [1] and my colleague Kirill Bespalov posted a fix for it [2].
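The gist of the change is to make a service stop and wait for its RPC server when it gets a termination signal, so that with the fix applied its reply and fanout queues are cleaned up instead of lingering until the TTL expires. Very roughly, in oslo.messaging terms (this is only a sketch to illustrate the idea, not the actual patch in [2]; the endpoint and handler names are made up):

    import signal

    from oslo_config import cfg
    import oslo_messaging


    class PingEndpoint(object):
        """Made-up endpoint, only here so the sketch is self-contained."""

        def ping(self, ctxt, arg):
            return arg


    transport = oslo_messaging.get_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='example-topic', server='example-host')
    server = oslo_messaging.get_rpc_server(transport, target, [PingEndpoint()])
    server.start()


    def _graceful_shutdown(signum, frame):
        # Stop consuming and wait for in-flight requests before exiting.
        # With the fix in [2], stopping the RPC server is what lets the
        # driver clean up the reply/fanout queues, rather than leaving
        # them to expire via rabbit_transient_queues_ttl.
        server.stop()
        server.wait()
        raise SystemExit(0)


    signal.signal(signal.SIGINT, _graceful_shutdown)
    signal.signal(signal.SIGTERM, _graceful_shutdown)
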
We have checked the fix and it works for neutron-server, l3-agent and dhcp-agent. It does not work for openvswitch-agent and metadata-agent, meaning they do not stop the RPC server on shutdown. But I would expect that the absolute majority of fanout messages come from the l3 agent, so we can neglect these two. Does that match your observations? (A sample config snippet for the rabbit_transient_queues_ttl workaround from my previous mail is at the bottom of this message.)

Thanks,

Dmitry

[1] https://bugs.launchpad.net/oslo.messaging/+bug/1606213
[2] https://review.openstack.org/#/c/346732/

2016-07-25 13:47 GMT+03:00 Dmitry Mescheryakov <[email protected]>:

> Sam,
>
> For your case I would suggest lowering rabbit_transient_queues_ttl until
> you are comfortable with the volume of messages that arrives during that
> time. Setting the parameter to 1 will essentially replicate the behaviour
> of auto_delete queues. But I would suggest not setting it that low, as
> otherwise your OpenStack will suffer from the original bug. A value of
> around 20 seconds should probably work in most cases.
>
> I think there is room for improvement here - we could delete reply and
> fanout queues on graceful shutdown. But I am not sure it will be easy to
> implement, as it requires services (Nova, Neutron, etc.) to stop the RPC
> server on SIGINT, and I don't know whether they do that right now.
>
> I don't think we can make the SIGKILL case any better. Other than that,
> the issue could be investigated on the Neutron side - maybe the number of
> messages could be reduced there.
>
> Thanks,
>
> Dmitry
>
> 2016-07-25 9:27 GMT+03:00 Sam Morrison <[email protected]>:
>
>> We recently upgraded to Liberty and have come across some issues with
>> queue build-ups.
>>
>> This is due to changes that set queue expiries in RabbitMQ instead of
>> using auto-delete queues.
>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>> information.
>>
>> The fix for this bug is in Liberty, and while it does fix that issue, it
>> causes another one.
>>
>> Every time you restart something that has a fanout queue (e.g.
>> cinder-scheduler or the neutron agents), you end up with a queue in
>> RabbitMQ that is still bound to the exchange (and so still receiving
>> messages) but has no consumers.
>>
>> The messages in these queues are basically rubbish and don't need to
>> exist. RabbitMQ will delete these queues after 10 minutes (although the
>> default in master has now been changed to 30 minutes).
>>
>> During this time the queue grows and grows. This sets off our Nagios
>> alerts, and our ops guys have to deal with something that isn't really
>> an issue. They basically just delete the queue.
>>
>> A bad scenario is when you make a change to your cloud that restarts all
>> 1000 of your neutron agents; this leaves a couple of dead queues per
>> agent hanging around (port updates and security group updates). We get
>> around 25 messages/second on these queues, so you can see that after 10
>> minutes we have a ton of messages sitting in them.
>>
>> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes, to be precise.
>>
>> Has anyone else been suffering from this? I'd like to check before I
>> raise a bug.
>>
>> Cheers,
>> Sam
>>
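P.S. For anyone who wants to apply the rabbit_transient_queues_ttl workaround from my earlier mail quoted above: the option lives in the [oslo_messaging_rabbit] section of each service's config. The 20 seconds here is just the example value from that mail - tune it to your own message volume:

    [oslo_messaging_rabbit]
    # TTL (in seconds) for reply and fanout queues once their consumers
    # are gone; lowering it shortens the window Sam describes.
    rabbit_transient_queues_ttl = 20
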
_______________________________________________ OpenStack-operators mailing list [email protected] http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
