Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

Fox, Kevin M Thu, 28 Jul 2016 05:49:12 -0700

It does send a SIGTERM and wait.

I'm saying I'm concerned that the services aren't all cleaning up after 
themselves today.

Thanks,
Kevin
________________________________
From: Dmitry Mescheryakov [[email protected]]
Sent: Thursday, July 28, 2016 5:22 AM
To: Fox, Kevin M
Cc: Sam Morrison; OpenStack Operators
Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty

2016-07-26 21:20 GMT+03:00 Fox, Kevin M <[email protected]>:
It only relates to Kubernetes in that Kubernetes can do automatic rolling 
upgrades by destroying/replacing a service. If the services don't clean up 
after themselves, then performing a rolling upgrade will break things.

So, what do you think is the best approach to ensuring all the services shut 
things down properly? It seems like a cross-project issue. Should a spec be 
submitted?

I think it would be fair if Kubernetes sends a SIGTERM to the OpenStack 
service in a container, then waits for the service to shut down, and only then 
destroys the container.

It might not be very important for our case though, if we agree to split the 
expiration time for fanout and reply queues. And I don't know of any other case 
where an OpenStack service needs to clean up in some external place on shutdown.

Thanks,

Dmitry

Thanks,
Kevin
________________________________
From: Dmitry Mescheryakov [[email protected]]
Sent: Tuesday, July 26, 2016 11:01 AM
To: Fox, Kevin M
Cc: Sam Morrison; OpenStack Operators

Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty

2016-07-25 18:47 GMT+03:00 Fox, Kevin M <[email protected]>:
Ah. Interesting.

The graceful shutdown would really help the Kubernetes situation too. 
Kubernetes can do easy rolling upgrades, and having the processes be able to 
clean up after themselves as they are upgraded is important. Is this something 
that needs to go into oslo.messaging, or does it have to be added to all 
projects using it?

It needs to be fixed both on the oslo.messaging side (delete the fanout queue 
on RPC server stop, which is done by Kirill's CR) and on the side of the 
projects using it, as they need to actually stop the RPC server before shutting 
down. As I wrote earlier, among the Neutron processes, right now only the 
openvswitch and metadata agents do not stop the RPC server.

I am not sure how that relates to Kubernetes, as I am not very familiar with it.

Thanks,

Dmitry

Thanks,
Kevin
________________________________
From: Dmitry Mescheryakov [[email protected]]
Sent: Monday, July 25, 2016 3:47 AM
To: Sam Morrison
Cc: OpenStack Operators
Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty

Sam,

For your case I would suggest lowering rabbit_transient_queues_ttl until you 
are comfortable with the volume of messages that arrives during that time. 
Setting the parameter to 1 will essentially replicate the behaviour of 
auto_delete queues. But I would suggest not setting it that low, as otherwise 
your OpenStack will suffer from the original bug. A value like 20 seconds 
should probably work in most cases.
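
As a rough illustration of what the TTL means on the broker side: RabbitMQ's 
`x-expires` queue argument takes a value in milliseconds and deletes a queue 
once it has been unused for that long. The sketch below assumes the TTL in 
seconds is translated into that argument; the helper name is hypothetical and 
this is not a copy of the oslo.messaging code.

```python
def transient_queue_arguments(ttl_seconds):
    """Build queue arguments for a transient (fanout or reply) queue.

    Assumption: the TTL is applied through RabbitMQ's "x-expires" queue
    argument, which is expressed in milliseconds.
    """
    args = {}
    if ttl_seconds > 0:
        # The queue is removed after being unused for this many milliseconds.
        args["x-expires"] = int(ttl_seconds * 1000)
    return args


# The 20 second value suggested above, and the 10 minute default.
suggested = transient_queue_arguments(20)
liberty_default = transient_queue_arguments(600)
```

With an expiry the queue outlives a lost connection for the chosen period, 
whereas an auto_delete queue is removed as soon as its last consumer goes away.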

I think there is room for improvement here: we can delete reply and fanout 
queues on graceful shutdown. But I am not sure it will be easy to implement, 
as it requires the services (Nova, Neutron, etc.) to stop the RPC server on 
SIGINT, and I don't know if they do that right now.
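
A minimal sketch of what "stop the RPC server on a signal" could look like in a 
service's main loop. `FakeRpcServer` is a hypothetical stand-in so the example 
runs on its own; it only mirrors the start/stop/wait shape of an RPC server 
object, and the real services may structure this differently (for example 
under eventlet).

```python
import os
import signal
import threading


class FakeRpcServer:
    """Hypothetical stand-in for an RPC server with start/stop/wait."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        # A real server would stop consuming here; this is where the fanout
        # queue could be deleted.
        self.calls.append("stop")

    def wait(self):
        # A real server would wait for in-flight requests to finish.
        self.calls.append("wait")


def install_shutdown_handlers(signals=(signal.SIGINT, signal.SIGTERM)):
    """Return an Event that is set when one of the signals arrives."""
    shutdown_requested = threading.Event()

    def handler(signum, frame):
        # Keep the handler tiny; the actual cleanup runs in the main loop.
        shutdown_requested.set()

    for sig in signals:
        signal.signal(sig, handler)
    return shutdown_requested


def run_service(server, shutdown_requested):
    server.start()
    shutdown_requested.wait()  # serve until asked to shut down
    server.stop()
    server.wait()


def simulate_sigterm():
    """Send SIGTERM to this process, as an orchestrator would."""
    os.kill(os.getpid(), signal.SIGTERM)
```

Nothing here helps with SIGKILL, since no handler can run in that case.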

I don't think we can make the SIGKILL case any better. Other than that, the 
issue could be investigated on the Neutron side; maybe the number of messages 
could be reduced there.

Thanks,

Dmitry

2016-07-25 9:27 GMT+03:00 Sam Morrison <[email protected]>:
We recently upgraded to Liberty and have come across some issues with queue 
build-ups.

This is due to a change in the rabbit driver that sets queue expiries instead 
of using auto-delete queues.
See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more information.

The fix for this bug is in Liberty. It does fix one issue, but it causes 
another.

Every time you restart something that has a fanout queue (e.g. cinder-scheduler 
or the neutron agents), you will have a queue in rabbit that is still bound to 
the rabbitmq exchange (and so still getting messages in) but has no consumers.

The messages in these queues are basically rubbish and don’t need to exist. 
Rabbit will delete these queues after 10 mins (although the default in master 
has now been changed to 30 mins).

During this time the queue will grow and grow with messages. This sets off our 
nagios alerts and our ops guys have to deal with something that isn’t really an 
issue. They basically delete the queue.
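
A small sketch of how such dead queues could be spotted before they trip an 
alert. It parses the tab-separated output of 
`rabbitmqctl list_queues name messages consumers`. The `_fanout_` and `reply_` 
name patterns are an assumption about how the transient queues are named, and 
the sample listing is made up for illustration.

```python
def find_orphaned_queues(listing, min_messages=1):
    """Return (name, messages) for transient queues that have no consumers.

    `listing` is the text printed by
    `rabbitmqctl list_queues name messages consumers`.
    """
    orphans = []
    for line in listing.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # banner or header line
        name, messages, consumers = fields
        if not (messages.isdigit() and consumers.isdigit()):
            continue  # column header row
        # Assumed naming patterns for fanout and reply queues.
        transient = "_fanout_" in name or name.startswith("reply_")
        if transient and int(consumers) == 0 and int(messages) >= min_messages:
            orphans.append((name, int(messages)))
    return orphans


# Made-up sample output, for illustration only.
SAMPLE = (
    "Listing queues ...\n"
    "q-agent-notifier-port-update_fanout_aaaa\t15000\t0\n"
    "q-agent-notifier-port-update_fanout_bbbb\t0\t1\n"
    "reply_cccc\t3\t0\n"
    "cinder-scheduler\t0\t2\n"
)
```

The returned names could then be handed to whatever the ops team already uses 
to delete the queue by hand.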

A bad scenario is when you make a change to your cloud that means all your 1000 
neutron agents are restarted. This causes a couple of dead queues per agent 
(port updates and security group updates) to hang around. We get around 25 
messages / second on these queues, so after 10 minutes we have a ton of 
messages in them.

1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
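
The same arithmetic as a small helper, so the backlog can be compared across 
expiry values. The function and parameter names are only illustrative; the 
inputs are the figures above plus the 20 second and 30 minute values mentioned 
in this thread.

```python
def dead_queue_backlog(agents, queues_per_agent, msgs_per_second, ttl_seconds):
    """Messages piled up in consumer-less queues before they expire."""
    return agents * queues_per_agent * msgs_per_second * ttl_seconds


def backlog_by_ttl(ttls, agents=1000, queues_per_agent=2, msgs_per_second=25):
    """Map each TTL (in seconds) to the resulting backlog."""
    return {
        ttl: dead_queue_backlog(agents, queues_per_agent, msgs_per_second, ttl)
        for ttl in ttls
    }


for ttl, backlog in backlog_by_ttl([20, 600, 1800]).items():
    print(f"ttl={ttl:>4}s -> {backlog:,} messages")
```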

Has anyone else been suffering with this before I raise a bug?

Cheers,
Sam

_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
