We periodically lose the qpid connection for one of our compute services. We're on Folsom in a 2-node setup, with the compute services running on one node and qpidd, scheduler, network, etc. running on the other.
We've scaled this environment up to 2800 instances. When we hit this problem, the scheduler continues to get updates from the compute service, so the service is still reported as active. However, when we look at the qpid queues with "qpid-config queues", the compute service's queue no longer exists, and the compute service no longer receives spawn requests. The scheduler keeps selecting this compute service for new boot requests, which then get stuck in BUILD state.

I have a trace on pastebin: http://pastebin.com/rDid7Egm

The first error appears to be an RPC timeout, "Timed out waiting for RPC response: None", followed by an AssertionError in qpid/messaging/driver.py.

Any ideas about what might be happening would be appreciated. If you have thoughts on how to debug this further, I'd love to hear them as well.

Thanks!
-Paul
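P.S. To catch the moment the queue disappears, a small check along these lines could be run periodically on the broker node. This is only a rough sketch: it assumes the per-host queue is named compute.<hostname> (the host name "node1" below is made up), and it makes no assumption about the exact layout of the "qpid-config queues" output beyond the queue name being the first token on its line.

```python
import subprocess


def queue_names(qpid_config_output):
    """Collect the first whitespace-separated token of every non-empty line.

    Header and separator lines end up in the set too, which is harmless
    because we only test for membership of the expected queue names.
    """
    names = set()
    for line in qpid_config_output.splitlines():
        tokens = line.split()
        if tokens:
            names.add(tokens[0])
    return names


def missing_queues(qpid_config_output, expected):
    """Return the expected queue names that do not appear in the output."""
    present = queue_names(qpid_config_output)
    return [name for name in expected if name not in present]


def check_broker(expected):
    """Run "qpid-config queues" (the command from this post) and report gaps."""
    output = subprocess.check_output(["qpid-config", "queues"]).decode()
    return missing_queues(output, expected)


if __name__ == "__main__":
    # Assumed queue names; adjust to whatever the broker shows when healthy.
    gone = check_broker(["compute", "compute.node1"])
    if gone:
        print("missing queues: %s" % ", ".join(gone))
```

Logging the time of the first "missing queues" line would make it easier to line up the queue loss with the RPC timeout and the AssertionError in the compute log.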
_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp