Hi George,

Thanks for the suggestions. I've tried a few things, but unfortunately I had already restarted all of the neutron services on all of the nodes, so I was unable to do a before/after comparison around the restart.

I did, however, do a before/after comparison around a floatingip-associate command. I did the following:
1) Dump the output of 'ip address' and 'iptables -S' on the network node and on the compute node containing the VM.
2) Associate the floating IP with a running VM.
3) Dump the output of 'ip address' and 'iptables -S' again and diff against the originals.

I'm not seeing any change in any of the output, which seems wrong to me.

The l3 agent log is extremely long, but it only contains a few types of entries. Instead of dumping thousands of repeating lines in this email, I've patched together one line for each type of entry I was able to find:

2015-05-14 21:22:03.910 25800 INFO neutron.agent.l3_agent [req-8436d571-b12e-4218-b13b-e5dddb461370 None] L3 agent started

Command: ['ip', '-o', 'link', 'show', 'br-ex'] Exit code: 0 Stdout: '6: br-ex: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default \\ link/ether f4:ce:46:81:bf:1a brd ff:ff:ff:ff:ff:ff\n' Stderr: '' execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:75

2015-05-14 21:22:30.794 25800 DEBUG neutron.openstack.common.lockutils [req-8436d571-b12e-4218-b13b-e5dddb461370 None] Semaphore / lock released "_rpc_loop" inner /usr/lib/python2.7/dist-packages/neutron/openstack/common/lockutils.py:252

2015-05-14 21:22:31.793 25800 DEBUG neutron.openstack.common.lockutils [req-8436d571-b12e-4218-b13b-e5dddb461370 None] Got semaphore "l3-agent" lock /usr/lib/python2.7/dist-packages/neutron/openstack/common/lockutils.py:168

2015-05-14 21:22:31.794 25800 DEBUG neutron.openstack.common.lockutils [req-8436d571-b12e-4218-b13b-e5dddb461370 None] Got semaphore / lock "_rpc_loop" inner /usr/lib/python2.7/dist-packages/neutron/openstack/common/lockutils.py:248

2015-05-14 21:22:31.794 25800 DEBUG neutron.agent.l3_agent [req-8436d571-b12e-4218-b13b-e5dddb461370 None] Starting RPC loop for 0 updated routers _rpc_loop /usr/lib/python2.7/dist-packages/neutron/agent/l3_agent.py:823

2015-05-14 21:22:31.794 25800 DEBUG neutron.agent.l3_agent [req-8436d571-b12e-4218-b13b-e5dddb461370 None]
RPC loop successfully completed _rpc_loop /usr/lib/python2.7/dist-packages/neutron/agent/l3_agent.py:840

2015-05-14 21:34:47.918 25800 DEBUG neutron.agent.l3_agent [req-8436d571-b12e-4218-b13b-e5dddb461370 None] Starting _sync_routers_task - fullsync:False _sync_routers_task /usr/lib/python2.7/dist-packages/neutron/agent/l3_agent.py:861

2015-05-14 21:35:03.789 25800 DEBUG neutron.openstack.common.rpc.amqp [req-8436d571-b12e-4218-b13b-e5dddb461370 None] UNIQUE_ID is 16d55615cce34d49a82e52d47b0b0518. _add_unique_id /usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py:342

2015-05-14 21:35:33.790 25800 DEBUG neutron.openstack.common.rpc.amqp [req-8436d571-b12e-4218-b13b-e5dddb461370 None] Making asynchronous cast on q-plugin... cast /usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/amqp.py:583

Nothing above the INFO or DEBUG level.

-Matt

On Thu, May 14, 2015 at 4:51 AM, George Mihaiescu <lmihaie...@gmail.com> wrote:

> Hi Matt,
>
> The L3 agent is in charge of implementing NAT rules inside the qrouter
> namespace, and it probably failed while Neutron API was down.
>
> I would dump the iptables rules, restart the agent(s) on the network node,
> and compare the iptables and 'ip address' output from before and after.
>
> Also, enabling debug and verbose in neutron.conf before restarting the
> agent(s) should bring up any still-existing errors.
>
> George
>
> On 14 May 2015 01:16, "Matt Davis" <mattd5...@gmail.com> wrote:
>
>> Hi all,
>>
>> I've been diagnosing a problem on my icehouse install (ubuntu with a
>> 3-node galera cluster as the database backend) and I've gotten the system
>> into a bad state. The glitch I've been chasing is that the neutron API
>> becomes unresponsive for a few minutes approximately every half hour
>> before returning to normal. Nothing obvious in the logs (no warnings,
>> errors, or critical output seems correlated with the failure). The
>> request goes in and I get no response back.
>> After 5 minutes or so, it returns to normal.
>>
>> The second problem is that while I reconfigured the system to debug
>> (removing proxy layers, connecting directly to a single galera cluster
>> node, etc.), I think I broke something having to do with floating IPs.
>> Now when I connect a floating IP to a VM, the IP shows up as "DOWN"
>> instead of "ACTIVE" and I'm unable to ping it. Notes:
>>
>> 1) The underlying VM port is active and works as expected. I can
>> connect to the fixed IP from within the VM's virtual network. The VM
>> can connect to the outside world.
>> 2) Existing VMs with existing floating IPs work as expected.
>> 3) If I create a new VM and try to apply an existing floating IP to it
>> (one that was working on a previous VM), the status for that floating IP
>> remains "ACTIVE" but I'm unable to ping it.
>> 4) All of the security groups for all of the VMs are the same.
>>
>> Floating IP manipulation doesn't seem to produce a lot of debugging
>> content in the logs, so it's difficult to trace this one. I don't know
>> if the neutron API glitches are related or if the floating IP problem is
>> a second issue that I created in the process of debugging.
>>
>> Any idea where I should look?
>>
>> Thanks,
>>
>> -Matt
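One note on the comparison described above: iptables rules are per network namespace, so 'iptables -S' run on the host will not show the DNAT/SNAT rules that George mentioned, because those live inside the qrouter namespaces. A sketch of the snapshot/diff procedure extended to cover the namespaces (the namespace names and floatingip/port IDs below are placeholders, not values from this thread):

```shell
#!/bin/sh
# Snapshot host and qrouter-namespace network state so a floating IP
# association can be diffed before/after. Run on the network node as root.

snapshot() {
    tag="$1"
    ip address > "$tag.ip"
    iptables -S > "$tag.iptables"
    # iptables state is per-namespace: the floating-IP NAT rules live
    # inside the qrouter-<router-uuid> namespaces, not on the host.
    for ns in $(ip netns list | awk '{print $1}' | grep '^qrouter-'); do
        ip netns exec "$ns" ip address > "$tag.$ns.ip"
        ip netns exec "$ns" iptables -t nat -S > "$tag.$ns.nat"
    done
}

snapshot before
# Associate the floating IP here (IDs are placeholders):
# neutron floatingip-associate <floatingip-id> <port-id>
snapshot after

# A working association should appear as a new /32 address on the
# namespace's qg- interface and a DNAT/SNAT rule pair in its nat table.
for f in before.*; do
    diff -u "$f" "after${f#before}"
done
```

If the namespace-level diffs are also empty after an associate call, that would point at the l3 agent not picking up the router update at all rather than at a NAT rule problem.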
_______________________________________________ Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack Post to : openstack@lists.openstack.org Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack