I think you're right, Darragh. It must have been Montreal's snow and cold freezing my brain: I investigated the same issue a while ago and tried to change CirrOS to send a DHCPDISCOVER every 10 seconds instead of every 60, but then I moved on to something else, as I wasn't even sure a new CirrOS base image could have been brought into the gate tests.
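For context, the interval in question is a udhcpc retry knob. A non-authoritative sketch of the sort of change involved, using BusyBox udhcpc's documented flags (the values here are illustrative, and in the stock CirrOS image the invocation is compiled in rather than configurable):

```shell
# Illustrative only: BusyBox udhcpc retry knobs. In the stock CirrOS
# image this command line is baked in and not configurable.
#   -t N  send up to N DHCPDISCOVER packets (default 3)
#   -T S  wait S seconds between packets (the 60s interval discussed here)
#   -A S  wait S seconds before restarting the whole discover sequence
udhcpc -i eth0 -t 5 -T 10 -A 10
```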
I think I also sent a related email to the mailing list, suggesting to increase the timeouts to a value that would ensure at least a second DHCPDISCOVER is sent by the VM.

Anyway, we have a few patches which should make this failure mode less frequent. They're all -2 currently, as they're always failing the gate (and I don't know why). However, from another email Sean recently sent, it seems it's a general Neutron issue.

Salvatore

On 20 January 2014 10:51, Darragh O'Reilly <dara2002-openst...@yahoo.com> wrote:
>
> On Monday, 20 January 2014, 15:33, Jay Pipes <jaypi...@gmail.com> wrote:
>
> > Sorry for top-posting -- using web mail client.
>
> no worries - it doesn't bother me.
>
> > Is it possible to change the retry interval in Cirros (or cloud-init?) so
> > that the backoff is less than 60 seconds?
>
> I think the udhcpc command-line parameters are baked into the image. It's
> part of BusyBox, and I'm not even sure if it's configurable from a
> script/text file.
>
> > Best,
> > -jay
> >
> > On Mon, Jan 20, 2014 at 10:23 AM, Darragh O'Reilly <
> > dara2002-openst...@yahoo.com> wrote:
> >
> >> I did a test to see what the DHCP client on CirrOS does. I killed the
> >> DHCP agent and started an instance. The instance sent the first DHCP
> >> discover after about 35 sec, then another 60 sec later, and a final one
> >> after another 60 sec.
> >>
> >> So a revised theory for what happened is this:
> >>
> >> t=0   tempest starts the VM and starts polling for ACTIVE status
> >> t=20  instance --> ACTIVE, and tempest starts polling the floating IP for 60 sec
> >> t=40  instance does a DHCP discover - no response - so sets a timer for 60 sec
> >> t=45  ovs-agent sets the port VLAN
> >> t=80  tempest gives up and kills the VM
> >> t=100 instance would have sent another DHCP discover now if it had been let live
> >>
> >> I think it would be worth trying to change that test to poll for 120
> >> seconds instead of 60.
> >>
> >> On Monday, 20 January 2014, 11:23, Darragh O'Reilly <
> >> dara2002-openst...@yahoo.com> wrote:
> >>
> >> Hi Salvatore,
> >>>
> >>> I presume it's this one?
> >>> http://logs.openstack.org/38/65838/4/check/check-tempest-dsvm-neutron-isolated/d108e4a/logs/tempest.txt.gz?#_2014-01-19_20_50_14_604
> >>>
> >>> Is it true that the CirrOS image just fires off a few DHCP discovers
> >>> and then gives up? If so, then maybe it did so before the tagging
> >>> happened. Do we have the instance console log? It took about 45 seconds
> >>> from when the port was created to when it was tagged.
> >>>
> >>> 2014-01-19 20:48:57.412 8142 DEBUG neutron.agent.linux.ovsdb_monitor [-] Output
> >>> received from ovsdb monitor:
> >>> {"data":[["3602a7b2-b559-4709-9bf0-53ae2af68d06","insert","tap496b808c-b5"]],"headings":["row","action","name"]}
> >>> <snip>
> >>> 2014-01-19 20:49:41.925 8142 DEBUG neutron.agent.linux.utils [-]
> >>> Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf',
> >>> 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'tap496b808c-b5', 'tag=64']
> >>> Exit code: 0
> >>>
> >>> Darragh.
> >>>
> >>>> I have been seeing in the past 2 days timeout failures on gate jobs
> >>>> which I am struggling to explain. An example is available in [1].
> >>>> These are the usual failures that we associate with bug 1253896, but
> >>>> this time I can verify that:
> >>>> - The floating IP is correctly wired (IP and NAT rules)
> >>>> - The DHCP port is correctly wired, as well as the VM port and the router port
> >>>> - The DHCP agent is correctly started for the network
> >>>>
> >>>> However, no DHCP DISCOVER request is sent. Only the DHCP RELEASE
> >>>> message is seen.
> >>>> Any help at interpreting the logs will be appreciated.
> >>>>
> >>>> Salvatore
> >>>>
> >>>> [1] http://logs.openstack.org/38/65838
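The race in Darragh's timeline above can be sketched in a few lines (Python, not part of any patch; all timings are the assumed values from the thread, and `poll_succeeds` is a hypothetical helper, not tempest code):

```python
# Sketch of the timeline from the thread: can tempest's floating-IP
# polling window overlap a DHCPDISCOVER sent after the ovs-agent has
# tagged the port? All values in seconds, taken from the thread.
DISCOVER_TIMES = [40, 100, 160]   # first discover ~35-40s, then every 60s
PORT_TAGGED_AT = 45               # ovs-agent sets the port VLAN
ACTIVE_AT = 20                    # instance goes ACTIVE; polling starts

def poll_succeeds(poll_window):
    """True if some discover is sent after tagging but before tempest
    gives up (a discover before tagging gets no reply)."""
    deadline = ACTIVE_AT + poll_window
    return any(PORT_TAGGED_AT <= t <= deadline for t in DISCOVER_TIMES)

print(poll_succeeds(60))   # False: tempest gives up at t=80, before t=100
print(poll_succeeds(120))  # True: the t=100 discover lands in the window
```

This is just the argument for polling 120 seconds instead of 60: the first usable discover after tagging arrives at t=100, past the 60-second deadline but well inside a 120-second one.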
_______________________________________________ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev