"dev" <dev-boun...@openvswitch.org> wrote on 07/05/2016 07:58:24 AM:
> From: Lance Richardson <lrich...@redhat.com>
> To: dev@openvswitch.org
> Date: 07/05/2016 07:58 AM
> Subject: [ovs-dev] [PATCH] ovn-controller: eliminate stall in ofctrl
> state machine
> Sent by: "dev" <dev-boun...@openvswitch.org>
>
> The "ovn -- 2 HVs, 3 LRs connected via LS, static routes"
> test case currently exhibits frequent failures. These failures
> occur because, at the time that the test packets are sent to
> verify forwarding, no flows have been installed in the vswitch
> for one of the hypervisors.
>
> Investigation shows that, in the failing case, the ofctrl state
> machine has not yet transitioned to the S_UPDATE_FLOWS state.
> This occurs when ofctrl_run() is called and:
>    1) The state is S_TLV_TABLE_MOD_SENT.
>    2) An OFPTYPE_NXT_TLV_TABLE_REPLY message is queued for reception.
>    3) No event (other than SB probe timer expiration) is expected
>       that would unblock poll_block() in the main ovn-controller
>       loop.
>
> In this scenario, ofctrl_run() will move state to S_CLEAR_FLOWS
> and return, without having executed run_S_CLEAR_FLOWS() which
> would have immediately transitioned the state to S_UPDATE_FLOWS
> which is needed in order for ovn-controller to configure flows
> in ovs-vswitchd. After a delay of about 5 seconds (the default
> SB probe timer interval), ofctrl_run() would be called again
> to make the transition to S_UPDATE_FLOWS, but by this time
> the test case has already failed.
>
> Fix by expanding the state machine's "while state != old_state"
> loop to include processing of receive events. Without this
> fix, around 40 failures are seen out of 100 attempts; with
> this fix, no failures have been observed in several hundred
> attempts.
>
> Signed-off-by: Lance Richardson <lrich...@redhat.com>
> ---

I was going to simply ack this as being useful for the unit tests, but
then I got to wondering if it made a difference in the real world, so
I set up the following:

4 node OpenStack cloud running tip of tree master.
Run rally's create-and-list-ports test four times:

1a+b) with 15 repetitions (150 ports) and 3 tenants, w/ and w/o this patch
2a+b) with 15 repetitions (150 ports) and 15 tenants, w/ and w/o this patch

Dumping data and running t-tests on the create port times gives me:

- a very statistically significant 22% difference between 1a and 1b
  (2-tailed P of 0.0057)
- an extremely statistically significant 35% difference between 2a and 2b
  (2-tailed P of 0.0001)

So...

Acked-By: Ryan Moats <rmo...@us.ibm.com>
Tested-By: Ryan Moats <rmo...@us.ibm.com>

Can we get this merged quickly, pretty please??? :)
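
For anyone reading this thread later, the stall described in the quoted commit message can be modelled with a toy state machine. This is only a sketch under stated assumptions, not the actual ofctrl.c code: struct toy_ofctrl, run_state(), recv_messages(), toy_run_before() and toy_run_after() are hypothetical names, and the pending_replies counter stands in for the real receive queue. Only the three state names come from the commit message.

```c
/* Toy model of the stall; all names except the S_* states are hypothetical. */
enum state { S_TLV_TABLE_MOD_SENT, S_CLEAR_FLOWS, S_UPDATE_FLOWS };

struct toy_ofctrl {
    enum state state;
    int pending_replies;    /* stand-in for queued TLV table replies */
};

/* Per-state "run" step: S_CLEAR_FLOWS moves straight to S_UPDATE_FLOWS. */
static void run_state(struct toy_ofctrl *c)
{
    if (c->state == S_CLEAR_FLOWS) {
        c->state = S_UPDATE_FLOWS;
    }
}

/* Receive step: a queued reply moves S_TLV_TABLE_MOD_SENT to S_CLEAR_FLOWS. */
static void recv_messages(struct toy_ofctrl *c)
{
    if (c->state == S_TLV_TABLE_MOD_SENT && c->pending_replies > 0) {
        c->pending_replies--;
        c->state = S_CLEAR_FLOWS;
    }
}

/* Before: receive processing sits outside the loop, so a state change
 * caused by a received message is not acted on until the next call. */
static void toy_run_before(struct toy_ofctrl *c)
{
    enum state old_state;
    do {
        old_state = c->state;
        run_state(c);
    } while (c->state != old_state);
    recv_messages(c);
}

/* After: receive processing is inside the loop, so the machine keeps
 * going until neither step changes the state. */
static void toy_run_after(struct toy_ofctrl *c)
{
    enum state old_state;
    do {
        old_state = c->state;
        run_state(c);
        recv_messages(c);
    } while (c->state != old_state);
}
```

Starting in S_TLV_TABLE_MOD_SENT with one reply queued, a single call to toy_run_before() returns in S_CLEAR_FLOWS, while a single call to toy_run_after() reaches S_UPDATE_FLOWS. In this model, the first variant needs a second wakeup to finish the transition, which matches the roughly 5 second wait for the SB probe timer described in the commit message.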