"dev" <dev-boun...@openvswitch.org> wrote on 07/05/2016 07:58:24 AM:

> From: Lance Richardson <lrich...@redhat.com>
> To: dev@openvswitch.org
> Date: 07/05/2016 07:58 AM
> Subject: [ovs-dev] [PATCH] ovn-controller: eliminate stall in ofctrl
> state machine
> Sent by: "dev" <dev-boun...@openvswitch.org>
>
> The "ovn -- 2 HVs, 3 LRs connected via LS, static routes"
> test case currently exhibits frequent failures. These failures
> occur because, at the time that the test packets are sent to
> verify forwarding, no flows have been installed in the vswitch
> for one of the hypervisors.
>
> Investigation shows that, in the failing case, the ofctrl state
> machine has not yet transitioned to the S_UPDATE_FLOWS state.
> This occurrs when ofctrl_run() is called and:
>    1) The state is S_TLV_TABLE_MOD_SENT.
>    2) An OFPTYPE_NXT_TLV_TABLE_REPLY message is queued for reception.
>    3) No event (other than SB probe timer expiration) is expected
>       that would unblock poll_block() in the main ovn-controller
>       loop.
>
> In this scenario, ofctrl_run() will move state to S_CLEAR_FLOWS
> and return, without having executed run_S_CLEAR_FLOWS() which
> would have immediately transitioned the state to S_UPDATE_FLOWS
> which is needed in order for ovn-controller to configure flows
> in ovs-vswitchd. After a delay of about 5 seconds (the default
> SB probe timer interval), ofctrl_run() would be called again
> to make the transition to S_UPDATE_FLOWS, but by this time
> the test case has already failed.
>
> Fix by expanding the state machine's "while state != old_state"
> loop to include processing of receive events. Without this
> fix, around 40 failures are seen out of 100 attempts, with
> this fix no failures have been observed in several hundred
> attempts.
>
> Signed-off-by: Lance Richardson <lrich...@redhat.com>
> ---

I was going to simple ack this as being useful for the unit tests,
but then I got to wondering if it made a difference in the real
world, so I set up the following:

4 node OpenStack cloud running tip of tree master.
Run rally's create-and-list-ports test four times:
1a+b) with 15 repetitions (150 ports) and 3 tenants,
      w/ and w/o this patch
2a+b) with 15 repetitions (150 ports) and 15 tenants,
      w/ and w/o this patch

Dumping data and running t-tests on the create port times gives me:

- a very statistically significant 22% difference between 1a and 1b
  (2-tailed P of 0.0057)
- an extremely statistically significant 35% difference between 2a
  and 2b (2-tailed P of 0.0001)

So...

Acked-By: Ryan Moats <rmo...@us.ibm.com>
Tested-By: Ryan Moats <rmo...@us.ibm.com>

Can we get this merged quickly, pretty please??? :)

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to