On Thu, Jul 07, 2016 at 08:31:08PM -0400, Lance Richardson wrote: > The "ovn -- 2 HVs, 3 LRs connected via LS, static routes" > test case currently exhibits frequent failures. These failures > occur because, at the time that the test packets are sent to > verify forwarding, no flows have been installed in the vswitch > for one of the hypervisors. > > Investigation shows that, in the failing case, the ofctrl state > machine has not yet transitioned to the S_UPDATE_FLOWS state. > This occurrs when ofctrl_run() is called and: > 1) The state is S_TLV_TABLE_MOD_SENT. > 2) An OFPTYPE_NXT_TLV_TABLE_REPLY message is queued for reception. > 3) No event (other than SB probe timer expiration) is expected > that would unblock poll_block() in the main ovn-controller > loop. > > In this scenario, ofctrl_run() will move state to S_CLEAR_FLOWS > and return, without having executed run_S_CLEAR_FLOWS() which > would have immediately transitioned the state to S_UPDATE_FLOWS > which is needed in order for ovn-controller to configure flows > in ovs-vswitchd. After a delay of about 5 seconds (the default > SB probe timer interval), ofctrl_run() would be called again > to make the transition to S_UPDATE_FLOWS, but by this time > the test case has already failed. > > Fix by expanding the state machine's "while state != old_state" > loop to include processing of receive events, with a maximum > iteration limit to prevent excessive looping in pathological > cases. Without this fix, around 40 failures are seen out of > 100 attempts, with this fix no failures have been observed in > several hundred attempts. > > Signed-off-by: Lance Richardson <lrich...@redhat.com> > --- > v2: Added maximum iteration limit to state machine loop.
Thanks for investigating this. I understand why your patch fixes a problem, but I don't yet understand the root cause of the problem, and I'd like to know more before I commit the patch. I have two questions: First, I don't understand why, if there's a message queued for reception as you describe in 2), the main loop would wait for a timer expiration. Before poll_block(), the main loop should call ofctrl_wait(), which calls rconn_recv_wait(), which should cause poll_block() to wake up if there's anything incoming from the OpenFlow connection. Second, I don't know why the test "state == old_state" is here in ofctrl_run(). I think that we can delete it. It could be a culprit, but on the other hand it seems like it's not a big deal if poll_block() properly wakes up when a message is received: for (int i = 0; state == old_state && i < 50; i++) { struct ofpbuf *msg = rconn_recv(swconn); if (!msg) { break; } Thanks, Ben. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev