Hey,

I want to open a thread to discuss the following race, which I encountered while unit testing OVN.
The simplest case is running ovn-nbctl to add an lport in a unit test:

1. ovn-nbctl creates and commits the logical_port entry in the ovn-nb database. The new entry's "up" column is empty.
2. Assume ovn-nbctl's execution is suspended right after ovsdb_idl_txn_commit_block() returns.
3. Next, ovn-northd updates the ovn-sb database, finds that the new logical port is not bound, and goes ahead and sets the entry's "up" column to "false".
4. Since ovn-nbctl is still running and monitors everything, ovsdb-server tries to send the "update" to ovn-nbctl.
5. Now consider the race: if ovn-nbctl resumes and exits just before ovsdb-server sends the update, the send fails with a broken pipe (EPIPE) error, which results in a WARN message in ovsdb-server.log.

Even if we set the "up" column to "false" at creation, we can still hit a similar race if ovn-controller quickly binds the lport to a chassis and ovn-northd then updates the "up" column to "true".

I also found similar races for other command combinations, e.g., deleting a vtep switch physical port and deleting an ovs port while running the ovs-vtep simulator.

Instead of trying to fix every case (which may not even be possible), I'm thinking we could remove all monitor requests right after ovsdb_idl_txn_commit_block() and then wait until we receive ovsdb-server's acknowledgment of that request. After that, ovsdb-server will never try to send anything to the "*-*ctl" commands.

I would like to hear what you think.

Thanks,
Alex Wang