Public bug reported:

Neutron has many code paths that can collide and race with each other. Ongoing work can mitigate and minimize these races, but progress is slow and it is very hard to fight what you don't know about (i.e. there can always be more races you're not aware of). A DLM (Distributed Lock Mechanism) such as tooz [1] can help mitigate this greatly.

An excellent example of this raciness in Neutron is the L3 auto_schedule_routers functionality. When a tenant's first HA router is created, additional resources must also be created (such as an HA network and HA ports). This resource-creation flow can be invoked simultaneously by two code paths: the original create_router (invoked from the REST API) and the L3 agent's get_router_ids/sync_routers. These simultaneous runs can produce many races, such as creating two HA networks (where only one should exist), accidentally deleting valid port bindings, and more.

Instead of hunting down these races (which can be a long and error-prone task, since more races can always exist), this can be solved much more easily by locking the operations done on a single router_id. Using tooz [1] allows for a distributed lock that spans all the API/RPC workers on a single server and even spans multiple neutron-servers. This will also help mitigate all sorts of races with different resources (a lock can be associated with a UUID, so it won't matter whether the UUID is a router_id, network_id, ...).

[1]: https://github.com/openstack/tooz/tree/master/

** Affects: neutron
     Importance: Undecided
         Status: New

** Tags: rfe

https://bugs.launchpad.net/bugs/1552680

Title:
  [RFE] Add support for DLM

Status in neutron:
  New
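
As a rough illustration of the per-router_id locking proposed in this report, here is a minimal sketch. It is not Neutron code: resource_lock, get_local_lock, ensure_ha_network, ha_networks and created are hypothetical names, and an in-process threading.Lock registry stands in for the distributed backend so the example runs without any external service. It only shows why a check-then-create flow becomes safe once both callers take the same lock.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

# In-process stand-in for a distributed lock backend (an assumption made
# for this sketch). A real deployment would hand out locks from a
# distributed coordinator instead of this local registry.
_registry_guard = threading.Lock()
_locks = defaultdict(threading.Lock)


def get_local_lock(name):
    """Return the lock object associated with `name` (bytes)."""
    with _registry_guard:
        return _locks[name]


@contextmanager
def resource_lock(resource_id, get_lock=get_local_lock):
    """Serialize all work done on one resource UUID (hypothetical helper)."""
    with get_lock(resource_id.encode("utf-8")):
        yield


ha_networks = {}  # tenant_id -> HA network name (hypothetical store)
created = []      # records which caller actually created the network


def ensure_ha_network(tenant_id, router_id, caller):
    """Hypothetical flow shared by the create_router and sync_routers paths."""
    with resource_lock(router_id):
        # Check-then-create is safe only because the lock is held.
        if tenant_id not in ha_networks:
            time.sleep(0.01)  # widen the window in which a race could occur
            ha_networks[tenant_id] = "ha-net-for-" + tenant_id
            created.append(caller)
        return ha_networks[tenant_id]


def run_demo():
    """Run both code paths concurrently against the same router."""
    threads = [
        threading.Thread(target=ensure_ha_network,
                         args=("tenant-a", "router-1", caller))
        for caller in ("create_router", "sync_routers")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return created
```

With tooz, the get_lock argument could presumably be the get_lock method of a started coordinator obtained from tooz.coordination.get_coordinator(backend_url, member_id), whose locks can also be used in a with statement. The backend URL and member id are deployment-specific and are not specified in this report.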