On 5/26/23 08:37, Ales Musil wrote:
> On Fri, May 26, 2023 at 7:58 AM Han Zhou <hz...@ovn.org> wrote:
>>
>> On Thu, May 25, 2023 at 9:19 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> On 5/25/23 14:08, Ales Musil via discuss wrote:
>>>> Hi,
>>>>
>>>> to improve the MAC binding aging mechanism we need a way to ensure
>>>> that rows which are still in use are preserved. This doesn't happen
>>>> with the current implementation.
>>>>
>>>> I propose the following solution, which should solve the issue; any
>>>> questions or comments are welcome. If there isn't anything major
>>>> that would block this approach, I would start to implement it so it
>>>> can be available in 23.09.
>>>>
>>>> For the approach itself:
>>>>
>>>> Add a "mac_cache_use" action into the "lr_in_learn_neighbor" table
>>>> (only in the flow that continues on a known MAC binding):
>>>>
>>>> match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
>>>>        REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(next;)
>>>> ->
>>>> match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
>>>>        REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)
>>>>
>>>> The "mac_cache_use" action would translate to a resubmit into a
>>>> separate table with one flow per MAC binding, as follows:
>>>>
>>>> match=(ip.src=<MB_IP>, eth.src=<MB_MAC>, datapath=<MB_Datapath>),
>>>> action=(drop;)
>>
>> It is possible that some workload has heavy traffic in the ingress
>> direction only, such as some UDP streams, but is not sending anything
>> out for a long interval. So I am not sure if using "src" only would
>> be sufficient.
>
> Using table 67 (for the "dst" check) still has the black hole problem.
> I'm not sure there is a universal solution that would satisfy UDP
> traffic that goes only one way, no matter which.
> Also, the negative effect of re-ARPing on UDP traffic should be a lot
> smaller than on connection-oriented protocols.
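[For illustration only, a toy model of the per-binding table Ales describes above: one entry per (datapath, ip, mac), where a hit on "mac_cache_use" refreshes the entry's statistics, which is what OpenFlow flow stats give for free, and the packet is then dropped in that table. The class and method names are made up for this sketch and are not actual OVN code.]

```python
class MacCacheTable:
    """Toy model of the proposed per-MAC-binding table.

    Each MAC binding gets one "flow" keyed by (datapath, ip, mac).
    mac_cache_use() models a packet hitting that flow: the hit bumps
    the statistics (here, a last-hit timestamp), and the cloned packet
    is dropped. idle_age() is then the time since the last hit, mirroring
    the OpenFlow idle_age field the controller would later dump.
    """

    def __init__(self):
        self.last_hit = {}  # (datapath, ip, mac) -> last-hit time

    def add_binding(self, datapath, ip, mac, now):
        self.last_hit[(datapath, ip, mac)] = now

    def mac_cache_use(self, datapath, ip_src, eth_src, now):
        key = (datapath, ip_src, eth_src)
        if key in self.last_hit:
            # Statistics bumped for the matching binding; packet dropped.
            self.last_hit[key] = now

    def idle_age(self, datapath, ip, mac, now):
        return now - self.last_hit[(datapath, ip, mac)]
```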
I agree, it's not sufficient for all cases, but I don't think anything
works for the uni-directional traffic scenario. We need to be
conservative and re-ARP in that case. So Ales' suggestion sounds good
to me.

>>> One concern here would be that it will likely cause a packet clone
>>> in the datapath just to immediately drop it. So, it might have a
>>> noticeable performance impact.
>>
>> +1. We need to be careful to avoid any dataplane performance impact,
>> which doesn't sound justified for the value.
>
> We can test whether there is a clone or not; if so, we might need to
> include it as a regular table in the pipeline.

A test should clarify whether there will be a clone() or not.

>>>> This should bump the statistics every time for the correct MAC
>>>> binding. In ovn-controller we could periodically dump the flows
>>>> from this table. The period would be set to
>>>> MIN(mac_binding_age_threshold/2) over all local datapaths. The dump
>>>> would happen from a different thread with its own rconn to prevent
>>>> backlogging issues. The thread would receive mapped data from an
>>>> I-P node that would keep track of the mapping
>>>> datapath -> cookies -> MAC bindings. This allows us to avoid
>>>> constant lookups, but at the cost of keeping track of all local
>>>> MAC bindings. To save some computation time, this I-P node could be
>>>> relevant only for datapaths that actually have the threshold set.
>>>>
>>>> If the "idle_age" of a particular flow is smaller than the
>>>> datapath's "mac_binding_age_threshold", it means the binding is
>>>> still in use. To prevent a lot of updates when the traffic is still
>>>> relevant on multiple controllers, we would check whether the
>>>> timestamp is older than the "dump period"; if not, we don't have to
>>>> update it, because someone else already did.
>>
>> Thanks for trying to reduce the number of updates to the SB DB, but I
>> still have some concerns about this.
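[Editorial sketch of the refresh decision described above, under the stated rules: the dump period is MIN(mac_binding_age_threshold/2) over the local datapaths, a binding is "in use" when its flow's idle_age is below the threshold, and the SB update is skipped when another controller refreshed the timestamp within the current dump period. Function names are invented for illustration; this is not the ovn-controller implementation.]

```python
def dump_period(thresholds):
    """Dump period = MIN(mac_binding_age_threshold / 2) over the
    local datapaths that have aging configured."""
    return min(t / 2 for t in thresholds)


def should_update_timestamp(idle_age, threshold, ts_age, period):
    """Decide whether this controller should refresh the SB timestamp.

    idle_age:  seconds since the binding's flow was last hit.
    threshold: the datapath's mac_binding_age_threshold.
    ts_age:    age of the timestamp currently stored in the SB record.
    period:    this controller's dump period.
    """
    in_use = idle_age < threshold
    # If the SB timestamp is younger than the dump period, some other
    # controller already refreshed it; skip the update.
    recently_refreshed = ts_age < period
    return in_use and not recently_refreshed
```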
>> In theory, to prevent the records from being deleted while still in
>> use, at least one timestamp update per threshold is required for each
>> record. Even if we bundle the updates from each node, assume that the
>> workloads that own the IP/MAC of the mac_binding records are
>> distributed across 1000 nodes and the aging threshold is 30s: there
>> will still be ~30 updates/s (if we can evenly distribute the updates
>> from the different nodes). That's still a lot, which may keep the SB
>> server and all the ovn-controllers busy just with these messages. If
>> the aging threshold is set to 300s (5 min), it may look better
>> (~3 updates/s), but this could still contribute the major part of the
>> SB <-> ovn-controller messages. E.g., in an ovn-k8s deployment the
>> cluster LR is distributed on all nodes, so all nodes would need to
>> monitor all mac_binding timestamp updates related to the cluster LR,
>> which means all mac_binding updates from all nodes. In reality the
>> number of messages may be doubled if we use the proposed
>> dump-and-check interval of mac_binding_age_threshold/2.
>
> Arguably the 30 second threshold is way too low. You are right that
> the double update might happen in some cases; that can still be
> mitigated by shifting the timestamp check to, for example, 3/4 of the
> threshold. That should still give us enough time to perform the
> update, but it will greatly reduce the chance of doing the update
> twice.

Should we add some recommendations for "good" aging timeout values in
ovn-nb.xml?

> One thing to keep in mind is that these updates would happen only for
> MAC bindings that have aging enabled. For some parts of the cluster
> that are fixed enough, or where the MAC binding table grows very
> slowly, it might not make sense to enable it; and if it is enabled, it
> might be with a pretty big threshold, e.g. 3000s.

>> So, I'd evaluate both the dataplane and the control plane cost before
>> going for a formal implementation.
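[Editorial sketch of Han's back-of-the-envelope estimate and Ales' mitigation above. With one timestamp update per record-owning node per threshold, the SB sees roughly nodes/threshold updates per second; refreshing only once the stored timestamp passes, e.g., 3/4 of the threshold reduces the chance of duplicate updates from multiple controllers. Function names and the 3/4 default are illustrative, not code from the thread.]

```python
def sb_update_rate(nodes, threshold_s, updates_per_threshold=1):
    """Estimated SB timestamp updates per second when the records'
    owners are spread evenly across `nodes` nodes and each node sends
    `updates_per_threshold` updates per aging threshold (2 in the
    worst case with the threshold/2 dump-and-check interval)."""
    return nodes * updates_per_threshold / threshold_s


def needs_refresh(ts_age, threshold, fraction=0.75):
    """Ales' mitigation: only refresh when the stored timestamp is
    older than a fraction (e.g. 3/4) of the threshold, which still
    leaves time to perform the update but greatly reduces the chance
    of two controllers updating the same record."""
    return ts_age > fraction * threshold
```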
> The control plane impact is really configuration dependent, which is
> fine IMO. If we document the possible outcome, I think it should be
> safe. I will for sure try to minimize it as much as possible by doing
> the mentioned checks/optimizations.

As long as with the default configuration (aging disabled) we don't
have any control/data plane impact, I think we're OK. I do hope that
there's no significant data plane impact with aging enabled either,
though.

>> Thanks,
>> Han
>>
>>>> Also, to "desync" the controllers, a random delay would be added to
>>>> the "dump period".
>>>>
>>>> All of this would be applicable to FDB aging as well.
>>>>
>>>> Does that sound reasonable?
>>>> Please let me know if you have any comments/suggestions.
>>>>
>>>> Thanks,
>>>> Ales

Thanks,
Dumitru
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
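[Editorial sketch of the "desync" idea from Ales' proposal: add a random delay on top of each controller's dump period so the nodes don't all dump flows and write SB timestamps at the same instant. The jitter bound is an invented illustration value; the thread does not specify one.]

```python
import random


def jittered_period(base_period, max_jitter_fraction=0.1, rng=random):
    """Return the dump period plus a random delay of up to
    max_jitter_fraction of the period, spreading the controllers'
    dump times apart so their SB updates don't arrive in lockstep."""
    return base_period * (1 + rng.random() * max_jitter_fraction)
```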