I failed to send the reply to the list... I should reactivate mutt. :-P

---------- Forwarded message ---------
From: Christian Stelter <[email protected]>
Date: Wed, Dec 1, 2021 at 6:32 PM
Subject: Re: [ovs-discuss] Packet loss when using default security group with self reference
To: Numan Siddique <[email protected]>
Hi!

On Tue, Nov 30, 2021 at 7:22 PM Numan Siddique <[email protected]> wrote:
>
> On Tue, Nov 30, 2021 at 12:13 PM Daniel Alvarez <[email protected]> wrote:
> >
> > Hey Christian
> >
> > > On 30 Nov 2021, at 18:06, Christian Stelter <[email protected]> wrote:
> > >
> > > Hi!
> > >
> > > We are currently observing packet loss on a 3-node etcd cluster (all 3
> > > nodes on different hypervisors) on one of our OpenStack clusters running
> > > the Victoria release, deployed via kolla-ansible.
> > >
> > > The Open vSwitch library is version 2.13.3, ovn-controller is version
> > > 20.03.2, and the underlying OS is Ubuntu 20.04 with current patches.
> > >
> > > We can reproduce the packet loss with this etcd setup in different
> > > projects on that cluster, but not on a second cluster (our stage env)
> > > with the same software versions, the same hardware components, and the
> > > same sizing.
> > >
> > > When we replace the default security group with a security group that
> > > uses the CIDR of the project network as the remote prefix instead of
> > > "default" as the remote security group in the ingress rule (IPv4 Any
> > > Any), the etcd cluster performs without packet loss/recurring leader
> > > elections.
> >
> > I am confused, as the default SG will block ingress traffic in OpenStack
> > by default.
> >
> > As this is an OVS/OVN ML, I would suggest sharing the ACLs/Logical
> > Flows/OpenFlows for both cases. This question, framed like this, requires
> > OpenStack knowledge (maybe even kolla-ansible knowledge, if its default SG
> > differs from the reference implementation) as well as etcd knowledge, so I
> > would advise isolating the traffic pattern as much as possible, and also
> > sharing the packet loss % and other potentially useful data.
> >
> > > Other projects or applications seem not to be impacted. At least none
> > > that we know of.
> > >
> > > Any hints as to what could cause such behavior? We suspect it's just a
> > > symptom of another problem that we are currently not aware of.
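For reference, the data Daniel asks for can be collected roughly like this (a sketch, not a definitive runbook; the logical switch name is a placeholder for whatever your deployment uses, and `br-int` is the usual OVN integration bridge name):

```shell
# Dump the ACLs of the logical switch backing the project network.
# "neutron-<network-uuid>" is a placeholder; list the switches with
# `ovn-nbctl ls-list` to find the right one.
ovn-nbctl acl-list neutron-<network-uuid>

# Dump the logical flows from the OVN southbound database.
ovn-sbctl lflow-list

# Dump the OpenFlow flows on the hypervisor's integration bridge
# (OVN bridges typically speak OpenFlow 1.3+, hence the -O flag).
ovs-ofctl -O OpenFlow13 dump-flows br-int
```

Running these once with the "default" remote security group and once with the CIDR-based rule, then diffing the outputs, should narrow down where the two cases diverge.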
>
> In my opinion this could be due to an old bug in ovn-controller related to
> wrong conjunction id generation.
>
> Is it possible for you to test with the latest OVN version?
>
> If not, can you run the below command and see if the packet loss issue is
> resolved?
>
> Run - ovn-appctl -t ovn-controller recompute
>
> If running this command solves the issue, then it's definitely a known
> issue which has been fixed in later versions. If you can confirm this
> works, I can share the commit which fixed the issue.

Well, before trying this on the prod env we are trying to reproduce the error on the stage env first, so that we are not limited to doing our tests during maintenance windows. So currently we're trying to figure out how to build a reproducible load test setup on the stage env with a comparable amount of flows on the compute nodes. The etcd setup alone is not enough yet.

As for the ACLs/Logical Flows/OpenFlows, I will collect the data and report back.

Best regards,
Christian Stelter
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
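P.S.: for comparing load-test runs numerically, a tiny helper like this can pull the loss percentage out of ping's output (a sketch; it parses the summary line of GNU/iputils ping, and other ping implementations word that line differently):

```shell
# parse_loss: extract the packet-loss percentage from a ping summary
# line read on stdin, e.g. from `ping -c 100 <host> | parse_loss`.
parse_loss() {
  # ping's summary is comma-separated; the third field looks like
  # "3% packet loss", so strip everything from the "%" onward.
  awk -F', ' '/packet loss/ { sub(/%.*/, "", $3); print $3 }'
}

# Example against a captured summary line:
echo "100 packets transmitted, 97 received, 3% packet loss, time 9912ms" | parse_loss
# prints: 3
```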
