On Fri, Oct 15, 2021 at 3:53 AM Seena Fallah <[email protected]> wrote:
> In the case of having many projects, each project has at least 2 security
> groups and each security group has 5 ACLs, so this ACL number should not
> be very high, I think.

Ok, assume each project has 2 x 5 = 10 ACLs; then 100k ACLs means you have
10k projects. That is not a small number. If each project has its own LRs
and LSes, and 10 ~ 100 workloads, it sounds like something really big. If
instead they share the LRs and LSes and each project has only a few
workloads, then it may be ok. Still, regardless of the scale, I am
surprised that you hit scale problems in the NB but not in the SB.

> In the ovs scenario, I have 250k ACLs and everything works fine!

What do you mean by ACLs in ovs? Do 250k ACLs mean 250k OVS flows? A
single OVN ACL can easily be translated into thousands of OVS flows if a
big address set is referenced by the ACL, so I am not sure what these
numbers mean exactly in your deployment. Do you think OVN is not ready for
this number of ACLs?

> I'm switching from ovs to ovn.
>
> On Fri, Oct 15, 2021 at 4:41 AM Han Zhou <[email protected]> wrote:
>
>> On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah <[email protected]>
>> wrote:
>>
>>> It's mostly on nb.
>>>
>> I am surprised, since we usually don't see any scale problem for the NB
>> DB servers: the SB data size is usually much bigger, and the number of
>> clients is much larger than for the NB DB. So if there were scale
>> problems, they would always show up on the SB before the NB hits any
>> limit.
>> You probably see the NB scale problem but not the SB one because
>> ovn-northd couldn't even translate the NB data to SB yet, due to the NB
>> problem you hit. I'd suggest starting with a smaller scale, making sure
>> it works end to end, and then enlarging it gradually; then you will see
>> the real limit.
>> Somehow 100k ACLs sounds scary to me. Usually the number of ACLs is not
>> so big, but each ACL could reference big address sets and port groups.
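The back-of-envelope numbers above can be sketched out as follows; note
that the 1000-address set size used below is a made-up figure for
illustration, not something from the thread:

```shell
# Rough arithmetic from the thread: 10 ACLs per project
# (2 security groups x 5 ACLs each).
ACLS_PER_PROJECT=$((2 * 5))

# How many projects does 100k ACLs imply?
PROJECTS=$((100000 / ACLS_PER_PROJECT))
echo "projects=$PROJECTS"

# An OVN ACL that matches on an address set can expand to roughly one
# OVS flow per address, so with a hypothetical 1000-address set the 10
# ACLs of one project alone could become ~10,000 OVS flows.
FLOWS_PER_PROJECT=$((ACLS_PER_PROJECT * 1000))
echo "flows_per_project=$FLOWS_PER_PROJECT"
```

This is why comparing "250k ACLs in ovs" with 100k OVN ACLs is apples to
oranges: the OVN number multiplies by the referenced set sizes before it
reaches OVS.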
>> You could probably give more details about your topology and what your
>> typical ACLs look like.
>>
>>> Yes, I set that value to 60000 before, but it didn't help!
>>>
>>> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou <[email protected]> wrote:
>>>
>>>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah <[email protected]>
>>>> wrote:
>>>> >
>>>> > Also I get many logs like this in ovn:
>>>> >
>>>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages
>>>> in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>>>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454:
>>>> receive error: Connection reset by peer
>>>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.0.1:44454:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
>>>> connection dropped (Connection reset by peer)
>>>> >
>>>> > What does "excessive rate" mean? How many req/s counts as an
>>>> excessive rate?
>>>>
>>>> Don't worry about "excessive rate"; that message is about the log
>>>> rate limit itself.
>>>> The "connection reset by peer" indicates that the client-side
>>>> inactivity probe is enabled and the client disconnects when the
>>>> server hasn't responded for a while.
>>>> What server is this? NB or SB? Usually the SB DB has this problem
>>>> when there are lots of nodes and the inactivity probe is not adjusted
>>>> on the nodes (ovn-controllers). Try: ovs-vsctl set open .
>>>> external_ids:ovn-remote-probe-interval=100000 on each node.
>>>>
>>>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> It seems most of the leader failures are on the NB, and the
>>>> command you mentioned is for the SB.
>>>> >>
>>>> >> Do you have any benchmarks of how many ACLs OVN can handle
>>>> normally? I see many failures after 100k ACLs.
>>>> >>
>>>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique <[email protected]>
>>>> wrote:
>>>> >>>
>>>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah <[email protected]>
>>>> wrote:
>>>> >>> >
>>>> >>> > I'm using these versions on a CentOS container:
>>>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>>>> >>> > ovn-nbctl 21.06.0
>>>> >>> > Open vSwitch Library 2.15.90
>>>> >>> > DB Schema 5.32.0
>>>> >>> >
>>>> >>> > Today I saw the election time out too, so I had to increase
>>>> the ovsdb election timeout as well. I looked at the commits but didn't
>>>> find any change related to my problem.
>>>> >>> > If I use OVN 21.09 with ovsdb 2.16, is there still any need to
>>>> increase the election timeout and disable the inactivity probe?
>>>> >>>
>>>> >>> Not sure on that. It's worth a try if you have a test
>>>> environment.
>>>> >>>
>>>> >>> > Also, is there any limitation on the number of ACLs that OVN
>>>> can handle?
>>>> >>>
>>>> >>> I don't think there is any limitation on the number of ACLs. In
>>>> >>> general, as the size of the SB DB increases, we have seen issues.
>>>> >>>
>>>> >>> Can you run the below command on each of your nodes where
>>>> >>> ovn-controller runs and see if that helps?
>>>> >>>
>>>> >>> ---
>>>> >>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>>>> >>> ---
>>>> >>>
>>>> >>> Thanks
>>>> >>> Numan
>>>> >>>
>>>> >>> > Thanks.
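For reference, the tunings discussed in this thread can be collected as
below. This is a sketch, not a verified recipe: the election-timer value
and the ctl socket path are assumptions that depend on your deployment.

```shell
# On each hypervisor (ovn-controller side): raise the client-side
# inactivity probe toward the SB DB, and monitor all records instead of
# using conditional monitoring.
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=100000
ovs-vsctl set open . external_ids:ovn-monitor-all=true

# On the NB DB Raft leader: raise the election timer. The timer can
# only be increased by up to 2x per call, so repeat until you reach the
# target. The socket path is an assumption; adjust for your install.
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl \
    cluster/change-election-timer OVN_Northbound 16000
```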
>>>> >>> >
>>>> >>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique <[email protected]>
>>>> wrote:
>>>> >>> >>
>>>> >>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah <
>>>> [email protected]> wrote:
>>>> >>> >> >
>>>> >>> >> > Hi,
>>>> >>> >> >
>>>> >>> >> > I use OVN as the Neutron plugin for OpenStack in production.
>>>> After a few days I started seeing issues with losing the leader in
>>>> ovsdb. It seems this was because of the failing inactivity probe, and
>>>> because I had 17k ACLs. After I disabled the inactivity probe it
>>>> worked fine, but when I ran a scale test (about 40k ACLs) the leader
>>>> failed again.
>>>> >>> >> > I saw many docs about OVN scale issues raised by both Red
>>>> Hat and eBay, and the solution seems to be rewriting OVN with DDlog.
>>>> I tried it with northd-ddlog but nothing changed.
>>>> >>> >> >
>>>> >>> >> > My question is: should I wait for OVN to become more stable
>>>> at high scale, or is there some tuning I'm missing in my deployment?
>>>> >>> >> > Also, will ovn-nb/sb be rewritten with DDlog, and can that
>>>> help with the issues at high scale? If yes, is there a due date?
>>>> >>> >>
>>>> >>> >> What is the ovsdb-server version you're using? There are many
>>>> >>> >> improvements in the ovsdb-server in 2.16. Maybe that would
>>>> help in your deployment. And there were also many improvements that
>>>> went into OVN 21.09, if you want to test it out.
>>>> >>> >>
>>>> >>> >> Thanks
>>>> >>> >> Numan
>>>> >>> >>
>>>> >>> >> > Thanks.
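To check whether a deployment already carries the 2.16 / 21.09
improvements Numan mentions, the running versions can be read on each
node (assuming the binaries are on PATH):

```shell
# Report the running component versions; compare against OVS 2.16 and
# OVN 21.09.
ovsdb-server --version
ovn-nbctl --version
ovn-northd --version
```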
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
