On Fri, Oct 15, 2021 at 3:53 AM Seena Fallah <[email protected]> wrote:
> In the case of having many projects, each project has at least 2 security
> groups and each security group has 5 ACLs, so this ACL number should not
> be very high, I think.

Ok, assume each project has 2 x 5 = 10 ACLs; then 100k ACLs means you have
10k projects. That is not a small number. If each project has its own LRs
and LSes, and 10 ~ 100 workloads, it sounds like something really big. If
instead they share the LRs and LSes and each project has only a few
workloads, then it may be ok. Still, regardless of the scale, I am
surprised that you hit scale problems in the NB but not in the SB.

> In the ovs scenario, I have 250k ACLs and everything works fine!

What do you mean by ACLs in ovs? Do 250k ACLs mean 250k OVS flows? A
single OVN ACL can easily be translated into thousands of OVS flows if a
big address set is referenced by the ACL, so I am not sure what these
numbers mean exactly in your deployment. Do you think OVN is not ready for
this number of ACLs?

> I'm switching from ovs to ovn.
>
> On Fri, Oct 15, 2021 at 4:41 AM Han Zhou <[email protected]> wrote:
>
>> On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah <[email protected]>
>> wrote:
>>
>>> It's mostly on nb.
>>>
>> I am surprised, since we usually don't see any scale problem for the NB
>> DB servers: the SB data size is usually much bigger, and the number of
>> clients is much larger than for the NB DB. So if there were scale
>> problems, they would always show up on the SB before the NB hits any
>> limit.
>> You probably see the NB scale problem but not the SB one because
>> ovn-northd couldn't even translate the NB data to SB yet, due to the NB
>> problem you hit. I'd suggest starting with a smaller scale, making sure
>> it works end to end, and then enlarging it gradually; then you will see
>> the real limit.
>> Somehow 100k ACLs sounds scary to me. Usually the number of ACLs is not
>> so big, but each ACL could reference big address sets and port groups.
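The back-of-envelope numbers above can be sketched out as follows; note
that the 1000-address set size used below is a made-up figure for
illustration, not something from the thread:

```shell
# Rough arithmetic from the thread: 10 ACLs per project
# (2 security groups x 5 ACLs each).
ACLS_PER_PROJECT=$((2 * 5))

# How many projects does 100k ACLs imply?
PROJECTS=$((100000 / ACLS_PER_PROJECT))
echo "projects=$PROJECTS"

# An OVN ACL that matches on an address set can expand to roughly one
# OVS flow per address, so with a hypothetical 1000-address set the 10
# ACLs of one project alone could become ~10,000 OVS flows.
FLOWS_PER_PROJECT=$((ACLS_PER_PROJECT * 1000))
echo "flows_per_project=$FLOWS_PER_PROJECT"
```

This is why comparing "250k ACLs in ovs" with 100k OVN ACLs is apples to
oranges: the OVN number multiplies by the referenced set sizes before it
reaches OVS.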
>> You could probably give more details about your topology and what your
>> typical ACLs look like.
>>
>>> Yes, I set that value to 60000 before, but it didn't help!
>>>
>>> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou <[email protected]> wrote:
>>>
>>>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah <[email protected]>
>>>> wrote:
>>>> >
>>>> > Also I get many logs like this in ovn:
>>>> >
>>>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages
>>>> in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>>>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454:
>>>> receive error: Connection reset by peer
>>>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.0.1:44454:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
>>>> connection dropped (Connection reset by peer)
>>>> >
>>>> > What does "excessive rate" mean? How many req/s counts as an
>>>> excessive rate?
>>>>
>>>> Don't worry about "excessive rate"; that message is about the log
>>>> rate limit itself.
>>>> The "connection reset by peer" indicates that the client-side
>>>> inactivity probe is enabled and the client disconnects when the
>>>> server hasn't responded for a while.
>>>> What server is this? NB or SB? Usually the SB DB has this problem
>>>> when there are lots of nodes and the inactivity probe is not adjusted
>>>> on the nodes (ovn-controllers). Try: ovs-vsctl set open .
>>>> external_ids:ovn-remote-probe-interval=100000 on each node.
>>>>
>>>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> It seems most of the leader failures are on the NB, and the
>>>> command you mentioned is for the SB.
>>>> >>
>>>> >> Do you have any benchmarks of how many ACLs OVN can handle
>>>> normally? I see many failures after 100k ACLs.
>>>> >>
>>>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique <[email protected]>
>>>> wrote:
>>>> >>>
>>>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah <[email protected]>
>>>> wrote:
>>>> >>> >
>>>> >>> > I'm using these versions on a CentOS container:
>>>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>>>> >>> > ovn-nbctl 21.06.0
>>>> >>> > Open vSwitch Library 2.15.90
>>>> >>> > DB Schema 5.32.0
>>>> >>> >
>>>> >>> > Today I saw the election time out too, so I had to increase
>>>> the ovsdb election timeout as well. I looked at the commits but didn't
>>>> find any change related to my problem.
>>>> >>> > If I use OVN 21.09 with ovsdb 2.16, is there still any need to
>>>> increase the election timeout and disable the inactivity probe?
>>>> >>>
>>>> >>> Not sure on that. It's worth a try if you have a test
>>>> environment.
>>>> >>>
>>>> >>> > Also, is there any limitation on the number of ACLs that OVN
>>>> can handle?
>>>> >>>
>>>> >>> I don't think there is any limitation on the number of ACLs. In
>>>> >>> general, as the size of the SB DB increases, we have seen issues.
>>>> >>>
>>>> >>> Can you run the below command on each of your nodes where
>>>> >>> ovn-controller runs and see if that helps?
>>>> >>>
>>>> >>> ---
>>>> >>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>>>> >>> ---
>>>> >>>
>>>> >>> Thanks
>>>> >>> Numan
>>>> >>>
>>>> >>> > Thanks.
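For reference, the tunings discussed in this thread can be collected as
below. This is a sketch, not a verified recipe: the election-timer value
and the ctl socket path are assumptions that depend on your deployment.

```shell
# On each hypervisor (ovn-controller side): raise the client-side
# inactivity probe toward the SB DB, and monitor all records instead of
# using conditional monitoring.
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=100000
ovs-vsctl set open . external_ids:ovn-monitor-all=true

# On the NB DB Raft leader: raise the election timer. The timer can
# only be increased by up to 2x per call, so repeat until you reach the
# target. The socket path is an assumption; adjust for your install.
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl \
    cluster/change-election-timer OVN_Northbound 16000
```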
>>>> >>> >
>>>> >>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique <[email protected]>
>>>> wrote:
>>>> >>> >>
>>>> >>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah <
>>>> [email protected]> wrote:
>>>> >>> >> >
>>>> >>> >> > Hi,
>>>> >>> >> >
>>>> >>> >> > I use OVN as the Neutron plugin for OpenStack in production.
>>>> After a few days I started seeing issues with losing the leader in
>>>> ovsdb. It seems this was because of the failing inactivity probe, and
>>>> because I had 17k ACLs. After I disabled the inactivity probe it
>>>> worked fine, but when I ran a scale test (about 40k ACLs) the leader
>>>> failed again.
>>>> >>> >> > I saw many docs about OVN scale issues raised by both Red
>>>> Hat and eBay, and the solution seems to be rewriting OVN with DDlog.
>>>> I tried it with northd-ddlog but nothing changed.
>>>> >>> >> >
>>>> >>> >> > My question is: should I wait for OVN to become more stable
>>>> at high scale, or is there some tuning I'm missing in my deployment?
>>>> >>> >> > Also, will ovn-nb/sb be rewritten with DDlog, and can that
>>>> help with the issues at high scale? If yes, is there a due date?
>>>> >>> >>
>>>> >>> >> What is the ovsdb-server version you're using? There are many
>>>> >>> >> improvements in the ovsdb-server in 2.16. Maybe that would
>>>> help in your deployment. And there were also many improvements that
>>>> went into OVN 21.09, if you want to test it out.
>>>> >>> >>
>>>> >>> >> Thanks
>>>> >>> >> Numan
>>>> >>> >>
>>>> >>> >> > Thanks.
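To check whether a deployment already carries the 2.16 / 21.09
improvements Numan mentions, the running versions can be read on each
node (assuming the binaries are on PATH):

```shell
# Report the running component versions; compare against OVS 2.16 and
# OVN 21.09.
ovsdb-server --version
ovn-nbctl --version
ovn-northd --version
```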
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
