On 3/21/23 07:18, Jake Yip wrote:
> 
> 
> On 20/3/2023 10:51 pm, Ilya Maximets wrote:
>> On 3/16/23 23:06, Jake Yip wrote:
>>> Hi all,
>>>
>>> Apologies for jumping into this thread. We are seeing the same thing, and 
>>> it's nice to find someone with similar issues :)
>>>
>>> On 8/3/2023 3:43 am, Ilya Maximets via discuss wrote:
>>>>>>
>>>>>> We see failures on the OVSDB Relay side:
>>>>>>
>>>>>> 2023-03-06T22:19:32.966Z|00099|reconnect|ERR|ssl:xxx:16642: no response 
>>>>>> to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-03-06T22:19:32.966Z|00100|reconnect|INFO|ssl:xxx:16642: connection 
>>>>>> dropped
>>>>>> 2023-03-06T22:19:40.989Z|00101|reconnect|INFO|ssl:xxx:16642: connected
>>>>>> 2023-03-06T22:19:50.997Z|00102|reconnect|ERR|ssl:xxx:16642: no response 
>>>>>> to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-03-06T22:19:50.997Z|00103|reconnect|INFO|ssl:xxx:16642: connection 
>>>>>> dropped
>>>>>> 2023-03-06T22:19:59.022Z|00104|reconnect|INFO|ssl:xxx:16642: connected
>>>>>> 2023-03-06T22:20:09.026Z|00105|reconnect|ERR|ssl:xxx:16642: no response 
>>>>>> to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-03-06T22:20:09.026Z|00106|reconnect|INFO|ssl:xxx:16642: connection 
>>>>>> dropped
>>>>>> 2023-03-06T22:20:17.052Z|00107|reconnect|INFO|ssl:xxx:16642: connected
>>>>>> 2023-03-06T22:20:27.056Z|00108|reconnect|ERR|ssl:xxx:16642: no response 
>>>>>> to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-03-06T22:20:27.056Z|00109|reconnect|INFO|ssl:xxx:16642: connection 
>>>>>> dropped
>>>>>> 2023-03-06T22:20:35.111Z|00110|reconnect|INFO|ssl:xxx:16642: connected
>>>>>>
>>>>>> On the DB cluster this looks like:
>>>>>>
>>>>>> 2023-03-06T22:19:04.208Z|00451|stream_ssl|WARN|SSL_read: unexpected SSL 
>>>>>> connection close
>>>>>> 2023-03-06T22:19:04.211Z|00452|reconnect|WARN|ssl:xxx:52590: connection 
>>>>>> dropped (Protocol error)
>>>>
>>>> OK.  These are symptoms.  The cause must be something like an
>>>> 'Unreasonably long MANY ms poll interval' on the DB cluster side,
>>>> i.e. the reason why the main DB cluster didn't reply to the
>>>> probes sent from the relay.  As soon as the server receives a
>>>> probe, it replies right back; if it didn't reply, it was busy
>>>> doing something else for an extended period of time.  Here "MANY"
>>>> means more than 5 seconds.
>>>>
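
For reference, the probe logic described above works roughly like this
(a minimal sketch only; the struct, the state names and the
send_echo_request() helper are illustrative, not the actual
lib/reconnect.c state machine):

#include <stdbool.h>
#include <stdio.h>

enum probe_state { PROBE_IDLE, PROBE_SENT };

struct probe {
    enum probe_state state;
    long long int last_activity;   /* Time of last received message, ms. */
    long long int interval;        /* Probe interval, e.g. 5000 ms. */
};

static void
send_echo_request(void)
{
    /* Stand-in for sending a JSON-RPC "echo" request to the peer. */
    printf("sending inactivity probe\n");
}

/* Returns true if the connection should be dropped. */
static bool
probe_run(struct probe *p, long long int now, bool received_any_message)
{
    if (received_any_message) {
        /* Any incoming message, including the echo reply, counts as
         * activity and resets the probe. */
        p->state = PROBE_IDLE;
        p->last_activity = now;
        return false;
    }
    if (now - p->last_activity < p->interval) {
        return false;
    }
    if (p->state == PROBE_IDLE) {
        /* Idle for a full interval: send a probe and start waiting. */
        send_echo_request();
        p->state = PROBE_SENT;
        p->last_activity = now;
        return false;
    }
    /* A probe was sent and no reply arrived within another interval:
     * "no response to inactivity probe after N seconds, disconnecting". */
    return true;
}
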
>>>
>>> We are seeing the same issue here after moving to OVN relay.
>>>
>>> - On the relay "no response to inactivity probe after 5 seconds"
>>> - On the OVSDB cluster
>>>    - "Unreasonably long 1726ms poll interval"
>>>    - "connection dropped (Input/output error)"
>>>    - "SSL_write: system error (Broken pipe)"
>>>    - 100% CPU on northd process
>>>
>>> Is there anything we could look for on the OVSDB side to narrow down what 
>>> may be causing the load on the cluster side?
>>>
>>> A brief history - We are migrating an OpenStack cloud from MidoNet to OVN. 
>>> This cloud has roughly
>>>
>>> - 400 neutron networks / ovn logical switches
>>> - 300 neutron routers
>>> - 14000 neutron ports / ovn logical switchports
>>> - 28000 neutron security groups / ovn port groups
>>> - 80000 neutron secgroup rules / acls
>>>
>>> We populated the OVN DB using the OpenStack/Neutron OVN sync script.
>>>
>>> We have attempted the migration twice previously (2021, 2022) but failed 
>>> due to load issues. We've reported issues and have seen lots of performance 
>>> improvements over the last two years. Here is a BIG thank you to the dev 
>>> teams!
>>>
>>> We are now on the following versions
>>>
>>> - OVS 2.17
>>> - OVN 22.03
>>>
>>> We are exploring an upgrade as an option, but I am concerned that there may 
>>> be something fundamentally wrong with the data / config we have that is 
>>> causing the high load, and would like to rule that out first. Please let me 
>>> know if you need more information; I'll be happy to start a new thread too.
>>
>> Hi, Jake.  Your scale numbers are fairly high, i.e. this number of
>> objects in the setup may indeed create a noticeable load.
>>
>> The fact that the relay is disconnecting while the main cluster side only
>> shows 1726ms poll intervals is a bit strange.  Not sure why this happened.
>> Normally it should be 5+ seconds.
> 
> There are multiple errors. I just grabbed the first one I found; indeed there 
> are poll intervals >5 seconds, like:
> 
> ovs|05000|timeval|WARN|Unreasonably long 13209ms poll interval (12942ms user, 
> 264ms system)

Yeah, this one is pretty high.  Is it, by any chance, database compaction
related?  i.e. are there database compaction related logs in close
proximity to this one?

If all the huge poll intervals are compaction-related, an upgrade to
OVS 3.0+ may completely solve the issue, since most of the compaction
work is moved into a separate thread there.
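
To illustrate the idea (a conceptual sketch only, not the actual
ovsdb-server code; all names here are made up for illustration): the
blocking maintenance work is handed to a worker thread, so the main
poll loop keeps answering inactivity probes while compaction runs.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for writing the compacted database snapshot to disk,
 * which can take many seconds on a large database. */
static void *
compaction_thread(void *arg)
{
    (void) arg;
    sleep(10);
    printf("compaction finished\n");
    return NULL;
}

int
main(void)
{
    pthread_t tid;
    bool compacting = false;

    for (int i = 0; i < 15; i++) {
        if (!compacting) {
            /* Kick off compaction without blocking the loop. */
            pthread_create(&tid, NULL, compaction_thread, NULL);
            compacting = true;
        }
        /* The loop stays responsive: probes are answered here instead
         * of waiting 10+ seconds for the compaction to finish. */
        printf("poll iteration %d: answering probes\n", i);
        sleep(1);
    }
    pthread_join(tid, NULL);
    return 0;
}

With pre-3.0 ovsdb-server the equivalent work happens inline in the
main loop, which is exactly when you would see multi-second poll
intervals and the resulting probe timeouts on the relay.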

> 
>>
>> The versions you're using have an upgrade path with potentially
>> significant performance improvements, e.g. OVS 3.1 + OVN 23.03.
>> Both ovsdb-server and the core OVN components became much faster
>> over the past year.
>>
> 
> Thanks for the work you've put into OVN. I've seen your conference 
> presentations and believe that is a valid way forward. One thing holding us 
> back is that there are no Ubuntu packages for these versions, so we may need 
> to build them.

I see that Ubuntu 22.10 is providing OVS 3.0 + OVN 22.09, which is
a decent combination performance-wise.  But yeah, I'm not sure if
these can be easily installed on 22.04 or earlier.

> 
> We may also explore containers, but we are still not sure how containerised 
> openvswitch works.
> 
> Another issue is whether the integration will work - we are using Neutron 
> Yoga. I believe that, since OVN can be upgraded from one LTS to the next, 
> Neutron Yoga should work with OVS 3.1 + OVN 23.03?

CC: Frode, maybe you can answer that?

> 
>> I'm not sure if there is anything fundamentally wrong with your setup,
>> other than the total amount of resources.
>>
>> If you have the freedom to build your own packages and the relay
>> disconnection is the main problem in your setup, you may try something
>> like this:
>>
>> diff --git a/ovsdb/relay.c b/ovsdb/relay.c
>> index 9ff6ed8f3..5c5937c27 100644
>> --- a/ovsdb/relay.c
>> +++ b/ovsdb/relay.c
>> @@ -152,6 +152,7 @@ ovsdb_relay_add_db(struct ovsdb *db, const char *remote,
>>       shash_add(&relay_dbs, db->name, ctx);
>>       ovsdb_cs_set_leader_only(ctx->cs, false);
>>       ovsdb_cs_set_remote(ctx->cs, remote, true);
>> +    ovsdb_cs_set_probe_interval(ctx->cs, 16000);
>>         VLOG_DBG("added database: %s, %s", db->name, remote);
>>   }
>> ---
>>
>> This change sets the default inactivity probe interval for the
>> relay-to-server connection to 16 seconds.
>>
>> Best regards, Ilya Maximets.
> 
> Thanks, we will keep this in mind; we may need to build packages after all.
> 
> Regards,
> Jake
