On 3/1/23 15:24, Brendan Doyle wrote:
> 
> 
> On 01/03/2023 13:13, Ilya Maximets wrote:
>> On 3/1/23 12:42, Brendan Doyle via discuss wrote:
>>> Hi Folks,
>>>
>>> Whilst launching a lot (150+) of VMs on a hypervisor, we occasionally
>>> see errors like
>>> "libvirt: error : internal error: Unable to add port vnet114 to OVS bridge 
>>> br-int"
>>>
>>> in /var/log/messages, for example:
>>>
>>>
>>> Feb 25 04:13:15 pcacn001 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl
>>> --timeout=5 -- --may-exist add-port br-int vnet87 -- set Interface vnet87
>>> "external-ids:attached-mac=\"00:13:97:42:ff:5d\"" -- set Interface vnet87
>>> "external-ids:iface-id=\"b652aef0-b4c2-11ed-9fba-a8698c05eb74\"" -- set
>>> Interface vnet87
>>> "external-ids:vm-id=\"ba70d0b3-9b47-48e8-9ed4-a8bf04bae901\"" -- set
>>> Interface vnet87 external-ids:iface-status=active
>>>
>>> Feb 25 04:13:20 pcacn001 journal: internal error: Child process (ovs-vsctl
>>> --timeout=5 -- --may-exist add-port br-int vnet87 -- set Interface vnet87
>>> 'external-ids:attached-mac="00:13:97:42:ff:5d"' -- set Interface vnet87
>>> 'external-ids:iface-id="b652aef0-b4c2-11ed-9fba-a8698c05eb74"' -- set
>>> Interface vnet87 'external-ids:vm-id="ba70d0b3-9b47-48e8-9ed4-a8bf04bae901"'
>>> -- set Interface vnet87 external-ids:iface-status=active) unexpected fatal
>>> signal 14: 2023-02-25T04:13:20Z|00002|fatal_signal|WARN|terminating with
>>> signal 14 (Alarm clock)
>>> Feb 25 04:13:20 pcacn001 journal: internal error: Unable to add port vnet87
>>> to OVS bridge br-int
>>>
>>>
>>> There is nothing suspicious in the ovn-controller log, but the ovsdb-server
>>> and ovs-vswitchd logs for this time period do contain WARNs:
>>>
>>> ovsdb-server.log
>>> ================
>>> 2023-02-25T04:13:08.896Z|108756|jsonrpc|WARN|unix#84848: receive error: 
>>> Connection reset by peer
>>> 2023-02-25T04:13:08.896Z|108757|reconnect|WARN|unix#84848: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:08.910Z|108758|reconnect|WARN|unix#84850: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:08.918Z|108759|reconnect|WARN|unix#84851: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:08.925Z|108760|reconnect|WARN|unix#84852: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:08.932Z|108761|reconnect|WARN|unix#84853: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:08.938Z|108762|reconnect|WARN|unix#84854: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:09.238Z|108763|reconnect|WARN|unix#84855: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:09.277Z|108764|reconnect|WARN|unix#84856: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:09.284Z|108765|reconnect|WARN|unix#84857: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:09.300Z|108766|reconnect|WARN|unix#84859: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:09.308Z|108767|reconnect|WARN|unix#84860: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:13:09.323Z|108768|reconnect|WARN|unix#84862: connection 
>>> dropped (Connection reset by peer)
>>> 2023-02-25T04:14:46.220Z|108769|jsonrpc|WARN|Dropped 11 log messages in 
>>> last 97 seconds (most recently, 97 seconds ago) due to excessive rate
>>> 2023-02-25T04:14:46.220Z|108770|jsonrpc|WARN|unix#84870: receive error: 
>>> Connection reset by peer
>>>
>>> ovs-vswitchd.log
>>> =================
>>> 2023-02-25T04:13:00.849Z|00258|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:01.694Z|04014|timeval|WARN|Unreasonably long 1910ms poll 
>>> interval (902ms user, 1007ms system)
>>> 2023-02-25T04:13:01.694Z|04015|timeval|WARN|faults: 21 minor, 0 major
>>> 2023-02-25T04:13:01.694Z|04016|timeval|WARN|context switches: 0 voluntary, 
>>> 2 involuntary
>>> 2023-02-25T04:13:01.694Z|04017|coverage|INFO|Dropped 12 log messages in 
>>> last 65 seconds (most recently, 40 seconds ago) due to excessive rate
>>> 2023-02-25T04:13:01.694Z|04018|coverage|INFO|Skipping details of duplicate 
>>> event coverage for hash=ce7b9f83
>>> 2023-02-25T04:13:01.722Z|04019|bridge|INFO|bridge br-int: added interface 
>>> vnet86 on port 204
>>> 2023-02-25T04:13:02.695Z|00259|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:03.676Z|04020|timeval|WARN|Unreasonably long 1982ms poll 
>>> interval (895ms user, 1084ms system)
>>> 2023-02-25T04:13:03.676Z|04021|timeval|WARN|faults: 1 minor, 0 major
>>> 2023-02-25T04:13:03.676Z|04022|timeval|WARN|context switches: 0 voluntary, 
>>> 5 involuntary
>>> 2023-02-25T04:13:04.676Z|00260|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:05.658Z|04023|timeval|WARN|Unreasonably long 1982ms poll 
>>> interval (922ms user, 1059ms system)
>>> 2023-02-25T04:13:05.658Z|04024|timeval|WARN|faults: 18 minor, 0 major
>>> 2023-02-25T04:13:05.658Z|04025|timeval|WARN|context switches: 2 voluntary, 
>>> 5 involuntary
>>> 2023-02-25T04:13:05.658Z|04026|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
>>> 2023-02-25T04:13:06.672Z|00261|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:07.616Z|04027|timeval|WARN|Unreasonably long 1958ms poll 
>>> interval (916ms user, 1025ms system)
>>> 2023-02-25T04:13:07.616Z|04028|timeval|WARN|context switches: 3 voluntary, 
>>> 6 involuntary
>>> 2023-02-25T04:13:07.616Z|04029|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
>>> 2023-02-25T04:13:08.616Z|00262|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:09.536Z|04030|timeval|WARN|Unreasonably long 1920ms poll 
>>> interval (919ms user, 999ms system)
>>> 2023-02-25T04:13:09.536Z|04031|timeval|WARN|faults: 2 minor, 0 major
>>> 2023-02-25T04:13:09.536Z|04032|timeval|WARN|context switches: 4 voluntary, 
>>> 3 involuntary
>>> 2023-02-25T04:13:09.536Z|04033|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (101% CPU 
>>> usage)
>>> 2023-02-25T04:13:09.537Z|04034|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 22 (FIFO pipe:[52276]) at vswitchd/bridge.c:423 (101% CPU usage)
>>> 2023-02-25T04:13:09.537Z|04035|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 529 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (101% CPU 
>>> usage)
>>> 2023-02-25T04:13:10.538Z|00263|ovs_rcu(urcu9)|WARN|blocked 1001 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:11.512Z|04036|timeval|WARN|Unreasonably long 1976ms poll 
>>> interval (918ms user, 1023ms system)
>>> 2023-02-25T04:13:11.512Z|04037|timeval|WARN|faults: 3 minor, 0 major
>>> 2023-02-25T04:13:11.512Z|04038|timeval|WARN|context switches: 8 voluntary, 
>>> 5 involuntary
>>> 2023-02-25T04:13:11.512Z|04039|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 529 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (101% CPU 
>>> usage)
>>> 2023-02-25T04:13:11.522Z|04040|poll_loop|INFO|wakeup due to 0-ms timeout at 
>>> ofproto/ofproto-dpif.c:1976 (101% CPU usage)
>>> 2023-02-25T04:13:11.972Z|04041|poll_loop|INFO|Dropped 1 log messages in 
>>> last 1 seconds (most recently, 1 seconds ago) due to excessive rate
>>> 2023-02-25T04:13:11.972Z|04042|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 22 (FIFO pipe:[52276]) at vswitchd/bridge.c:423 (101% CPU usage)
>>> 2023-02-25T04:13:16.472Z|00264|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:17.472Z|00265|ovs_rcu(urcu9)|WARN|blocked 2000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:18.405Z|04043|connmgr|INFO|br-int<->unix#1: 60 flow_mods 
>>> in the 16 s starting 25 s ago (58 adds, 2 deletes)
>>> 2023-02-25T04:13:18.487Z|04044|timeval|WARN|Unreasonably long 3303ms poll 
>>> interval (1569ms user, 1731ms system)
>>> 2023-02-25T04:13:18.487Z|04045|timeval|WARN|faults: 11 minor, 0 major
>>> 2023-02-25T04:13:18.487Z|04046|timeval|WARN|context switches: 1 voluntary, 
>>> 8 involuntary
>>> 2023-02-25T04:13:18.487Z|04047|poll_loop|INFO|Dropped 10 log messages in 
>>> last 6 seconds (most recently, 3 seconds ago) due to excessive rate
>>> 2023-02-25T04:13:18.487Z|04048|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (57% CPU 
>>> usage)
>>> 2023-02-25T04:13:18.518Z|04049|bridge|INFO|bridge br-int: added interface 
>>> vnet87 on port 205
>>> 2023-02-25T04:13:19.487Z|00266|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:20.475Z|04050|ofproto|WARN|br-int: cannot get STP status 
>>> on nonexistent port 205
>>> 2023-02-25T04:13:20.475Z|04051|ofproto|WARN|br-int: cannot get RSTP status 
>>> on nonexistent port 205
>>> 2023-02-25T04:13:20.547Z|04052|timeval|WARN|Unreasonably long 2060ms poll 
>>> interval (910ms user, 1092ms system)
>>> 2023-02-25T04:13:20.548Z|04053|timeval|WARN|faults: 9 minor, 0 major
>>> 2023-02-25T04:13:20.548Z|04054|timeval|WARN|context switches: 8 voluntary, 
>>> 4 involuntary
>>> 2023-02-25T04:13:20.578Z|04055|bridge|INFO|bridge br-int: deleted interface 
>>> vnet87 on port 205
>>> 2023-02-25T04:13:20.581Z|04056|bridge|WARN|could not open network device 
>>> vnet87 (No such device)
>>> 2023-02-25T04:13:21.550Z|00267|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2023-02-25T04:13:22.509Z|04057|timeval|WARN|Unreasonably long 1961ms poll 
>>> interval (908ms user, 1048ms system)
>>> 2023-02-25T04:13:22.509Z|04058|timeval|WARN|faults: 1 minor, 0 major
>>> 2023-02-25T04:13:22.509Z|04059|timeval|WARN|context switches: 6 voluntary, 
>>> 7 involuntary
>>> 2023-02-25T04:13:24.026Z|04060|poll_loop|INFO|Dropped 12 log messages in 
>>> last 6 seconds (most recently, 1 seconds ago) due to excessive rate
>>>
>>>
>>>
>>> Any idea what's going on here? Should the ovs-vsctl timeout be longer?
>>> Who is sending signal 14 to it, and why?
>> ovs-vsctl sends SIGALRM to itself after 5 seconds, because it is
>> started with --timeout=5.
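>>
>> For instance (a sketch, with a hypothetical port name), you can
>> reproduce the same failure mode in isolation by giving ovs-vsctl a
>> budget that a busy ovs-vswitchd can't meet:
>>
>>     # With --timeout=1, ovs-vsctl arms a 1-second alarm; if
>>     # ovs-vswitchd doesn't finish acting on the change in time, the
>>     # process kills itself with SIGALRM (signal 14), producing the
>>     # same "Alarm clock" message as in the libvirt log above.
>>     ovs-vsctl --timeout=1 -- --may-exist add-port br-int vnet-test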
>>
>>> What does this ovs-vswitchd WARN mean:
>>> 2023-02-25T04:13:02.695Z|00259|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>> That means the main thread was busy doing some work for at least one
>> second without entering a quiescent state.  Basically, one iteration
>> of the main loop took a very long time.
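>>
>> If you want to see what the main thread is busy with when that fires,
>> a one-shot backtrace is usually enough (a sketch; adjust the pid
>> lookup to your setup):
>>
>>     # Dump a backtrace of every ovs-vswitchd thread; run it a few
>>     # times while the "waiting for main to quiesce" warnings appear.
>>     gdb -p "$(pidof ovs-vswitchd)" -batch -ex 'thread apply all bt'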
>>
>> Looking at this:
>>    Unreasonably long 2060ms poll interval (910ms user, 1092ms system)
>>
>> ovs-vswitchd is likely stuck somewhere inside a syscall in the
>> kernel for 1092 ms.  That is a lot of time.
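>>
>> To attribute that system time, counting syscalls from the outside
>> should narrow it down (a sketch; strace adds real overhead, so only
>> attach briefly):
>>
>>     # Attach to all threads, let it run ~10 seconds, then Ctrl-C to
>>     # print the per-syscall time and count summary.
>>     strace -c -f -p "$(pidof ovs-vswitchd)"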
>>
>>> It seems that eventually the port is added even though the ovs-vsctl
>>> command failed:
>>> ovs-vsctl list-ports br-int | grep vnet87
>>> vnet87
>> The database is fine, so the port was added to it without issues, but
>> ovs-vsctl waits for ovs-vswitchd to act on the database update.  That
>> didn't happen within 5 seconds, so ovs-vsctl reported an error.
>>
>> 'ovs-vsctl list-ports' is a pure database command; it just shows the
>> database content.
>>
>> It's also likely that ovs-vswitchd will actually add the port
>> eventually, once it gets unstuck from the kernel.  But by then it's
>> too late for ovs-vsctl.
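>>
>> As an aside (libvirt builds that command line itself, so this is just
>> to make the distinction visible), the wait-for-ovs-vswitchd step is
>> exactly what --no-wait skips (hypothetical port name again):
>>
>>     # Returns as soon as the transaction commits to ovsdb-server,
>>     # without waiting for ovs-vswitchd to act on the update.
>>     ovs-vsctl --no-wait add-port br-int vnet-test
>>     # Pure database read, so the port shows up even before
>>     # ovs-vswitchd has caught up.
>>     ovs-vsctl list-ports br-int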
>>
>> You need to investigate what is going on with ovs-vswitchd, i.e.
>> what holds it inside the kernel for so long, or whether it makes way
>> too many syscalls for some reason.
>>
>> Not sure how many cores you have or what OVS version you're using,
>> but this might be one of the possible causes of being stuck in
>> syscalls for a long time on systems with a large number of cores and a
> 
> A lot
> echo "Threads/core: $(nproc --all)"
> Threads/core: 384
> 
> OVS/OVN version:
> 
> [root@pcacn001 openvswitch]# ovs-vsctl -V
> ovs-vsctl (Open vSwitch) 2.16.90
> DB Schema 8.3.0
> [root@pcacn001 openvswitch]# ovn-nbctl -V
> ovn-nbctl 21.09.0
> Open vSwitch Library 2.16.90
> DB Schema 5.32.1
> 
>> big number of ports:
>>    
>> https://github.com/openvswitch/ovs/commit/c0e053f6d
> 
> Thanks, I'll take a look

Yeah, with that many cores and ports, you should definitely try that patch.
It was also backported to stable branches.  On the current LTS branch it is
available since v2.17.3.
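
If it helps, the quickest way to check what a given node is actually
running is to ask the daemon itself (per above, on the 2.17 LTS branch
anything from v2.17.3 on has the backport):

    # Prints something like "ovs-vswitchd (Open vSwitch) 2.17.3".
    ovs-appctl version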

>> Best regards, Ilya Maximets.
> 
