On 1/29/24 23:25, Ilya Maximets wrote:
On 1/29/24 20:13, Daryl Wang via discuss wrote:
After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
seeing new OVS crashes when setting QoS on ports. Both packages were taken
from the Debian distribution (https://packages.debian.org/source/sid/openvswitch)
we're running on. From the core dump we're seeing the following backtrace:

# gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core
[New LWP 67669]
[New LWP 67682]
[New LWP 67681]
[New LWP 67671]
[New LWP 67679]
[New LWP 67680]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer 
-vsyslog:err -vfile:info --ml'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]
#0  0x00007fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000560787952c7e in ovs_abort_valist (err_no=<optimized out>, 
format=<optimized out>, args=args@entry=0x7ffd14c3bce0) at ../lib/util.c:447
#4  0x0000560787952d14 in ovs_abort (err_no=<optimized out>, 
format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed") at ../lib/util.c:439
#5  0x000056078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8, 
where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575") at 
../lib/ovs-thread.c:76
#6  0x000056078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640, 
current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575
#7  0x00005607878c04f3 in netdev_get_speed (netdev=netdev@entry=0x56078934c640, 
current=current@entry=0x7ffd14c3be64, max=0x7ffd14c3be1c, max@entry=0x0) at 
../lib/netdev.c:1175
#8  0x0000560787968d67 in htb_parse_qdisc_details__ 
(netdev=netdev@entry=0x56078934c640, details=details@entry=0x56078934a880, 
hc=hc@entry=0x7ffd14c3beb0) at ../lib/netdev-linux.c:4804
#9  0x00005607879755da in htb_tc_install (details=0x56078934a880, 
netdev=0x56078934c640) at ../lib/netdev-linux.c:4883
#10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880) at 
../lib/netdev-linux.c:4875
#11 0x0000560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640, 
type=<optimized out>, details=0x56078934a880) at ../lib/netdev-linux.c:3054
#12 0x0000560787814ea5 in iface_configure_qos (qos=0x56078934a780, 
iface=0x560789349fd0) at ../vswitchd/bridge.c:4845
#13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90) at 
../vswitchd/bridge.c:928
#14 0x0000560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321
#15 0x000056078780d205 in main (argc=<optimized out>, argv=<optimized out>) at 
../vswitchd/ovs-vswitchd.c:130

A shell script to reproduce the issue is:

#!/bin/sh

apt-get install openvswitch-{common,switch}{,-dbgsym}

# Don't need it running on the system
systemctl stop openvswitch-switch

set -e

cleanup() {
   ip link del veth0
   rm /tmp/ovs/conf.db
}

trap cleanup EXIT

# Setup our environment

ip link add veth0 type veth peer veth1

mkdir -p /tmp/ovs

export OVS_RUNDIR=/tmp/ovs
export OVS_LOGDIR=/tmp/ovs
export OVS_DBDIR=/tmp/ovs

/usr/share/openvswitch/scripts/ovs-ctl start
ovs-vsctl add-br demo
ovs-vsctl add-port demo veth1

# Make it crash

ovs-vsctl set Port veth1 qos=@qos \
   -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \
   -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
   -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05

We built the reproduction script based on speculation that
https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
is related to the crash. Notably, we don't seem to run into the problem when
we pass in a specific maximum bandwidth instead of relying on the interface's
maximum bandwidth; the variant we use is shown below.
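
For concreteness, this variant of the command above does not crash for us.
(The max-rate value is arbitrary; 1 Gbps here is just an example. With an
explicit max-rate, OVS apparently never needs to query the interface speed.)

ovs-vsctl set Port veth1 qos=@qos \
   -- --id=@qos create QoS type=linux-htb other-config:max-rate=1000000000 \
      queues:1=@high queues:2=@low \
   -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
   -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05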

Hi, Daryl.  Thanks for the report!

Looking at the stack trace, the root cause seems to be the following commit:
   
https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6

It introduced a netdev_get_speed() call in the QoS functions.  Those
functions already run under the netdev mutex, and netdev_get_speed()
tries to take the same mutex a second time.  OVS mutexes are
error-checking, so the second lock fails with EDEADLK ('Resource
deadlock avoided') and ovs_mutex_lock_at() aborts, which is the
SIGABRT in your backtrace.
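
To make the failure mode concrete, here is a standalone pthreads sketch
(not OVS code; OVS sets its mutexes up as error-checking via
ovs_mutex_init()):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mutex, &attr);

    /* First lock, e.g. the QoS code taking netdev->mutex. */
    pthread_mutex_lock(&mutex);

    /* Second lock from the same thread, e.g. netdev_linux_get_speed().
     * An error-checking mutex returns EDEADLK instead of blocking
     * forever; ovs_mutex_lock_at() treats any nonzero result as fatal
     * and calls ovs_abort(), hence the SIGABRT. */
    int error = pthread_mutex_lock(&mutex);
    printf("second lock: %s\n", error ? strerror(error) : "succeeded");

    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    pthread_mutexattr_destroy(&attr);
    return 0;
}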

From the commit I linked it's clear why it's not happening if the max-rate
is specified.  The code just doesn't go that route.

To fix the issue, we need a lockless version of netdev_linux_get_speed()
that the QoS functions can call directly, without going through the
generic netdev API.  A rough sketch of the pattern is below.
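
Something like this (illustrative names in a minimal standalone sketch,
not an actual patch; the usual idiom is a *__locked() helper that assumes
the caller already holds the mutex, with the public function reduced to a
lock/call/unlock wrapper):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t dev_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned int dev_speed_mbps = 10000; /* Stand-in for ethtool data. */

/* Does the real work.  Caller must hold dev_mutex. */
static unsigned int
get_speed__locked(void)
{
    return dev_speed_mbps;
}

/* Public entry point for callers that do not hold the lock. */
static unsigned int
get_speed(void)
{
    pthread_mutex_lock(&dev_mutex);
    unsigned int speed = get_speed__locked();
    pthread_mutex_unlock(&dev_mutex);
    return speed;
}

/* The QoS path already holds dev_mutex, so it must use the helper
 * directly; calling get_speed() here would relock the same mutex. */
static void
set_qos(void)
{
    pthread_mutex_lock(&dev_mutex);
    printf("max-rate from link speed: %u Mbps\n", get_speed__locked());
    pthread_mutex_unlock(&dev_mutex);
}

int
main(void)
{
    printf("speed: %u Mbps\n", get_speed());
    set_qos();
    return 0;
}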

Adrian, since it was your original patch, could you please take a look
at the issue?


Sure, I'll take a look at it.


Or Daryl, maybe you want to fix it yourself?

Best regards, Ilya Maximets.


--
Adrián Moreno
