On 1/29/24 20:13, Daryl Wang via discuss wrote:
> After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
> seeing new OVS crashes when setting QoS on ports. Both packages were taken
> from the Debian distribution
> (https://packages.debian.org/source/sid/openvswitch) we're running on. From
> the core dump we're seeing the following backtrace:
>
> # gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core
> [New LWP 67669]
> [New LWP 67682]
> [New LWP 67681]
> [New LWP 67671]
> [New LWP 67679]
> [New LWP 67680]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer
> -vsyslog:err -vfile:info --ml'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x00007fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> [Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]
> #0  0x00007fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x00007fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x0000560787952c7e in ovs_abort_valist (err_no=<optimized out>,
>     format=<optimized out>, args=args@entry=0x7ffd14c3bce0)
>     at ../lib/util.c:447
> #4  0x0000560787952d14 in ovs_abort (err_no=<optimized out>,
>     format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed")
>     at ../lib/util.c:439
> #5  0x000056078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8,
>     where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575")
>     at ../lib/ovs-thread.c:76
> #6  0x000056078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640,
>     current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575
> #7  0x00005607878c04f3 in netdev_get_speed (netdev=netdev@entry=0x56078934c640,
>     current=current@entry=0x7ffd14c3be64, max=0x7ffd14c3be1c, max@entry=0x0)
>     at ../lib/netdev.c:1175
> #8  0x0000560787968d67 in htb_parse_qdisc_details__
>     (netdev=netdev@entry=0x56078934c640, details=details@entry=0x56078934a880,
>     hc=hc@entry=0x7ffd14c3beb0) at ../lib/netdev-linux.c:4804
> #9  0x00005607879755da in htb_tc_install (details=0x56078934a880,
>     netdev=0x56078934c640) at ../lib/netdev-linux.c:4883
> #10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880)
>     at ../lib/netdev-linux.c:4875
> #11 0x0000560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640,
>     type=<optimized out>, details=0x56078934a880) at ../lib/netdev-linux.c:3054
> #12 0x0000560787814ea5 in iface_configure_qos (qos=0x56078934a780,
>     iface=0x560789349fd0) at ../vswitchd/bridge.c:4845
> #13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90)
>     at ../vswitchd/bridge.c:928
> #14 0x0000560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321
> #15 0x000056078780d205 in main (argc=<optimized out>, argv=<optimized out>)
>     at ../vswitchd/ovs-vswitchd.c:130
>
> A shell script to reproduce the issue is:
>
> #!/bin/sh
>
> apt-get install openvswitch-{common,switch}{,-dbgsym}
>
> # Don't need it running on the system
> systemctl stop openvswitch-switch
>
> set -e
>
> cleanup() {
>     ip link del veth0
>     rm /tmp/ovs/conf.db
> }
>
> trap cleanup EXIT
>
> # Set up our environment
>
> ip link add veth0 type veth peer veth1
>
> mkdir -p /tmp/ovs
>
> export OVS_RUNDIR=/tmp/ovs
> export OVS_LOGDIR=/tmp/ovs
> export OVS_DBDIR=/tmp/ovs
>
> /usr/share/openvswitch/scripts/ovs-ctl start
> ovs-vsctl add-br demo
> ovs-vsctl add-port demo veth1
>
> # Make it crash
>
> ovs-vsctl set Port veth1 qos=@qos \
>     -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \
>     -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
>     -- --id=@low create Queue other-config:priority=6 other-config:min-rate=0.05
>
> We built the reproduction script based on speculation that
> https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
> is related to the crash.
> Notably we don't seem to run into the problem when we pass in a specific
> maximum bandwidth instead of relying on the interface's maximum bandwidth.
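[Editor's note: the non-crashing variant mentioned above can be sketched as follows. This is not from the original report; the max-rate value and the explicit other-config:max-rate setting are illustrative assumptions about "passing in a specific maximum bandwidth".]

```shell
# Hypothetical variant of the repro command: an explicit maximum bandwidth
# on the QoS record (1 Gbps here, in bps -- value is illustrative) means the
# HTB code does not have to query the interface's link speed.
ovs-vsctl set Port veth1 qos=@qos \
    -- --id=@qos create QoS type=linux-htb \
       other-config:max-rate=1000000000 queues:1=@high queues:2=@low \
    -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
    -- --id=@low create Queue other-config:priority=6 other-config:min-rate=0.05
```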
Hi, Daryl.  Thanks for the report!

Looking at the stack trace, the root cause seems to be the following commit:
  https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6

It introduced the netdev_get_speed() call in the QoS functions. These
functions run under the netdev mutex, and netdev_get_speed() tries to take
the same mutex a second time. That fails with 'deadlock avoided' or
something like that.

From the commit I linked it's clear why this doesn't happen when max-rate
is specified: the code just doesn't go down that route.

To fix the issue, we need a lockless version of netdev_linux_get_speed()
that the QoS functions can call directly, without going through the generic
netdev API.

Adrian, since it was your original patch, could you please take a look at
the issue? Or, Daryl, maybe you want to fix it yourself?

Best regards, Ilya Maximets.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss