On 1/29/24 20:13, Daryl Wang via discuss wrote:
> After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
> seeing new OVS crashes when setting QoS on ports. Both packages were taken
> from the Debian distribution
> (https://packages.debian.org/source/sid/openvswitch) we're running on.
> From the core dump we're seeing the following backtrace:
> 
> # gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core 
> [NewLWP 67669]
> [NewLWP 67682]
> [NewLWP 67681]
> [NewLWP 67671]
> [NewLWP 67679]
> [NewLWP 67680]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer
> -vsyslog:err -vfile:info --ml'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x00007fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> [Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]
> #0  0x00007fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x00007fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x0000560787952c7e in ovs_abort_valist (err_no=<optimized out>, 
> format=<optimized out>, args=args@entry=0x7ffd14c3bce0) at ../lib/util.c:447
> #4  0x0000560787952d14 in ovs_abort (err_no=<optimized out>, 
> format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed") at 
> ../lib/util.c:439
> #5  0x000056078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8, 
> where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575") at 
> ../lib/ovs-thread.c:76
> #6  0x000056078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640, 
> current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575
> #7  0x00005607878c04f3 in netdev_get_speed 
> (netdev=netdev@entry=0x56078934c640, current=current@entry=0x7ffd14c3be64, 
> max=0x7ffd14c3be1c, max@entry=0x0) at ../lib/netdev.c:1175
> #8  0x0000560787968d67 in htb_parse_qdisc_details__ 
> (netdev=netdev@entry=0x56078934c640, details=details@entry=0x56078934a880, 
> hc=hc@entry=0x7ffd14c3beb0) at ../lib/netdev-linux.c:4804
> #9  0x00005607879755da in htb_tc_install (details=0x56078934a880, 
> netdev=0x56078934c640) at ../lib/netdev-linux.c:4883
> #10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880) at 
> ../lib/netdev-linux.c:4875
> #11 0x0000560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640, 
> type=<optimized out>, details=0x56078934a880) at ../lib/netdev-linux.c:3054
> #12 0x0000560787814ea5 in iface_configure_qos (qos=0x56078934a780, 
> iface=0x560789349fd0) at ../vswitchd/bridge.c:4845
> #13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90) at 
> ../vswitchd/bridge.c:928
> #14 0x0000560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321
> #15 0x000056078780d205 in main (argc=<optimized out>, argv=<optimized out>) 
> at ../vswitchd/ovs-vswitchd.c:130
> 
> A shell script to reproduce the issue is:
> 
> #!/bin/sh
> 
> apt-get install openvswitch-{common,switch}{,-dbgsym}
> 
> # Don't need it running on the system
> systemctl stop openvswitch-switch
> 
> set -e
> 
> cleanup() {
>   ip link del veth0
>   rm /tmp/ovs/conf.db
> }
> 
> trap cleanup EXIT
> 
> # Setup our environment
> 
> ip link add veth0 type veth peer veth1
> 
> mkdir -p /tmp/ovs
> 
> export OVS_RUNDIR=/tmp/ovs
> export OVS_LOGDIR=/tmp/ovs
> export OVS_DBDIR=/tmp/ovs
> 
> /usr/share/openvswitch/scripts/ovs-ctl start
> ovs-vsctl add-br demo
> ovs-vsctl add-port demo veth1
> 
> # Make it crash
> 
> ovs-vsctl set Port veth1 qos=@qos \
>   -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \
>   -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
>   -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05
> 
> We built the reproduction script based on speculation that
> https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
> is related to the crash. Notably we don't seem to run into the problem when
> we pass in a specific maximum bandwidth instead of relying on the interface's
> maximum bandwidth.

Hi, Daryl.  Thanks for the report!

Looking at the stack trace, the root cause seems to be the following commit:
  https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6

It introduced the netdev_get_speed() call into the QoS functions.
Those functions already run under the netdev mutex, and netdev_get_speed()
ends up trying to take the same mutex a second time.  OVS mutexes are
error-checking, so the second lock attempt fails with EDEADLK ('Resource
deadlock avoided'), and ovs_mutex_lock_at() turns that into the abort you
see in the backtrace.
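
To make the failure mode concrete, here is a minimal standalone example (my
own illustration, not OVS code) of what happens when an error-checking
pthread mutex is locked twice by the same thread; this is the error that
ovs_mutex_lock_at() turns into the abort:

  /* Standalone illustration: re-locking a PTHREAD_MUTEX_ERRORCHECK mutex
   * from the owning thread fails with EDEADLK instead of hanging. */
  #include <errno.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <string.h>

  int
  main(void)
  {
      pthread_mutexattr_t attr;
      pthread_mutex_t mutex;
      int error;

      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
      pthread_mutex_init(&mutex, &attr);

      pthread_mutex_lock(&mutex);          /* First lock succeeds. */
      error = pthread_mutex_lock(&mutex);  /* Second lock, same thread. */

      /* Prints "Resource deadlock avoided" and confirms EDEADLK. */
      printf("second lock: %s (EDEADLK: %d)\n",
             strerror(error), error == EDEADLK);

      pthread_mutex_unlock(&mutex);
      pthread_mutex_destroy(&mutex);
      pthread_mutexattr_destroy(&attr);
      return 0;
  }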

From the commit I linked it's also clear why this doesn't happen when a
max-rate is specified: in that case the code never reaches the
netdev_get_speed() call.
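
Roughly, the shape of that branch (paraphrased from the linked commit, not
the exact source) is:

  /* Paraphrased sketch of htb_parse_qdisc_details__(): the link speed is
   * only queried when no max-rate was supplied in the QoS details. */
  hc->max_rate = smap_get_ullong(details, "max-rate", 0) / 8;
  if (!hc->max_rate) {
      uint32_t current_speed, max_speed;

      /* This runs with the netdev mutex already held by the QoS path, so
       * re-taking it inside netdev_linux_get_speed() triggers the abort. */
      netdev_get_speed(netdev, &current_speed, &max_speed);
      /* ... derive max_rate from current_speed, or fall back to a default. */
  }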

To fix the issue, we need a version of netdev_linux_get_speed() that doesn't
take the mutex itself, and we need to call it directly from the QoS functions
instead of going through the generic netdev API.
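
Something along these lines (just a sketch of the direction, not a tested
patch; names are illustrative):

  /* Variant that expects the caller to already hold netdev->mutex. */
  static int
  netdev_linux_get_speed_locked(struct netdev_linux *netdev,
                                uint32_t *current, uint32_t *max)
      OVS_REQUIRES(netdev->mutex)
  {
      /* ... the existing ethtool / feature lookup from
       * netdev_linux_get_speed(), minus the lock/unlock pair ... */
  }

  static int
  netdev_linux_get_speed(const struct netdev *netdev_,
                         uint32_t *current, uint32_t *max)
  {
      struct netdev_linux *netdev = netdev_linux_cast(netdev_);
      int error;

      ovs_mutex_lock(&netdev->mutex);
      error = netdev_linux_get_speed_locked(netdev, current, max);
      ovs_mutex_unlock(&netdev->mutex);

      return error;
  }

htb_parse_qdisc_details__() and the other QoS helpers would then call the
_locked() variant directly, since they already hold the mutex at that point.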

Adrian, since it was your original patch, could you please take a look
at the issue?

Or Daryl, maybe you want to fix it yourself?

Best regards, Ilya Maximets.