Hi,

We are moving to the recently released ovs-2.4 and are seeing random crashes during vswitchd's periodic updates into ovsdb. Some preliminary analysis is below. The crash typically happens after we have successfully brought up OVS, downloaded some configuration, and started initiating traffic, but with no particular pattern. The ovs-vswitchd main thread crashes while periodically pushing statistics and Controller table updates into ovsdb. The debugging below is from a case where it is writing a Controller table update.

(gdb) bt
#0  0x00007f9532a052b6 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00000000004b7b42 in atom_arrays_compare_3way (a=0xa9b898,
    b=0x7fff8a0359e0, type=0x896df0) at lib/ovsdb-data.c:1582
#2  ovsdb_datum_compare_3way (a=0xa9b898, b=0x7fff8a0359e0, type=0x896df0)
    at lib/ovsdb-data.c:1616
#3  0x00000000004b7b69 in ovsdb_datum_equals (a=<value optimized out>,
    b=<value optimized out>, type=<value optimized out>)
    at lib/ovsdb-data.c:1596
#4  0x00000000004bb36e in ovsdb_idl_txn_write__ (row_=0xa9b5b0,
    column=0x896de8, datum=0x7fff8a0359e0, owns_datum=true)
    at lib/ovsdb-idl.c:2087
#5  0x00000000004f7d24 in ovsrec_controller_set_status (row=0xa9b5b0,
    status=0xc8a308) at lib/vswitch-idl.c:5254
#6  0x0000000000411d5b in refresh_controller_status ()
    at vswitchd/bridge.c:2741
#7  run_stats_update () at vswitchd/bridge.c:2801
#8  bridge_run () at vswitchd/bridge.c:3073
#9  0x00000000004121ad in main (argc=10, argv=0x7fff8a035c38)
    at vswitchd/ovs-vswitchd.c:131
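
For context, the path in frames #5-#7 boils down to roughly the following. This is a simplified sketch paraphrasing vswitchd/bridge.c and the generated lib/vswitch-idl.c, not a verbatim copy; the helper name is ours and the smap values are illustrative (taken from the status column in the ovsdb-client dump further below).

    /* Simplified sketch of frames #5-#7 (paraphrased, not verbatim OVS code). */
    #include "smap.h"
    #include "vswitch-idl.h"

    static void
    refresh_one_controller_status(const struct ovsrec_controller *cfg)
    {
        struct smap status = SMAP_INITIALIZER(&status);

        /* bridge.c fills these in from ofproto; the values here are
         * illustrative only. */
        smap_add(&status, "last_error", "Connection timed out");
        smap_add_format(&status, "sec_since_connect", "%d", 288);
        smap_add_format(&status, "sec_since_disconnect", "%d", 296);
        smap_add(&status, "state", "ACTIVE");

        /* The generated setter converts the smap into an ovsdb_datum of
         * string pairs and calls ovsdb_idl_txn_write(), which first compares
         * the new datum against the datum already cached in the IDL row
         * (frames #3/#4); that comparison is where we fault. */
        ovsrec_controller_set_status(cfg, &status);

        smap_destroy(&status);
    }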

Looking a bit deeper, we find that one of the atoms in the array being read from the in-core IDL is corrupt.

(gdb) frame 1
#1  0x00000000004b7b42 in atom_arrays_compare_3way (a=0xa9b898,
    b=0x7fff8a0359e0, type=0x896df0) at lib/ovsdb-data.c:1582
1582            int cmp = ovsdb_atom_compare_3way(&a[i], &b[i], type);
(gdb) p a
$1 = (const union ovsdb_atom *) 0xc38f10
(gdb) p a[0]
$2 = {integer = 8, real = 3.9525251667299724e-323, boolean = 8,
  string = 0x8 <Address 0x8 out of bounds>, uuid = {parts = {8, 0, 13182880,
      0}}}
(gdb) p a[1]
$3 = {integer = 11240608, real = 5.5535982511682826e-317, boolean = 160,
  string = 0xab84a0 "288", uuid = {parts = {11240608, 0, 13183312, 0}}}
(gdb) p a[2]
$4 = {integer = 10932464, real = 5.4013548867961776e-317, boolean = 240,
  string = 0xa6d0f0 "296", uuid = {parts = {10932464, 0, 13183680, 0}}}
(gdb) p a[3]
$5 = {integer = 11271008, real = 5.5686178468018565e-317, boolean = 96,
  string = 0xabfb60 "ACTIVE", uuid = {parts = {11271008, 0, 13184096, 0}}}
(gdb) p a[4]
$6 = {integer = 13126128, real = 6.4851689077148698e-317, boolean = 240,
  string = 0xc849f0 "\300", uuid = {parts = {13126128, 0, 33, 0}}}

The above suggests we are tripping over a bad pointer, viz. a[0].string, which is curiously the only bad value in the array.

(gdb) p type
$7 = OVSDB_TYPE_STRING
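
To spell out why that faults in __strcmp_sse42(): each array element is a union ovsdb_atom, and for OVSDB_TYPE_STRING the comparison dereferences the 'string' member on both sides. Below is a trimmed-down paraphrase of the types and code in frames #0-#2 (lib/ovsdb-data.[ch]); it is not a verbatim copy, and the other atomic types are elided.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct uuid { uint32_t parts[4]; };

    union ovsdb_atom {              /* as in lib/ovsdb-data.h */
        int64_t integer;
        double real;
        bool boolean;
        char *string;
        struct uuid uuid;
    };

    enum ovsdb_atomic_type { OVSDB_TYPE_STRING /* , ... others elided */ };

    static int
    ovsdb_atom_compare_3way(const union ovsdb_atom *a,
                            const union ovsdb_atom *b,
                            enum ovsdb_atomic_type type)
    {
        switch (type) {
        case OVSDB_TYPE_STRING:
            /* Both 'string' members are dereferenced here, so the wild
             * pointer 0x8 in a[0].string is what takes the fault inside
             * __strcmp_sse42(). */
            return strcmp(a->string, b->string);
        default:
            return 0;
        }
    }

    static int
    atom_arrays_compare_3way(const union ovsdb_atom *a,
                             const union ovsdb_atom *b,
                             enum ovsdb_atomic_type type, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int cmp = ovsdb_atom_compare_3way(&a[i], &b[i], type); /* 1582 */
            if (cmp) {
                return cmp;
            }
        }
        return 0;
    }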

However, all the other values appear to be good, including the new ones that are about to be written (the b array below).

(gdb) p b[0]
$9 = {integer = 12475120, real = 6.1635282197470516e-317, boolean = 240,
string = 0xbe5af0 "Connection timed out", uuid = {parts = {12475120, 0, 44,
      1}}}
(gdb) p b[1]
$10 = {integer = 12475056, real = 6.1634965995457177e-317, boolean = 176,
  string = 0xbe5ab0 "293", uuid = {parts = {12475056, 0, 34, 1}}}
(gdb) p b[2]
$11 = {integer = 11010656, real = 5.4399868677757963e-317, boolean = 96,
  string = 0xa80260 "301", uuid = {parts = {11010656, 0, 851889880, 32661}}}
(gdb) p b[3]
$12 = {integer = 11010720, real = 5.4400184879771301e-317, boolean = 160,
  string = 0xa802a0 "ACTIVE", uuid = {parts = {11010720, 0, 27, 1}}}

Mapping the row's UUID (below) back to the database table, the table itself appears to be sane.

(gdb) frame 5
#5  0x00000000004f7d24 in ovsrec_controller_set_status (row=0xa9b5b0,
    status=0xc8a308) at lib/vswitch-idl.c:5254
5254        ovsdb_idl_txn_write(&row->header_,

(gdb) p/x row->header_
$20 = {hmap_node = {hash = 0xea5d6304, next = 0x0}, uuid = {parts = {
      0xea5d6304, 0x328a492f, 0xbabbae6c, 0xa26600f1}}, src_arcs = {
    prev = 0xa9b5d0, next = 0xa9b5d0}, dst_arcs = {prev = 0xa9fd40,
    next = 0xa9fd40}, table = 0xa584f0, old = 0xa9b730, new = 0xa9b730,
  prereqs = 0x0, written = 0x0, txn_node = {hash = 0xea5d6304, next = 0x1}}
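
As an aside on the UUID mapping itself: the four 32-bit 'parts' in the row->header_ dump above print as the canonical UUID string as shown below. The helper is ours, written only to show the bit layout; it is not OVS code.

    #include <stdint.h>
    #include <stdio.h>

    struct uuid { uint32_t parts[4]; };

    static void
    print_uuid(const struct uuid *u)
    {
        printf("%08x-%04x-%04x-%04x-%04x%08x\n",
               (unsigned) u->parts[0],
               (unsigned) (u->parts[1] >> 16), (unsigned) (u->parts[1] & 0xffff),
               (unsigned) (u->parts[2] >> 16), (unsigned) (u->parts[2] & 0xffff),
               (unsigned) u->parts[3]);
    }

    int
    main(void)
    {
        /* parts from the row->header_ dump above */
        struct uuid u = { { 0xea5d6304, 0x328a492f, 0xbabbae6c, 0xa26600f1 } };
        print_uuid(&u);    /* prints ea5d6304-328a-492f-babb-ae6ca26600f1 */
        return 0;
    }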

Fishing for UUID ea5d6304 in the ovsdb-client dump:

[root@ovs-1 ~]# ovsdb-client dump Controller
Controller table
_uuid config_role connection_mode controller_burst_limit controller_rate_limit enable_async_messages external_ids inactivity_probe is_connected local_gateway local_ip local_netmask max_backoff name other_config role status target
9b64de9b-d55d-4065-b534-4708c18b780a master [] [] [] [] {} 5000 true [] [] [] [] "ctrl1" {} master {last_error="No route to host", sec_since_connect="297", sec_since_disconnect="312", state=ACTIVE} "tcp:10.10.13.7:6633"
ea5d6304-328a-492f-babb-ae6ca26600f1 slave [] [] [] [] {} 5000 true [] [] [] [] "ctrl2" {} slave {last_error="Connection timed out", sec_since_connect="288", sec_since_disconnect="296", state=ACTIVE} "tcp:10.10.15.9:6633"


Can anyone please take a look and let me know whether this has been seen before, or whether there is a patch or fix that can address it?

Thanks,
Sabya