On 28/10/2021 18:53, Numan Siddique wrote:
On Thu, Oct 28, 2021 at 12:21 PM Brendan Doyle <[email protected]> wrote:
I'm also hoping that this is the reason for the frequent SEGVs we see; the stack trace looks like this:
Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock
-vconsole:emer -vsyslog:err -vfi'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f69b4477e84 in classifier_lookup__ (cls=0x55edff13e8b8,
version=version@entry=72104, flow=flow@entry=0x7f699abb2490,
wc=wc@entry=0x7f699abd4730,
allow_conjunctive_matches=allow_conjunctive_matches@entry=true)
at lib/classifier.c:941
Full trace attached (almost 8000 frames deep!). When ovs-vswitchd SEGVs like this, systemd
can't restart it, as the DB file is too large:
IMO ovs-vswitchd should not crash if the ovn-controller does something wrong.
Seems like a bug in vswitchd.
I agree, and will file a bug. The core is quite large, so I don't know if I can send it
to the bug email. It looks to me like it might have been a stack overrun.
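If the core itself is too big to attach, I could extract just the text backtrace from it with gdb and send that instead; something along these lines should do it (the binary and core paths here are only examples):
# gdb /usr/sbin/ovs-vswitchd /path/to/core -batch -ex 'thread apply all bt' > ovs-vswitchd-bt.txt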
Thanks
Numan
ls -lh /etc/openvswitch/conf.db
-rw-r--r--. 1 root root 4.2G Oct 28 11:25 /etc/openvswitch/conf.db
It just times out.
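If it would help to get the service started at all in the meantime, I assume the database could be compacted offline first, something like the following (just a guess at a workaround; it obviously does not address the underlying churn, and the openvswitch unit name is an assumption):
# systemctl stop openvswitch
# ovsdb-tool compact /etc/openvswitch/conf.db
# systemctl start openvswitch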
Anyway, we'll try the patch and hopefully that will solve the problem.
On 28/10/2021 16:41, Numan Siddique wrote:
On Thu, Oct 28, 2021 at 5:20 AM Brendan Doyle <[email protected]> wrote:
Numan,
Just wondering if you got a chance to look at those logs?
I looked into the logs, and as I had mentioned earlier you need this
fix -
https://github.com/ovn-org/ovn/commit/e7788554a7f5e824fc0d8afc6cbf20e94fe4245f
Please let me know if you still see this issue with the latest OVN or
with the version of OVN which has this fix.
This fix is available from OVN 21.03 and onwards.
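If you are not sure which build is actually running, something like the following should tell you (assuming the standard binaries are installed):
# ovn-controller --version   # on each chassis
# ovn-nbctl --version        # on the central node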
Thanks
Numan
Thanks
Brendan
On 27/10/2021 11:25, Brendan Doyle wrote:
Hi,
I finally got some debug logs, truncated after the failure occurs; the truncated entries
are just repeated updates of the same entry.
To shed some more light on this: it seems to be a timing issue. The test being run
involves creating a number of Logical Switches (LS), Logical Routers (LR) and Distributed
Router port gateways (DR), and then immediately deleting them, with the last created DR
being deleted first. Our CMS is using the ovsdbapp Python lib to do this.
So it occurs to me that perhaps the objects get created in the NB DB, but before they have
been propagated to the SB DB and to the HV chassis we get the delete, and this causes
updates to be sent to the chassis for a logical port that does not exist? Just a
hypothesis.
ovn-nbctl has synchronization flags (--wait) to guard against such behavior; I wonder,
does ovsdbapp?
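For reference, the ovn-nbctl form of that synchronization would be something like the following (just a sketch; lr_vcn9319435 is only an example name from our config):
# ovn-nbctl --wait=sb lr-del lr_vcn9319435   # return only once the SB DB reflects the change
# ovn-nbctl --wait=hv sync                   # block until all hypervisors have caught up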
In any case, the test fails (we see a runaway conf.db) pretty regularly, but not every
time. The failure is always observed on the delete operations. If I put a delay after
create and before delete, then we don't see the failure.
If anyone can shed light on this from the logs would be much appreciated.
Thanks
Brendan
On 26/10/2021 17:11, Brendan Doyle wrote:
On 26/10/2021 15:50, Numan Siddique wrote:
On Tue, Oct 26, 2021 at 8:20 AM Brendan Doyle <[email protected]> wrote:
Hi,
So what is very odd here is that I have used ovn-nbctl to delete the NB config, so:
# ovn-nbctl show
# ovn-sbctl lflow-list
Yet I still see /etc/openvswitch/conf.db growing with updates for
Logical switch ports that no longer exist!
"],["ct-zone-ln-ls_vcn9195577_external_ugw","220"],["ct-zone-ln-ls_vcn9206002_external_igw","110"],["ct-zone-ln-ls_vcn9210052_external_igw","110"],["ct-zone-ln-ls_vcn9232395_external_ugw","75"],["ct-zone-ln-ls_vcn9236987_external_igw","110"],["ct-zone-ln-ls_vcn9236987_external_ugw","78"],["ct-zone-ln-ls_vcn9255861_external_igw","118"],["ct-zone-ln-ls_vcn9255861_external_ugw","100"],["ct-zone-ln-ls_vcn9319435_external_igw","87"],["ct-zone-ln-ls_vcn9352502_external_igw","40"],["ct-zone-ln-ls_vcn9402504_external_ugw","99"],["ct-zone-ln-ls_vcn9403404_external_igw","133"],["ct-zone-ln-ls_vcn9403404_external_ugw","114"],["ct-zone-ln-ls_vcn9461566_external_ugw","191"],["ct-zone-ln-ls_vcn9480000_external_igw","254"],["ct-zone-ln-ls_vcn9480000_external_ugw","236"],["ct-zone-ln-ls_vcn9492134_external_igw","262"],["ct-zone-ln-ls_vcn9523503_external_igw","207"],["ct-zone-ln-ls_vcn9542102_external_igw","133"],["ct-zone-ln-ls_vcn9542102_external_ugw","115"],["ct-zone-ln-ls_vcn9559658_external
_igw","125"],["ct-zone-ln-ls_vcn9559658_external_ugw","78"],["ct-zone-ln-ls_vcn9594034_external_igw","49"],["ct-zone-ln-ls_vcn9619021_external_igw","133"],["ct-zone-ln-ls_vcn9634773_external_igw","292"],["ct-zone-ln-ls_vcn9649169_external_igw","132"],["ct-zone-ln-ls_vcn9649169_external_ugw","110"],["ct-zone-ln-ls_vcn9661290_external_ugw","78"],["ct-zone-ln-ls_vcn9734192_external_ugw","114"],["ct-zone-ln-ls_vcn9774252_external_igw","262"],["ct-zone-ln-ls_vcn9796262_external_igw","72"],["ct-zone-ln-ls_vcn9796262_external_ugw","54"],["ct-zone-ln-ls_vcn9805903_external_igw","147"],["ct-zone-ln-ls_vcn9805903_external_ugw","126"],["ct-zone-ln-ls_vcn9809895_external_igw","246"],["ct-zone-ln-ls_vcn9812576_external_ugw","78"],["ct-zone-ln-ls_vcn9834728_external_igw","110"],["ct-zone-ln-ls_vcn9886683_external_ugw","114"],["ct-zone-ln-ls_vcn9903419_external_ugw","235"],["ct-zone-ln-ls_vcn9917510_external_igw","56"],["ct-zone-ln-ls_vcn9917510_external_ugw","38"]]]}},"_comment":"ovn-controller:
modifying OVS tunnels 'pcacn001'"}
That is a shortened version of one entry. Could it be that switch ports must be deleted
before deleting the switch? I was under the impression that once a switch is deleted
its ports get deleted too?
Yes. If you delete the switch, the switch ports get deleted too.
After deleting the logical switch (or switch ports), do you see them deleted by
ovn-northd in the SB DB?
Run - ovn-sbctl list port_binding <deleted_port>
or/and
ovn-sbctl list datapath_binding <deleted_lswitch>
I'd suggest you enable jsonrpc debug in ovn-controller and see what's happening.
It would be helpful if you can share the ovn-controller debug logs.
ovn-appctl -t ovn-controller vlog/set jsonrpc:dbg
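To make sure the debug output also ends up in the log file, and to turn it back down once you have the logs, the same interface can be used (this is my recollection of the vlog syntax, so please double-check with vlog/list):
# ovn-appctl -t ovn-controller vlog/set jsonrpc:file:dbg
# ovn-appctl -t ovn-controller vlog/list
# ovn-appctl -t ovn-controller vlog/set jsonrpc:info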
So in my test I create a simple network and then delete it, so the NB DB and SB DB
are empty:
# ovn-sbctl list port_binding
# ovn-sbctl list datapath_binding
#
The network has a number of LSs and LRs and two Distributed Router (DR) ports (on
separate LRs). When I create just one DR all seems fine, but when I add the second into
the mix I get a runaway openvswitch/conf.db, but NOT on all chassis. I have 4 chassis
that I can schedule the DR ports to. In this latest test I observed the runaway conf.db
on pcacn003 & pcacn005. The logs are too large to send in email; is there an FTP server
that I can upload to?
I will redo with debug enabled and collect updated logs. The conf.db on both pcacn003 &
pcacn005 is several GBs.
The only way to recover is to stop the OVS/OVN procs, then delete
/etc/openvswitch/conf.db
and restart them.
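In other words, roughly the following (a sketch; the openvswitch and ovn-controller unit names are assumptions, use whatever they are called on your hosts):
# systemctl stop ovn-controller openvswitch
# rm /etc/openvswitch/conf.db
# systemctl start openvswitch ovn-controller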
Brendan
Thanks
Numan
switch 712757c3-2481-4f8b-940c-05dc13ce37a5 (ls_vcn9319435_external_ugw)
    port ls_vcn9319435_external_ugw-lr_vcn9319435
        type: router
        router-port: lr_vcn9319435-ls_vcn9319435_external_ugw
    port ln-ls_vcn9319435_external_ugw
        type: localnet
        addresses: ["unknown"]
router 80c281af-319b-416b-8a17-0ce7b8901bb1 (lr_vcn9319435)
    port lr_vcn9319435-ls_vcn9319435_external_ugw
        mac: "00:13:97:88:31:90"
        networks: ["253.255.80.4/16"]
        gateway chassis: [pcacn002 pcacn003 pcacn001]
    port lr_vcn9319435-lsb_vcn9319435
        mac: "00:13:97:d4:26:ec"
        networks: ["253.255.29.2/25"]
    nat 6c87050f-cd27-423e-815e-deda74bd9bc6
        external ip: "253.255.80.4"
        logical ip: "10.221.0.0/16"
        type: "snat"
Does each port have to be deleted, or is it OK to just delete the switch and router?
Brendan
On 25/10/2021 16:10, Brendan Doyle wrote:
On 25/10/2021 15:08, Numan Siddique wrote:
On Fri, Oct 22, 2021 at 9:30 AM Brendan Doyle
<[email protected]> wrote:
Hi,
Looking at /etc/openvswitch/conf.db I see it getting very large:
[root@pcacn001 ~]# ls -l /etc/openvswitch/conf.db
-rw-r--r--. 1 root root 6069248828 Oct 22 11:55
/etc/openvswitch/conf.db
And it has lots and lots of (mostly) "ovn-controller: modifying OVS tunnels" update
entries, like the one below.
What are these? It does not seem normal.
OVSDB JSON 4687 00e8788dd5d9af2aac5ca7724759017c52ddd580
{"_date":1634903752117,"Bridge":{"745726c4-0451-4f52-a52b-1f9c5e85c703":{"external_ids":["map",[["ct-zone-0dca7370-1c18-4117-84e4-a72f277ccc6c_dnat","4"],["ct-zone-0dca7370-1c18-4117-84e4-a72f277ccc6c_snat","1"],["ct-zone-11637f38-8725-4c77-adfe-f9c4c804ae8c_dnat","4"],["ct-zone-11637f38-8725-4c77-adfe-f9c4c804ae8c_snat","5"],["ct-zone-1de487d1-f3a5-4b15-bae4-aa8cf794fcf9_dnat","17"],["ct-zone-1de487d1-f3a5-4b15-bae4-aa8cf794fcf9_snat","7"],["ct-zone-22c71c2a-0e59-41cc-a2da-91d3c7276c11_dnat","9"],["ct-zone-22c71c2a-0e59-41cc-a2da-91d3c7276c11_snat","10"],["ct-zone-3228b120-4192-476b-ab67-51fb45e786d6_dnat","3"],["ct-zone-3228b120-4192-476b-ab67-51fb45e786d6_snat","4"],["ct-zone-3753ff1a-d0cf-48e4-b06a-640f0467d202_dnat","19"],["ct-zone-3753ff1a-d0cf-48e4-b06a-640f0467d202_snat","18"],["ct-zone-3c1c02f4-31c9-45d4-9c63-54ad2122bb15_dnat","10"],["ct-zone-3c1c02f4-31c9-45d4-9c63-54ad2122bb15_snat","16"],["ct-zone-423896cb-5573-4c54-b6e2-38f192eacae3_dnat","9"],["ct-zone-423896cb-55
73
-4c54-b6e2-38f192eacae3_snat","12"],["ct-zone-46b7b247-31a7-4fbb-88b9-0f3db042409c_dnat","10"],["ct-zone-46b7b247-31a7-4fbb-88b9-0f3db042409c_snat","11"],["ct-zone-51376927-fca0-49b3-b0ba-1aa22153b366_dnat","2"],["ct-zone-51376927-fca0-49b3-b0ba-1aa22153b366_snat","5"],["ct-zone-58033baa-916d-47d4-bcf0-d95f7fb1f861_dnat","18"],["ct-zone-58033baa-916d-47d4-bcf0-d95f7fb1f861_snat","3"],["ct-zone-5f92f974-f0dc-4820-bb43-a14cc16d851f_dnat","12"],["ct-zone-5f92f974-f0dc-4820-bb43-a14cc16d851f_snat","11"],["ct-zone-87055326-0535-4042-a0ff-bf0e9f494433_dnat","10"],["ct-zone-87055326-0535-4042-a0ff-bf0e9f494433_snat","12"],["ct-zone-8a840bfe-118f-4041-ac72-0637d6373ffc_dnat","1"],["ct-zone-8a840bfe-118f-4041-ac72-0637d6373ffc_snat","11"],["ct-zone-8fff9b0b-0fd6-42f9-ab77-e9f1475a5d82_dnat","2"],["ct-zone-8fff9b0b-0fd6-42f9-ab77-e9f1475a5d82_snat","13"],["ct-zone-913c36a1-f987-4084-9119-f279b317c72f_dnat","11"],["ct-zone-913c36a1-f987-4084-9119-f279b317c72f_snat","12"],["ct-zone-9498aca9
-762
3-4ce0-a0ff-d4d5c17d7223_dnat","19"],["ct-zone-9498aca9-7623-4ce0-a0ff-d4d5c17d7223_snat","15"],["ct-zone-9c373522-fd02-424f-a2b3-14dc359062d2_dnat","18"],["ct-zone-9c373522-fd02-424f-a2b3-14dc359062d2_snat","17"],["ct-zone-a28b45db-2dfb-4d38-905c-c5eb44da8c9c_dnat","13"],["ct-zone-a28b45db-2dfb-4d38-905c-c5eb44da8c9c_snat","10"],["ct-zone-b1e8636a-5cf8-48ba-9693-793a59e5430d_dnat","8"],["ct-zone-b1e8636a-5cf8-48ba-9693-793a59e5430d_snat","14"],["ct-zone-bbcc6e17-ee1e-4e82-b404-1dd0f1307002_dnat","12"],["ct-zone-bbcc6e17-ee1e-4e82-b404-1dd0f1307002_snat","11"],["ct-zone-bd3b86b7-2aba-4ff7-a5f7-975612692aca_dnat","13"],["ct-zone-bd3b86b7-2aba-4ff7-a5f7-975612692aca_snat","10"],["ct-zone-cb94affd-f2aa-4bdd-9407-1e16ac046596_dnat","9"],["ct-zone-cb94affd-f2aa-4bdd-9407-1e16ac046596_snat","1"],["ct-zone-ce71f6db-4dab-41ca-bd10-cd6204687b9d_dnat","16"],["ct-zone-ce71f6db-4dab-41ca-bd10-cd6204687b9d_snat","15"],["ct-zone-cfa46699-cc79-445e-a902-f1e37ff99806_dnat","5"],["ct-zone-cfa466
99-c
c79-445e-a902-f1e37ff99806_snat","2"],["ct-zone-cr-lr_vcn0747157-ls_vcn0747157_external_ugw","9"],["ct-zone-cr-lr_vcn1645571_igw-ls_vcn1645571_external_igw","21"],["ct-zone-cr-lr_vcn7319607-ls_vcn7319607_external_ugw","14"],["ct-zone-cr-lr_vcn7319607_igw-ls_vcn7319607_external_igw","21"],["ct-zone-cr-lr_vcn7395327_igw-ls_vcn7395327_external_igw","21"],["ct-zone-cr-lr_vcn9567153-ls_vcn9567153_external_ugw","1"],["ct-zone-d0232f68-8d26-454c-87bf-e79066a1ed62_dnat","9"],["ct-zone-d0232f68-8d26-454c-87bf-e79066a1ed62_snat","8"],["ct-zone-d161aaef-e73e-452c-9d77-f465718f1f67_dnat","3"],["ct-zone-d161aaef-e73e-452c-9d77-f465718f1f67_snat","6"],["ct-zone-e2f0a229-15b0-4255-b52d-71b078239ed2_dnat","12"],["ct-zone-e2f0a229-15b0-4255-b52d-71b078239ed2_snat","13"],["ct-zone-e6986bf4-e813-4df0-9bfe-1de95ceb2e30_dnat","15"],["ct-zone-e6986bf4-e813-4df0-9bfe-1de95ceb2e30_snat","14"],["ct-zone-e93b7a93-8507-4036-8281-f2be764a44da_dnat","16"],["ct-zone-e93b7a93-8507-4036-8281-f2be764a44da_snat"
,"17
"],["ct-zone-f3b9843a-d498-41dc-8244-0f87d9bc1384_dnat","6"],["ct-zone-f3b9843a-d498-41dc-8244-0f87d9bc1384_snat","7"],["ct-zone-f42fcb51-0af6-426f-974b-1478a169a70c_dnat","13"],["ct-zone-f42fcb51-0af6-426f-974b-1478a169a70c_snat","11"],["ct-zone-f708c12e-34b6-4657-b7d0-4b5ac5e0d6c7_dnat","20"],["ct-zone-f708c12e-34b6-4657-b7d0-4b5ac5e0d6c7_snat","19"],["ct-zone-ln-ls_vcn6603036_external_ugw","7"],["ct-zone-ln-ls_vcn7319607_external_igw","20"],["ct-zone-ln-ls_vcn7395327_external_ugw","7"],["ct-zone-ln-ls_vcn7836024_external_igw","20"],["ct-zone-ln-ls_vcn9567153_external_igw","21"],["ct-zone-ln-ls_vcn9567153_external_ugw","8"]]]}},"_comment":"ovn-controller:
modifying OVS tunnels 'pcacn001'"}
In which OVN version are you seeing this?
ovs-vsctl -V
ovs-vsctl (Open vSwitch) 2.14.0_r0.0.0
DB Schema 8.2.0
# ovn-nbctl -V
ovn-nbctl 20.09.0_r1.0.0
Open vSwitch Library 2.14.0
DB Schema 5.27.0
I wonder if you're seeing this issue -
https://github.com/ovn-org/ovn/commit/e7788554a7f5e824fc0d8afc6cbf20e94fe4245f
I have to step out for a bit; I will look at this when I can.
What I can say is that we are using ovsdbapp to configure the central DBs, and I see
/etc/openvswitch/conf.db getting up to several GB, so much so that systemd times out
when you try to start the service using it.
I am also seeing ovs-vswitchd getting a SEGV on a regular basis, which I think is related.
I am wondering if this patch might help:
[ovs-dev] [PATCH branch-2.14] python: idl: Avoid sending transactions when the DB is not
synced up.
I'm not sure. /etc/openvswitch/conf.db is the local ovsdb-server database
and not the OVN database.
Numan
If you run a tail on /etc/openvswitch/conf.db, do you see the ct zone IDs toggling
between two values constantly?
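Something along these lines would show it (a rough sketch, assuming the integration bridge is br-int; the ct-zone map is kept in its external_ids, as in the record you pasted):
# ovs-vsctl get Bridge br-int external_ids | tr ',' '\n' | grep ct-zone
# tail -f /etc/openvswitch/conf.db | grep -o '"ct-zone[^]]*'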
Thanks
Numan
Thanks
Brendan
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss