Numan,

We enabled conditional monitoring on the affected machines and still
see the issue.
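
One way to check whether the recomputes are still occurring after a
configuration change is to scan the ovn-controller log for the
inc_proc_eng lines quoted further down this thread. The following is a
minimal Python sketch, not part of OVN: the regular expression is
derived only from the log lines shown in this thread, and the function
name summarize_recomputes is hypothetical.

```python
import re
from statistics import mean

# Pattern derived from the inc_proc_eng log lines quoted in this thread
# (an assumption: other OVN versions may format these lines differently).
RECOMPUTE_RE = re.compile(
    r"^(?P<ts>\S+)\|\d+\|inc_proc_eng\|INFO\|node: (?P<node>\w+), "
    r"recompute \(missing handler for input (?P<input>\w+)\) "
    r"took (?P<ms>\d+)ms"
)


def summarize_recomputes(lines):
    """Return (count, mean_ms, max_ms) for matching lines, or None."""
    durations = []
    for line in lines:
        match = RECOMPUTE_RE.match(line.strip())
        if match:
            durations.append(int(match.group("ms")))
    if not durations:
        return None
    return len(durations), mean(durations), max(durations)


# Two of the log lines from this thread, unwrapped onto single lines.
sample = [
    "2024-05-08T12:12:35.596Z|30138|inc_proc_eng|INFO|node: "
    "logical_flow_output, recompute (missing handler for input SB_dns) "
    "took 7950ms",
    "2024-05-08T14:50:29.747Z|30312|inc_proc_eng|INFO|node: "
    "logical_flow_output, recompute (missing handler for input SB_dns) "
    "took 8634ms",
]

print(summarize_recomputes(sample))
```

In practice the lines would come from the ovn-controller log file
rather than from a hard-coded list.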

It seems crazy that a protocol as important as DNS is blocked on the
main thread.  Can someone in the community help with a patch to remove
the mutex on DNS handling?  How safe would that ultimately be?  I'm
sure it is there for a reason.
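
To illustrate the blocking described in the quoted reply below, here is
a toy Python sketch. It is not OVN code: state_lock, handle_dns_packet
and measure_blocking are hypothetical names, and the sleep merely stands
in for a long recompute that holds a mutex shared with the packet
handling thread.

```python
import threading
import time

# Hypothetical stand-in for the mutex shared by the two threads.
state_lock = threading.Lock()


def handle_dns_packet():
    """Stand-in for the DNS handler; returns seconds spent waiting."""
    start = time.monotonic()
    with state_lock:
        # The handler can only proceed once the lock is free.
        waited = time.monotonic() - start
    return waited


def measure_blocking(recompute_s=0.2):
    """Hold the lock for recompute_s seconds and time the handler."""
    lock_taken = threading.Event()

    def recompute():
        # Stand-in for a full recompute that holds the lock throughout.
        with state_lock:
            lock_taken.set()
            time.sleep(recompute_s)

    worker = threading.Thread(target=recompute)
    worker.start()
    lock_taken.wait()  # make sure the "recompute" owns the lock first
    waited = handle_dns_packet()
    worker.join()
    return waited


print(f"handler waited {measure_blocking():.2f}s for the lock")
```

The handler's wait scales with how long the lock is held, which is why
either shortening the recompute or avoiding the lock for DNS handling
would reduce the delay. Whether the lock can be dropped safely depends
on what data it protects, which this sketch does not model.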

I'm not sure where to go from here to be honest.

Gav

On Wed, 8 May 2024 at 16:51, Numan Siddique <num...@ovn.org> wrote:
>
> On Wed, May 8, 2024 at 6:01 PM Gavin McKee via discuss
> <ovs-discuss@openvswitch.org> wrote:
> >
> > @Numan Siddique
> >
> > If we enable conditional monitoring here, will this help?
> > How do transport zones help with something like this?  Do they
> > limit the amount of processing?  We only have a single VM on this
> > node, so only a single LSP, Logical Switch, etc. is actually needed
> > or used.
> > Would v24.03.0
> > https://github.com/ovn-org/ovn/commit/1622526ff2102525e1bbf2ca262842c71d6b9b33
> > help here?
> >
>
> I think conditional monitoring should work, as ovn-controller will not
> get updates for changes it's not interested in.
> You can easily test this out.  I'm not sure about the transport zones.
> Even if you use transport zones, conditional monitoring should be
> enabled, otherwise ovn-controller will get updates for all SB DB
> changes.
>
> I don't think the 24.03 commit you linked to would help in your case,
> as the issue is with the ovn-controller main thread blocking the
> pinctrl thread from processing the DNS packet.
>
> Numan
>
> > Gav
> >
> > On Wed, 8 May 2024 at 14:43, Gavin McKee <gavmcke...@googlemail.com> wrote:
> > >
> > > Ok so
> > >
> > > > 1. Customers depend on the internal DNS records, so this is needed
> > > > for production operations.
> > > > 2. I can take a look at the updates.  Would using conditional
> > > > monitoring work here?  We have ovn-monitor-all=true; would this
> > > > help at all?
> > > > 3 & 4. Is that something the community can help with?  Is that a
> > > > viable long-term fix we could maybe get a patch for?
> > >
> > > Gav
> > >
> > > On Wed, 8 May 2024 at 14:30, Numan Siddique <num...@ovn.org> wrote:
> > > >
> > > > On Wed, May 8, 2024 at 3:20 PM Gavin McKee via discuss
> > > > <ovs-discuss@openvswitch.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > Can someone help me understand why this issue occurs?
> > > > >
> > > > >
> > > > > ovn-controller 23.09.1
> > > > > Open vSwitch Library 3.2.2
> > > > >
> > > > > We have an issue with some machines intermittently unable to resolve
> > > > > DNS for external domains (example dig +noall +answer
> > > > > harmonic-openai-canada.openai.azure.com)
> > > > >
> > > > > In the OVN controller log I see the following
> > > > >
> > > > > 2024-05-08T12:12:35.596Z|30138|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 7950ms
> > > > > 2024-05-08T14:50:29.747Z|30312|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8634ms
> > > > > 2024-05-08T14:50:46.673Z|30329|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8774ms
> > > > > 2024-05-08T14:54:40.781Z|30353|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8535ms
> > > > > 2024-05-08T14:58:43.381Z|30433|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8541ms
> > > > > 2024-05-08T14:58:56.802Z|30488|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8820ms
> > > > > 2024-05-08T15:02:50.704Z|30512|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8739ms
> > > > > 2024-05-08T15:03:05.206Z|30529|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8686ms
> > > > > 2024-05-08T15:08:39.441Z|30569|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 9167ms
> > > > > 2024-05-08T15:09:09.152Z|30603|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8985ms
> > > > > 2024-05-08T15:12:14.361Z|30632|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8569ms
> > > > > 2024-05-08T15:13:52.535Z|30705|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8764ms
> > > > > 2024-05-08T15:14:53.989Z|30732|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8802ms
> > > > > 2024-05-08T15:16:30.911Z|30757|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 8776ms
> > > > > 2024-05-08T15:17:09.371Z|30784|inc_proc_eng|INFO|node:
> > > > > logical_flow_output, recompute (missing handler for input SB_dns) took
> > > > > 9062ms
> > > > >
> > > > > > Why would this happen, and is there something I can do about it?
> > > > > > Are more logs needed?
> > > >
> > > > > This indicates that your deployment is creating, updating or
> > > > > deleting a DNS row in the Northbound database, and in turn
> > > > > ovn-northd is updating the SB DNS rows.
> > > > > When ovn-controller receives the Southbound DNS updates, it falls
> > > > > back to a full recompute because we are not handling these changes
> > > > > incrementally.  Since OVN native DNS is configured in your
> > > > > deployment, each DNS packet is sent to ovn-controller for lookup.
> > > > > Even though a separate pinctrl thread handles packet-ins, DNS
> > > > > handling is blocked until the main ovn-controller thread releases
> > > > > a mutex [1].
> > > >
> > > > > There are a few ways to resolve this:
> > > > >
> > > > > 1. Disable native OVN DNS if you're not using this feature.  To
> > > > > disable it, don't create any DNS records in the OVN Northbound DB.
> > > > > 2. Investigate why your deployment is updating the NB DNS table and
> > > > > avoid it if it's not required.
> > > > > 3. Implement a handler for SB DNS so that ovn-controller does not
> > > > > fall back to a full recompute.
> > > > > 4. Avoid locking on the mutex for DNS handling in the pinctrl
> > > > > thread [1].
> > > > >
> > > > > (3) or (4) requires code changes.
> > > >
> > > > Thanks
> > > > Numan
> > > >
> > > > [1] - 
> > > > https://github.com/ovn-org/ovn/blob/main/controller/pinctrl.c#L3807
> > > >
> > > >
> > > > >
> > > > > Gav
> > > > > _______________________________________________
> > > > > discuss mailing list
> > > > > disc...@openvswitch.org
> > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> > > > >