We are using VPP 19.08 for NAT (nat44) and are struggling with the
following problem: it works seemingly fine for a while, like several
days or weeks, but then VPP suddenly stops forwarding traffic. Even
pinging the "outside" IP address fails.

The VPP process is still running, so we try to investigate further
using vppctl, enabling a packet trace as follows:

clear trace
trace add rdma-input 5

then pinging the "outside" address and running "show trace".
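For completeness, the full sequence from the shell looks like this (the
packet count of 5 is arbitrary; rdma-input is the input node on our
hardware):

```
vppctl clear trace
vppctl trace add rdma-input 5
# ping the "outside" address from another host, then:
vppctl show trace
```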

To see the normal behavior we have compared with another server running
VPP where the strange problem does not occur. There, the normal behavior
is that one worker starts processing the packet and then does
NAT44_OUT2IN_WORKER_HANDOFF, after which another worker takes over: the
trace shows "handoff_trace" and then "HANDED-OFF: from thread...", and
that worker continues processing the packet.
The relevant parts of the trace look like this (abbreviated to show
only node names and handoff info) for a case where thread 8 hands off
work to thread 3:

------------------- Start of thread 3 vpp_wk_2 -------------------
Packet 1

08:15:10:781992: handoff_trace
  HANDED-OFF: from thread 8 trace index 0
08:15:10:781992: nat44-out2in
08:15:10:782008: ip4-lookup
08:15:10:782009: ip4-local
08:15:10:782010: ip4-icmp-input
08:15:10:782011: ip4-icmp-echo-request
08:15:10:782011: ip4-load-balance
08:15:10:782013: ip4-rewrite
08:15:10:782014: BondEthernet0-output

------------------- Start of thread 8 vpp_wk_7 -------------------
Packet 1

08:15:10:781986: rdma-input
08:15:10:781988: bond-input
08:15:10:781989: ethernet-input
08:15:10:781989: ip4-input
08:15:10:781990: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

The above is what it looks like normally. The problem is that
sometimes, for some reason, the handoff stops working: we only see the
initial processing by one worker, ending with that worker reporting
NAT44_OUT2IN_WORKER_HANDOFF, but the other worker never picks up the
work; it is seemingly ignored.

Here is what it looks like when the problem has happened, with thread
7 trying to hand off to thread 3:

------------------- Start of thread 3 vpp_wk_2 -------------------
No packets in trace buffer

------------------- Start of thread 7 vpp_wk_6 -------------------
Packet 1

08:38:41:904654: rdma-input
08:38:41:904656: bond-input
08:38:41:904658: ethernet-input
08:38:41:904660: ip4-input
08:38:41:904663: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

So in this case, too, work is handed off to thread 3, but thread 3 does
not pick it up. There is no "HANDED-OFF" message in the trace at all,
for any worker; the handed-off work seems to have been ignored. It is
then of course understandable that ping and packet forwarding do not
work. The question is: why does that hand-off procedure fail?

Are there some known reasons that can cause this behavior?

When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
trace, should there always be a corresponding "HANDED-OFF" message for
another thread picking it up?

One more question related to the above: sometimes, when looking at
traces of ICMP packets to investigate this problem, we have seen a
worker apparently handing off work to itself, which seems strange.
Example:

------------------- Start of thread 3 vpp_wk_2 -------------------
Packet 1

08:31:23:871274: rdma-input
08:31:23:871279: bond-input
08:31:23:871282: ethernet-input
08:31:23:871285: ip4-input
08:31:23:871289: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

If the purpose of a "handoff" is to let another thread take over, then
this seems strange by itself (even without considering that there is no
"HANDED-OFF" for any thread): why would thread 3 try to hand off work
to itself? Does that indicate something wrong, or are there legitimate
cases where a thread "hands off" work to itself?

We have encountered this problem several times, but unfortunately we
have not yet found a way to reproduce it in a lab environment, and we
do not know exactly what triggers it. On previous occasions, restarting
VPP made it work normally again.

Any input on this or ideas for how to troubleshoot further would be
much appreciated.

Best regards,
Elias
View/Reply Online (#14602): https://lists.fd.io/g/vpp-dev/message/14602