We are using VPP 19.08 for NAT (nat44) and are struggling with the following problem: it works fine for a while, typically several days or weeks, and then VPP suddenly stops forwarding traffic. Even pinging the "outside" IP address fails.
The VPP process is still running, so we investigate further using vppctl, enabling packet trace as follows:

  clear trace
  trace add rdma-input 5

then pinging "outside" and running "show trace".

To see the normal behavior we compared with another server running VPP where this strange problem does not occur. There, one worker starts processing the packet and then does NAT44_OUT2IN_WORKER_HANDOFF, after which another worker takes over: "handoff_trace" followed by "HANDED-OFF: from thread ...", and that worker continues processing the packet. The relevant parts of the trace look like this (abbreviated to show only node names and handoff info) for a case where thread 8 hands off work to thread 3:

------------------- Start of thread 3 vpp_wk_2 -------------------
Packet 1

08:15:10:781992: handoff_trace
  HANDED-OFF: from thread 8 trace index 0
08:15:10:781992: nat44-out2in
08:15:10:782008: ip4-lookup
08:15:10:782009: ip4-local
08:15:10:782010: ip4-icmp-input
08:15:10:782011: ip4-icmp-echo-request
08:15:10:782011: ip4-load-balance
08:15:10:782013: ip4-rewrite
08:15:10:782014: BondEthernet0-output

------------------- Start of thread 8 vpp_wk_7 -------------------
Packet 1

08:15:10:781986: rdma-input
08:15:10:781988: bond-input
08:15:10:781989: ethernet-input
08:15:10:781989: ip4-input
08:15:10:781990: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

The above is what it looks like normally. The problem is that sometimes, for some reason, the handoff stops working: we only see the initial processing by one worker, ending with that worker logging NAT44_OUT2IN_WORKER_HANDOFF, but the other worker never picks up the work; it is seemingly ignored. Here is what it looks like when the problem has happened, with thread 7 trying to hand off to thread 3:

------------------- Start of thread 3 vpp_wk_2 -------------------
No packets in trace buffer

------------------- Start of thread 7 vpp_wk_6 -------------------
Packet 1

08:38:41:904654: rdma-input
08:38:41:904656: bond-input
08:38:41:904658: ethernet-input
08:38:41:904660: ip4-input
08:38:41:904663: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

So, in this case too, work is handed off to thread 3, but thread 3 does not pick it up. There is no "HANDED-OFF" message anywhere in the trace, for any worker; the handed-off work appears to be ignored. It is then understandable that ping and packet forwarding fail, but the question is: why does the hand-off procedure fail? Are there known reasons that can cause this behavior? When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet trace, should there always be a corresponding "HANDED-OFF" message for another thread picking it up?

One more related question: sometimes, when looking at traces of ICMP packets to investigate this problem, we have seen a worker apparently handing work off to itself, which seems strange. Example:

------------------- Start of thread 3 vpp_wk_2 -------------------
Packet 1

08:31:23:871274: rdma-input
08:31:23:871279: bond-input
08:31:23:871282: ethernet-input
08:31:23:871285: ip4-input
08:31:23:871289: nat44-out2in-worker-handoff
  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0

If the purpose of "handoff" is to let another thread take over, this seems strange by itself (even without considering that there is no "HANDED-OFF" for any thread): why is thread 3 trying to hand off work to itself?
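For context, our rough mental model of how the handoff target is chosen is the standalone toy sketch below. To be clear, this is our own illustration and not the actual nat plugin code: the function name nat_owner_thread, the hash, and the constants are all assumptions on our part. The idea is that the thread that receives a packet is decided by the NIC's RSS hashing, while the NAT worker that owns the corresponding session is derived from a packet field such as the outside port, so the two can coincide:

/* Toy model of worker-handoff target selection -- our assumption,
 * not actual VPP/nat plugin code. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WORKERS  8   /* illustrative: workers 1..8 as in our traces */
#define FIRST_WORKER 1

/* The session owner is derived from the packet itself (here: the
 * outside port), independently of which thread RSS delivered it to. */
static uint32_t nat_owner_thread (uint16_t outside_port)
{
  return FIRST_WORKER + (outside_port % NUM_WORKERS);
}

int main (void)
{
  uint32_t rx_thread = 3;       /* thread that got the packet from the NIC */
  uint16_t outside_port = 1026; /* field assumed to select the owner */
  uint32_t owner = nat_owner_thread (outside_port);

  if (owner == rx_thread)
    printf ("self-handoff: thread %u already owns this session\n", rx_thread);
  else
    printf ("handoff: thread %u -> thread %u\n", rx_thread, owner);
  return 0;
}

If that model is roughly right, the computed owner can happen to be the very thread that received the packet, which would make a self-handoff unremarkable.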
Does that indicate something wrong, or are there legitimate cases where a thread "hands off" something to itself? We have encountered this problem several times, but unfortunately we have not yet found a way to reproduce it in a lab environment, and we do not know exactly what triggers it. On previous occasions, restarting VPP made things work normally again. Any input on this, or ideas for how to troubleshoot further, would be much appreciated.

Best regards,
Elias
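P.S. In case it helps to clarify what we are asking in the first question: our mental model of the transport between the two threads is a fixed-size queue, as in the standalone sketch below. Again, this is our own illustration with invented names (fq_enqueue, fq_dequeue, FQ_NELTS); we do not know whether it matches what vlib actually does. But if the target worker stopped draining its queue, the enqueue side would eventually have nowhere to put packets, and no "HANDED-OFF" entry would ever appear, which would match what we observe:

/* Toy model of a per-thread handoff queue -- our assumption, not vlib code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FQ_NELTS 64 /* illustrative fixed ring size */

typedef struct
{
  uint32_t ring[FQ_NELTS];
  uint64_t head; /* consumer position, advanced by the target worker */
  uint64_t tail; /* producer position, advanced by the handing-off worker */
} frame_queue_t;

/* Sender side: would run in the nat44-out2in-worker-handoff node. */
static int fq_enqueue (frame_queue_t *fq, uint32_t buffer_index)
{
  if (fq->tail - fq->head == FQ_NELTS)
    return -1; /* queue full: the packet can go nowhere */
  fq->ring[fq->tail++ % FQ_NELTS] = buffer_index;
  return 0;
}

/* Receiver side: polled by the target worker; in the real trace this is
 * where the "handoff_trace ... HANDED-OFF" entry shows up. */
static int fq_dequeue (frame_queue_t *fq, uint32_t *buffer_index)
{
  if (fq->head == fq->tail)
    return -1; /* nothing pending */
  *buffer_index = fq->ring[fq->head++ % FQ_NELTS];
  return 0;
}

int main (void)
{
  frame_queue_t fq;
  uint32_t bi;
  memset (&fq, 0, sizeof (fq));

  /* If the consumer never runs, the queue fills up and further
   * handoffs fail -- the symptom we seem to be seeing. */
  for (bi = 0; bi <= FQ_NELTS; bi++)
    if (fq_enqueue (&fq, bi) < 0)
      printf ("buffer %u could not be handed off (queue full)\n", bi);

  while (fq_dequeue (&fq, &bi) == 0)
    ; /* draining here is what would produce HANDED-OFF entries */
  return 0;
}

Again, this is only how we imagine it works; corrections are very welcome.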