Hi, I am not sure whether my input is helpful, but we hit the same issue and resolved it in a different way. In your packet trace, the packet goes through the VPP node ip4-sv-reassembly-feature. I would suggest first enabling reassembly on the interface:
set interface reassembly <interface-name> on

Some experts may say that reassembly is enabled on the interface by default, but it is not. Please try it once and let us know whether you are still facing the problem.

Regards,
Mrityunjay Kumar
Mobile: +91-9731528504

On Wed, Dec 2, 2020 at 9:10 PM Elias Rudberg <elias.rudb...@bahnhof.net> wrote:

> Hello VPP experts,
>
> For our NAT44 usage of VPP we have encountered a problem with VPP
> running out of memory, which now, after much headache and many
> out-of-memory crashes over the past several months, has turned out to
> be caused by an infinite loop where VPP gets stuck repeating the three
> nodes ip4-lookup, ip4-local and nat44-hairpinning. A single packet
> gets passed around and around between those three nodes, eating more
> and more memory, which causes that worker thread to get stuck and VPP
> to run out of memory after a few seconds. (Earlier we speculated that
> it was due to a memory leak but now it seems it was not.)
>
> This concerns the current master branch as well as the stable/2009
> branch, and earlier VPP versions as well.
>
> One scenario when this happens is when a UDP (or TCP) packet is sent
> from a client on the inside with a destination IP address that matches
> an existing static NAT mapping that maps that IP address on the inside
> to the same IP address on the outside.
>
> Then, the problem can be triggered for example by doing this from a
> client on the inside, where DESTINATION_IP is the IP address of such a
> static mapping:
>
> echo hello > /dev/udp/$DESTINATION_IP/33333
>
> Here is the packet trace for the thread that receives the packet at
> rdma-input:
>
> ----------
>
> Packet 42
>
> 00:03:07:636840: rdma-input
>   rdma: Interface179 (4) next-node bond-input l2-ok l3-ok l4-ok ip4 udp
> 00:03:07:636841: bond-input
>   src d4:6a:35:52:30:db, dst 02:fe:8d:23:60:a7, Interface179 ->
>   BondEthernet0
> 00:03:07:636843: ethernet-input
>   IP4: d4:6a:35:52:30:db -> 02:fe:8d:23:60:a7 802.1q vlan 1013
> 00:03:07:636844: ip4-input
>   UDP: SOURCE_IP_INSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0xe7e3 dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 48824 -> 33333
>     length 14, checksum 0x781e
> 00:03:07:636846: ip4-sv-reassembly-feature
>   [not-fragmented]
> 00:03:07:636847: nat44-in2out-worker-handoff
>   NAT44_IN2OUT_WORKER_HANDOFF : next-worker 8 trace index 41
>
> ----------
>
> So it is doing handoff to thread 8 with trace index 41. Nothing wrong
> so far, I think.
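For anyone trying to reproduce or dig further: per-thread traces like the ones in this mail can be captured with the standard vppctl debug CLI, and the reassembly toggle suggested above is a one-liner. A rough sketch of the commands involved (the interface name is a placeholder, and the static-mapping line is only a guess at how a mapping matching Elias's scenario might look, not taken from his configuration):

  vppctl set interface reassembly <interface-name> on
  vppctl nat44 add static mapping local $DESTINATION_IP external $DESTINATION_IP
  vppctl clear trace
  vppctl trace add rdma-input 50
  vppctl show trace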
>
> Here is the beginning of the corresponding packet trace for the
> receiving thread:
>
> ----------
>
> Packet 57
>
> 00:03:07:636850: handoff_trace
>   HANDED-OFF: from thread 7 trace index 41
> 00:03:07:636850: nat44-in2out
>   NAT44_IN2OUT_FAST_PATH: sw_if_index 6, next index 3, session -1
> 00:03:07:636855: nat44-in2out-slowpath
>   NAT44_IN2OUT_SLOW_PATH: sw_if_index 6, next index 0, session 11
> 00:03:07:636927: ip4-lookup
>   fib 0 dpo-idx 577 flow hash: 0x00000000
>   UDP: SOURCE_IP_OUTSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0x5eee dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 63957 -> 33333
>     length 14, checksum 0xb40b
> 00:03:07:636930: ip4-local
>   UDP: SOURCE_IP_OUTSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0x5eee dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 63957 -> 33333
>     length 14, checksum 0xb40b
> 00:03:07:636932: nat44-hairpinning
>   new dst addr DESTINATION_IP port 33333 fib-index 0 is-static-mapping
> 00:03:07:636934: ip4-lookup
>   fib 0 dpo-idx 577 flow hash: 0x00000000
>   UDP: SOURCE_IP_OUTSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0x5eee dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 63957 -> 33333
>     length 14, checksum 0xb40b
> 00:03:07:636936: ip4-local
>   UDP: SOURCE_IP_OUTSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0x5eee dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 63957 -> 33333
>     length 14, checksum 0xb40b
> 00:03:07:636937: nat44-hairpinning
>   new dst addr DESTINATION_IP port 33333 fib-index 0 is-static-mapping
> 00:03:07:636937: ip4-lookup
>   fib 0 dpo-idx 577 flow hash: 0x00000000
>   UDP: SOURCE_IP_OUTSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0x5eee dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 63957 -> 33333
>     length 14, checksum 0xb40b
> 00:03:07:636939: ip4-local
>   UDP: SOURCE_IP_OUTSIDE -> DESTINATION_IP
>     tos 0x00, ttl 63, length 34, checksum 0x5eee dscp CS0 ecn NON_ECN
>     fragment id 0x50fe, flags DONT_FRAGMENT
>   UDP: 63957 -> 33333
>     length 14, checksum 0xb40b
> 00:03:07:636940: nat44-hairpinning
>   new dst addr DESTINATION_IP port 33333 fib-index 0 is-static-mapping
>
> ...
>
> ... and so on. In principle it never ends. To get this trace I had
> added a hack in nat44-hairpinning to stop when my added debug counter
> exceeded a few thousand. Without that, it seems to loop forever and
> that worker thread gets stuck.
>
> What happens seems to be that the nat44-hairpinning node determines
> that there is an existing session and then decides the packet should
> go to the ip4-lookup node, followed by ip4-local, followed by the
> nat44-hairpinning node which makes the same decision again, so it
> just goes round and round like that. Inside the snat_hairpinning()
> function it always comes to the "Destination is behind the same NAT,
> use internal address and port" part and returns 1 there, which causes
> it to choose the ip4-lookup node as next node, even if nothing
> actually changed.
>
> I have a fix that involves changing the snat_hairpinning() function so
> that it checks if nothing has changed and in that case returns 0,
> effectively breaking the otherwise infinite loop, but I am not
> convinced that this is a good solution; it feels a bit like an ugly
> fix even though it seems to solve the problem in practice.
>
> Two different questions related to this:
>
> (1) The specific NAT hairpinning issue, what should be done to handle
> it properly?
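For what it is worth, here is a minimal standalone sketch of the kind of guard the fix described above amounts to: check whether the hairpin rewrite would actually change anything and, if not, report "nothing changed" so the packet is not sent back to ip4-lookup. The types and the function below are simplified stand-ins invented for illustration; this is not the real snat_hairpinning() code.

/* Standalone illustration only -- not the actual VPP NAT44 code. */
#include <stdint.h>

typedef struct { uint32_t dst_address; } ip4_hdr_sketch_t; /* stand-in for ip4 header */
typedef struct { uint16_t dst_port; } udp_hdr_sketch_t;    /* stand-in for udp header */

/* Returns 1 if the packet was rewritten and should revisit ip4-lookup,
 * 0 if the destination already is the internal address/port, which is
 * exactly the case that otherwise keeps the packet looping. */
static int
hairpin_rewrite_sketch (ip4_hdr_sketch_t *ip, udp_hdr_sketch_t *udp,
                        uint32_t internal_addr, uint16_t internal_port)
{
  if (ip->dst_address == internal_addr && udp->dst_port == internal_port)
    return 0;                      /* nothing to change: do not recirculate */

  ip->dst_address = internal_addr; /* "use internal address and port" */
  udp->dst_port = internal_port;
  /* the real code would also fix up IP/UDP checksums and session state here */
  return 1;
}

int
main (void)
{
  /* Example values only. In Elias's scenario the inside and outside
   * addresses are the same, so even the first pass finds nothing to
   * change and returns 0 immediately. */
  ip4_hdr_sketch_t ip = { .dst_address = 0x0a000001 };
  udp_hdr_sketch_t udp = { .dst_port = 33333 };
  int pass1 = hairpin_rewrite_sketch (&ip, &udp, 0xc0a80001, 33333);
  int pass2 = hairpin_rewrite_sketch (&ip, &udp, 0xc0a80001, 33333);
  return (pass1 == 1 && pass2 == 0) ? 0 : 1;
}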
>
> (2) A more general question about detecting if VPP gets into an
> infinite loop where the same packet gets passed around a ridiculously
> large number of times among different nodes: maybe it would be a good
> idea to try to detect when that happens and give an error message
> about it instead of hanging or crashing?
>
> Best regards,
> Elias
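Regarding question (2), a rough illustration of the kind of safety valve that could be meant: a per-packet recirculation counter kept in buffer metadata, bumped every time a node feeds the packet back into the graph, with the packet dropped (and an error counter incremented) once a bound is exceeded. The field name and threshold below are invented for this sketch and do not correspond to an existing VPP mechanism.

/* Sketch only -- not an existing VPP facility. */
#include <stdint.h>
#include <stdio.h>

#define MAX_RECIRCULATIONS 64   /* arbitrary illustrative upper bound */

typedef struct
{
  uint16_t recirc_count;        /* stand-in for a spare per-buffer metadata field */
} buffer_meta_sketch_t;

/* Returns 1 when the packet should be dropped (with an error reported)
 * instead of being recirculated through the graph one more time. */
static int
recirculation_guard (buffer_meta_sketch_t *meta)
{
  if (++meta->recirc_count > MAX_RECIRCULATIONS)
    {
      fprintf (stderr, "packet recirculated %d times, dropping\n",
               meta->recirc_count);
      return 1;
    }
  return 0;
}

int
main (void)
{
  buffer_meta_sketch_t meta = { 0 };
  int dropped = 0;
  /* Simulate a packet that would otherwise loop forever. */
  while (!dropped)
    dropped = recirculation_guard (&meta);
  printf ("loop broken after %d revisits\n", meta.recirc_count);
  return 0;
}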