Hello VPP experts,

We are using VPP for NAT44, and last week we encountered a problem where some VPP threads stopped forwarding traffic. We saw the problem on two separate VPP servers within a short time; apparently it was triggered by some specific kind of out2in traffic that arrived at that time.
As far as I can tell, this issue exists in the current master branch as well as in the 1908 and 2001 branches.

After investigating and finally being able to reproduce the problem in a lab setting, we came to the following conclusion about what happened.

The problem occurs when several threads (8 in our case) are used for NAT and the frame queues used for handoff between threads become congested for some of the threads. This can be triggered, for example, by "garbage" out2in traffic arriving at some port: the out2in handoff thread index is decided based on the destination port, so if much of the out2in traffic has the same destination port, much of it is handed off to the same thread. It does not matter whether the traffic belongs to an existing NAT session, since the handoff must be done before that can be checked, and the problem is in the handoff itself.

When a frame queue is congested, that is supposed to be detected by the is_vlib_frame_queue_congested() call in vlib_buffer_enqueue_to_thread(). However, that check is not completely reliable, since other threads may add to the queue after the check. For example, two threads can call is_vlib_frame_queue_congested() simultaneously and both conclude that the queue is not congested, when in fact it becomes congested as soon as one of them has added to the queue, causing trouble for the other thread.

This problem is to some extent mitigated by the fact that the check in is_vlib_frame_queue_congested() uses a "queue_hi_thresh" value that is set slightly lower than the number of elements in the queue. It is set like this:

    fqm->queue_hi_thresh = frame_queue_nelts - 2;

The -2 means that things are still OK if two threads call is_vlib_frame_queue_congested() simultaneously, but if three or more threads do so simultaneously we are in trouble anyway, and that seems to be what happened on our VPP servers last week.
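To make the arithmetic concrete, here is a small standalone model of the race. It is only a sketch, not the VPP code: the function name model_blocked_writers and its parameters are made up for illustration, and it assumes that the congestion check passes while the queue occupancy is below the threshold, that the consumer does not advance the head in the meantime, and that a writer spins while new_tail >= head + nelts.

```c
#include <stdint.h>

/*
 * Simplified model, NOT the VPP implementation (hypothetical helper).
 *
 * nelts:   ring size (corresponds to frame_queue_nelts)
 * margin:  value subtracted from nelts to get the high threshold
 *          (assumed to satisfy margin < nelts)
 * writers: number of threads that pass the congestion check at the
 *          same moment and then each claim one slot
 *
 * Returns how many of those writers would end up spinning, assuming the
 * consumer does not advance the head while they enqueue.
 */
static uint32_t
model_blocked_writers (uint32_t nelts, uint32_t margin, uint32_t writers)
{
  uint32_t hi_thresh = nelts - margin;
  uint32_t head = 0;
  /* Worst case: the queue is one element below the threshold,
     so every writer's check still reports "not congested". */
  uint32_t tail = head + hi_thresh - 1;
  uint32_t blocked = 0;

  for (uint32_t i = 0; i < writers; i++)
    {
      /* Each writer claims the next slot by incrementing the tail. */
      uint32_t new_tail = ++tail;
      /* Same shape as the wait condition on a full ring. */
      if (new_tail >= head + nelts)
        blocked++;
    }
  return blocked;
}
```

Under these assumptions, model_blocked_writers (64, 2, 2) returns 0 and model_blocked_writers (64, 2, 3) returns 1: with a margin of 2, two simultaneous writers still fit, but a third one finds the ring full.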
This leads to one or more threads being stuck in an infinite loop, in the following loop in vlib_get_frame_queue_elt():

    /* Wait until a ring slot is available */
    while (new_tail >= fq->head_hint + fq->nelts)
      vlib_worker_thread_barrier_check ();

The loop is supposed to end when a different thread changes the value of the volatile variable fq->head_hint, but that will not happen if the other thread is also stuck in this loop. We get a deadlock: A is waiting for B and B is waiting for A. In the context of NAT, thread A wants to hand off something to thread B at the same time as thread B wants to hand off something to thread A, while both of their frame queues are congested. Those two threads are then stuck in the loop forever, each waiting for the other.

To me it looks like the subtraction of 2 when setting queue_hi_thresh is just an ad hoc choice; there is no reason why 2 would be enough. I think that to make it safe, we need to subtract the number of threads. Essentially, we need to ensure that there is room for each thread to reserve one extra element in the queue, so that no thread can get stuck waiting in the loop above.

I tested this by hard-coding -8 instead of -2, and then the problem could no longer be reproduced, so that fix seems to work. The frame_queue_nelts value is 64, so using -8 means that the queue is considered congested already at 56 elements instead of at 62 as it is now.

What do you think: is it a good solution to check the number of threads and use that to set "fqm->queue_hi_thresh = frame_queue_nelts - n_threads;"?

Best regards,
Elias