Hello VPP experts,

We are using VPP for NAT44 and last week we encountered a problem where
some VPP threads stopped forwarding traffic. We saw the problem on two
separate VPP servers within a short time, apparently it was triggered
by some specific kind of out2in traffic that arrived at that time.

As far as I can tell, this issue exists in both the current master
branch and in the 1908 and 2001 branches.

After investigating and finally being able to reproduce the problem in
a lab setting, we came to the following conclusion about what happened:

The scenario where this happens is that several threads (8 threads in
our case) are used for NAT and the frame queues for handoff between
threads are being congested for some of the threads. This can be
triggered for example by "garbage" out2in traffic that comes in at some
port, if much of the out2in traffic has the same destination port then
much of the traffic will be handed off to the same thread, since the
out2in handoff thread index is decided based on the dest port. It
doesn't matter if the traffic belongs to any existing NAT sessions or
not, since handoff must be done before checking that and the problem is
related to the handoff.

When a frame queue is congested, that is supposed to be detected by the
is_vlib_frame_queue_congested() call in
vlib_buffer_enqueue_to_thread(). However, that check is not completely
reliable since other threads may add things to the queue after the
check. For example, it can happen that two threads call
is_vlib_frame_queue_congested() simultaneously and both come to the
conclusion that the queue is not congested when in fact it will be
congested when one of them has added to the queue giving trouble for
the other thread. This problem is to some extent mitigated by the fact
that the check in is_vlib_frame_queue_congested() uses a
"queue_hi_thresh" value that is set slightly lower than the number of
elements in the queue, it is set like this:

fqm->queue_hi_thresh = frame_queue_nelts - 2;

The -2 there means that things are still OK if two threads call
is_vlib_frame_queue_congested() simultaneously, but if three or four
threads do it simultaneously we are anyway in trouble, and that seems
to be what happened on our VPP servers last week. This leads to one or
more threads being stuck in an infinite loop, in the loop that looks
like this in vlib_get_frame_queue_elt():

  /* Wait until a ring slot is available */
  while (new_tail >= fq->head_hint + fq->nelts)
    vlib_worker_thread_barrier_check ();

The loop above is supposed to end when a different thread changes the
value of the volatile variable fq->head_hint but that will not happen
if the other thread is also stuck in this loop. We get a deadlock, A is
waiting for B and B is waiting for A. In the context of NAT, thread A
wants to handoff something to thread B at the same time as thread B
wants to handoff something to thread A, while at the same time their
frame queues are congested. This leads to those two threads being stuck
in the loop forever, each of them waiting for the other one.

To me it looks like the subtraction by 2 when setting queue_hi_thresh
is just an ad hoc choice, there is no reason why 2 would be enough. I
think that to make it safe, we need to subtract the number of threads.
Essentially, we need to ensure that there is room for each thread to
reserve one extra element in the queue so that no thread can get stuck
waiting in the loop above. I tested this by hard-coding -8 instead of
-2 and then the problem cannot be reproduced anymore, so that fix seems
to work. The frame_queue_nelts value is 64 so using -8 means that the
queue is considered congested already 56 instead of 62 as it is now.

What do you think, is it a good solution to check the number of threads
and use that to set "fqm->queue_hi_thresh = frame_queue_nelts -
n_threads;"?

Best regards,
Elias
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#16083): https://lists.fd.io/g/vpp-dev/message/16083
Mute This Topic: https://lists.fd.io/mt/73030838/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to