Hi Elias,

Thank you for the thorough analysis.
I think the best approach for now is the one you propose. Reserve as many slots 
as you have workers.
Potentially also increase the queue size > 64.

Damjan is looking at some further improvements in this space, but for now 
please go with what you propose.

Best regards,
Ole

> On 15 Apr 2020, at 14:26, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
> 
> Hello VPP experts,
> 
> We are using VPP for NAT44 and last week we encountered a problem where
> some VPP threads stopped forwarding traffic. We saw the problem on two
> separate VPP servers within a short time, apparently it was triggered
> by some specific kind of out2in traffic that arrived at that time.
> 
> As far as I can tell, this issue exists in both the current master
> branch and in the 1908 and 2001 branches.
> 
> After investigating and finally being able to reproduce the problem in
> a lab setting, we came to the following conclusion about what happened:
> 
> The scenario where this happens is that several threads (8 threads in
> our case) are used for NAT and the frame queues for handoff between
> threads are being congested for some of the threads. This can be
> triggered for example by "garbage" out2in traffic that comes in at some
> port, if much of the out2in traffic has the same destination port then
> much of the traffic will be handed off to the same thread, since the
> out2in handoff thread index is decided based on the dest port. It
> doesn't matter if the traffic belongs to any existing NAT sessions or
> not, since handoff must be done before checking that and the problem is
> related to the handoff.
> 
> When a frame queue is congested, that is supposed to be detected by the
> is_vlib_frame_queue_congested() call in
> vlib_buffer_enqueue_to_thread(). However, that check is not completely
> reliable since other threads may add things to the queue after the
> check. For example, it can happen that two threads call
> is_vlib_frame_queue_congested() simultaneously and both come to the
> conclusion that the queue is not congested when in fact it will be
> congested when one of them has added to the queue giving trouble for
> the other thread. This problem is to some extent mitigated by the fact
> that the check in is_vlib_frame_queue_congested() uses a
> "queue_hi_thresh" value that is set slightly lower than the number of
> elements in the queue, it is set like this:
> 
> fqm->queue_hi_thresh = frame_queue_nelts - 2;
> 
> The -2 there means that things are still OK if two threads call
> is_vlib_frame_queue_congested() simultaneously, but if three or four
> threads do it simultaneously we are anyway in trouble, and that seems
> to be what happened on our VPP servers last week. This leads to one or
> more threads being stuck in an infinite loop, in the loop that looks
> like this in vlib_get_frame_queue_elt():
> 
>  /* Wait until a ring slot is available */
>  while (new_tail >= fq->head_hint + fq->nelts)
>    vlib_worker_thread_barrier_check ();
> 
> The loop above is supposed to end when a different thread changes the
> value of the volatile variable fq->head_hint but that will not happen
> if the other thread is also stuck in this loop. We get a deadlock, A is
> waiting for B and B is waiting for A. In the context of NAT, thread A
> wants to handoff something to thread B at the same time as thread B
> wants to handoff something to thread A, while at the same time their
> frame queues are congested. This leads to those two threads being stuck
> in the loop forever, each of them waiting for the other one.
> 
> To me it looks like the subtraction by 2 when setting queue_hi_thresh
> is just an ad hoc choice, there is no reason why 2 would be enough. I
> think that to make it safe, we need to subtract the number of threads.
> Essentially, we need to ensure that there is room for each thread to
> reserve one extra element in the queue so that no thread can get stuck
> waiting in the loop above. I tested this by hard-coding -8 instead of
> -2 and then the problem cannot be reproduced anymore, so that fix seems
> to work. The frame_queue_nelts value is 64 so using -8 means that the
> queue is considered congested already 56 instead of 62 as it is now.
> 
> What do you think, is it a good solution to check the number of threads
> and use that to set "fqm->queue_hi_thresh = frame_queue_nelts -
> n_threads;"?
> 
> Best regards,
> Elias
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#16088): https://lists.fd.io/g/vpp-dev/message/16088
Mute This Topic: https://lists.fd.io/mt/73030838/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to