Hi Elias, Thank you for the thorough analysis. I think the best approach for now is the one you propose. Reserve as many slots as you have workers. Potentially also increase the queue size > 64.
Damjan is looking at some further improvements in this space, but for now please go with what you propose. Best regards, Ole > On 15 Apr 2020, at 14:26, Elias Rudberg <elias.rudb...@bahnhof.net> wrote: > > Hello VPP experts, > > We are using VPP for NAT44 and last week we encountered a problem where > some VPP threads stopped forwarding traffic. We saw the problem on two > separate VPP servers within a short time, apparently it was triggered > by some specific kind of out2in traffic that arrived at that time. > > As far as I can tell, this issue exists in both the current master > branch and in the 1908 and 2001 branches. > > After investigating and finally being able to reproduce the problem in > a lab setting, we came to the following conclusion about what happened: > > The scenario where this happens is that several threads (8 threads in > our case) are used for NAT and the frame queues for handoff between > threads are being congested for some of the threads. This can be > triggered for example by "garbage" out2in traffic that comes in at some > port, if much of the out2in traffic has the same destination port then > much of the traffic will be handed off to the same thread, since the > out2in handoff thread index is decided based on the dest port. It > doesn't matter if the traffic belongs to any existing NAT sessions or > not, since handoff must be done before checking that and the problem is > related to the handoff. > > When a frame queue is congested, that is supposed to be detected by the > is_vlib_frame_queue_congested() call in > vlib_buffer_enqueue_to_thread(). However, that check is not completely > reliable since other threads may add things to the queue after the > check. For example, it can happen that two threads call > is_vlib_frame_queue_congested() simultaneously and both come to the > conclusion that the queue is not congested when in fact it will be > congested when one of them has added to the queue giving trouble for > the other thread. This problem is to some extent mitigated by the fact > that the check in is_vlib_frame_queue_congested() uses a > "queue_hi_thresh" value that is set slightly lower than the number of > elements in the queue, it is set like this: > > fqm->queue_hi_thresh = frame_queue_nelts - 2; > > The -2 there means that things are still OK if two threads call > is_vlib_frame_queue_congested() simultaneously, but if three or four > threads do it simultaneously we are anyway in trouble, and that seems > to be what happened on our VPP servers last week. This leads to one or > more threads being stuck in an infinite loop, in the loop that looks > like this in vlib_get_frame_queue_elt(): > > /* Wait until a ring slot is available */ > while (new_tail >= fq->head_hint + fq->nelts) > vlib_worker_thread_barrier_check (); > > The loop above is supposed to end when a different thread changes the > value of the volatile variable fq->head_hint but that will not happen > if the other thread is also stuck in this loop. We get a deadlock, A is > waiting for B and B is waiting for A. In the context of NAT, thread A > wants to handoff something to thread B at the same time as thread B > wants to handoff something to thread A, while at the same time their > frame queues are congested. This leads to those two threads being stuck > in the loop forever, each of them waiting for the other one. > > To me it looks like the subtraction by 2 when setting queue_hi_thresh > is just an ad hoc choice, there is no reason why 2 would be enough. I > think that to make it safe, we need to subtract the number of threads. > Essentially, we need to ensure that there is room for each thread to > reserve one extra element in the queue so that no thread can get stuck > waiting in the loop above. I tested this by hard-coding -8 instead of > -2 and then the problem cannot be reproduced anymore, so that fix seems > to work. The frame_queue_nelts value is 64 so using -8 means that the > queue is considered congested already 56 instead of 62 as it is now. > > What do you think, is it a good solution to check the number of threads > and use that to set "fqm->queue_hi_thresh = frame_queue_nelts - > n_threads;"? > > Best regards, > Elias >
-=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#16088): https://lists.fd.io/g/vpp-dev/message/16088 Mute This Topic: https://lists.fd.io/mt/73030838/21656 Group Owner: vpp-dev+ow...@lists.fd.io Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-