Hi Florin,

While trying to fix the proxy cleanup issue, I've spotted another problem in the TCP stack, namely RSTs being ignored in the SYN_SENT (half-open) connection state: https://gerrit.fd.io/r/c/vpp/+/28103
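For context, the behaviour I'd expect there is just the RFC 793 rule for a RST arriving in SYN-SENT: the reset is honoured only if its ACK actually acknowledges our SYN (ISS < SEG.ACK <= SND.NXT), in which case the half-open connection is torn down and the application gets "connection refused"; any other RST is simply dropped. To illustrate what I mean, here is a minimal self-contained sketch of that acceptability check (all type and helper names below are made up for illustration; this is not the actual VPP code, the patch above is):

/* Illustrative sketch of RFC 793 RST handling in SYN-SENT.
 * Not the actual VPP code: half_open_conn_t and the helpers below
 * are made-up names; the flag defines just mirror the usual TCP
 * header bits. */
#include <stdbool.h>
#include <stdint.h>

#define TCP_FLAG_RST 0x04
#define TCP_FLAG_ACK 0x10

typedef struct
{
  uint32_t iss;      /* initial send sequence number (our SYN) */
  uint32_t snd_nxt;  /* next sequence number we would send */
} half_open_conn_t;

/* modular 32-bit sequence number comparisons */
static bool
seq_gt (uint32_t a, uint32_t b)
{
  return (int32_t) (a - b) > 0;
}

static bool
seq_leq (uint32_t a, uint32_t b)
{
  return (int32_t) (a - b) <= 0;
}

/* Returns true if the RST should terminate the half-open connection
 * (application sees "connection refused"); false if the segment
 * should just be dropped. */
static bool
syn_sent_rst_acceptable (const half_open_conn_t * c, uint8_t flags,
                         uint32_t seg_ack)
{
  if (!(flags & TCP_FLAG_RST))
    return false;
  /* A RST in SYN-SENT is valid only if it ACKs our SYN:
   * ISS < SEG.ACK <= SND.NXT.  A RST without a matching ACK is ignored. */
  return (flags & TCP_FLAG_ACK)
    && seq_gt (seg_ack, c->iss) && seq_leq (seg_ack, c->snd_nxt);
}

For the proxy this matters because the active side needs to hear about the failure ("connection refused" or a timeout) so it can also tear down the paired client-side session instead of letting it hang, which is what the second patch below tries to address.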
The following fix for handling failed active connections in the proxy has worked for me, but please confirm that it's correct: https://gerrit.fd.io/r/c/vpp/+/28104

After https://gerrit.fd.io/r/c/vpp/+/28096 I no longer see any "disp error" diagnostics for CLOSE_WAIT, but I still see them for SYNs in the TIME_WAIT and FIN_WAIT_2 states. I didn't enable reorder in netem; the config I'm using now is:

  ip netns exec client tc qdisc add dev client root netem \
    loss 15% delay 100ms 300ms duplicate 10%

I tried to abandon my older proxy patch that just removes the ASSERT (0), but somehow I've lost the ability to log into fd.io Gerrit with my Linux Foundation login/password, although I can still submit patches and log into identity.linuxfoundation.org just fine. Cc'ing Vanessa again here; could you please help with this?

On Tue, Jul 28, 2020 at 9:27 PM Florin Coras <fcoras.li...@gmail.com> wrote: > Hi Ivan, > > Inline. > > On Jul 28, 2020, at 8:45 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: > > Hi Florin, > thanks, the fix has worked and http_static no longer crashes. > > > Perfect, thanks for confirming! > > > I still get a number of messages like this when using release build: > /usr/bin/vpp[39]: state_sent_ok:954: BUG: couldn't send response header! > Not sure if it's actually a bug or the queue being actually full because > of the packet loss being caused by netem. > > > Not sure. Either that or there’s some issue with the send logic in some > corner case. If you have hsm->debug_level > 0 you should get a warning when > the enqueue fails because there’s no space left for sending. Note that we > probably need some logic to retry sending if enqueue fails (fill in usual > disclaimer that these are example test apps, unfortunately). I can try to > provide some guidance if anybody feels like tackling that specific problem. > > > There are also numerous messages like > /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2817: tcp conn 13836 disp > error state CLOSE_WAIT flags 0x02 SYN > but I guess that's caused by a large number of connections from the same > client => port reuse and, again, the packet loss. > > > Are you also inducing reorder? > > It seems the client closes the connection cleanly (fin) and before that is > completed, which as a first step would require a close from the app and a > move to last-ack, it sends a syn. Alternatively, reordering makes it look > that way. Those warnings are not harmful per se, because tcp will just drop > the packets, but we could try processing the syns to maybe force resets. If > you have the time to do it, try this [1] > > > I've spotted another crash in the proxy example for which I've submitted a > patch: > https://gerrit.fd.io/r/c/vpp/+/28085 > > One problem with this fix is that the proxy example is actually > incomplete, as the client connection should be dropped in case if the proxy > can't establish the corresponding outgoing connection (e.g. timeout or RST > b/c of "connection refused"), whereas currently the client-side connection > just hangs instead. I'll take a stab at fixing that as our proxy in UPF > also has problems with handling similar situations and it would be easier > for me to fix it in this simpler example first. But, in any case, crashing > VPP is probably much worse than just letting the connection hang, so maybe > we should remove that ASSERT after all. > > > Exactly. The assert there is just a reminder to eventually do the proper > cleanup. Thanks for looking into it!
> > Regards, > Florin > > [1] https://gerrit.fd.io/r/c/vpp/+/28096 > > > > On Mon, Jul 27, 2020 at 10:59 PM Florin Coras <fcoras.li...@gmail.com> > wrote: > >> Hi Ivan, >> >> Took a look at the static http server and, as far as I can tell, it has >> the same type of issue the proxy had, i.e., premature session >> cleanup/reuse. Does this solve the problem for you [1]? >> >> Also, merged your elog fix patch. Thanks! >> >> Regards, >> Florin >> >> [1] https://gerrit.fd.io/r/c/vpp/+/28076 >> >> On Jul 27, 2020, at 10:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: >> >> Hi. >> I've debugged http server issue a bit more and here are my observations: >> if I add an ASSERT(0) in the place of "No http session for thread 0 >> session_index 54", >> I get stack trace along the lines of >> Program received signal SIGABRT, Aborted. >> 0x00007ffff470bf47 in raise () from /lib/x86_64-linux-gnu/libc.so.6 >> #0 0x00007ffff470bf47 in raise () from /lib/x86_64-linux-gnu/libc.so.6 >> #1 0x00007ffff470d8b1 in abort () from /lib/x86_64-linux-gnu/libc.so.6 >> #2 0x0000000000407193 in os_panic () at /src/vpp/src/vpp/vnet/main.c:371 >> #3 0x00007ffff55ec619 in debugger () at /src/vpp/src/vppinfra/error.c:84 >> #4 0x00007ffff55ec397 in _clib_error (how_to_die=2, function_name=0x0, >> line_number=0, >> fmt=0x7fffb0e1324b "%s:%d (%s) assertion `%s' fails") >> at /src/vpp/src/vppinfra/error.c:143 >> #5 0x00007fffb0e0cc3d in http_static_server_rx_tx_callback >> (s=0x7fffb66447c0, >> cf=CALLED_FROM_RX) at >> /src/vpp/src/plugins/http_static/static_server.c:1030 >> #6 0x00007fffb0e0b437 in http_static_server_rx_callback >> (s=0x7fffb66447c0) >> at /src/vpp/src/plugins/http_static/static_server.c:1056 >> #7 0x00007ffff745a94e in app_worker_builtin_rx (app_wrk=0x7fffb59c7e00, >> s=0x7fffb66447c0) at >> /src/vpp/src/vnet/session/application_worker.c:485 >> #8 0x00007ffff745c00e in app_send_io_evt_rx (app_wrk=0x7fffb59c7e00, >> s=0x7fffb66447c0) >> at /src/vpp/src/vnet/session/application_worker.c:691 >> #9 0x00007ffff745b34e in app_worker_lock_and_send_event >> (app=0x7fffb59c7e00, >> s=0x7fffb66447c0, evt_type=0 '\000') >> at /src/vpp/src/vnet/session/application_worker.c:779 >> #10 0x00007ffff73eda5e in session_enqueue_notify_inline (s=0x7fffb66447c0) >> at /src/vpp/src/vnet/session/session.c:632 >> #11 0x00007ffff73eddbf in session_main_flush_enqueue_events >> (transport_proto=0 '\000', >> thread_index=0) at /src/vpp/src/vnet/session/session.c:736 >> #12 0x00007ffff680c2d4 in tcp46_rcv_process_inline ( >> vm=0x7ffff6477b80 <vlib_global_main>, node=0x7fffb522a680, >> from_frame=0x7fffb5a05f80, is_ip4=1) at >> /src/vpp/src/vnet/tcp/tcp_input.c:2478 >> #13 0x00007ffff680b5ea in tcp4_rcv_process_node_fn_hsw >> >> (line numbers maybe slightly off due to debug print). 
>> If I further dump the connection elog for such sessions I get >> /usr/bin/vpp[39]: http_static_server_rx_tx_callback:1028: ZZZZZ: >> connection elog: >> 92.494240039: close: cidx 52 >> 92.494241379: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 92.494241431: state: fin-wait-1 >> 92.495088424: out: flags 11, bytes 32 >> 92.922939929: timer-pop: retransmit cidx 52 sidx 54 >> 92.922978305: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 92.922978440: rcv_stat: rto 344 srtt 80 mrtt-us 39371 rttvar 66 >> 92.922978491: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 262 >> 92.922978711: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 92.922980365: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 92.923492112: out: flags 11, bytes 32 >> 93.711846499: timer-pop: retransmit cidx 52 sidx 54 >> 93.735094287: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 93.735094392: rcv_stat: rto 688 srtt 80 mrtt-us 39371 rttvar 66 >> 93.735094448: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 1 >> 93.735094532: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 93.735095835: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 93.735582660: out: flags 11, bytes 32 >> 94.568121707: timer-pop: retransmit cidx 52 sidx 54 >> 94.584413829: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 94.584414153: rcv_stat: rto 1376 srtt 80 mrtt-us 39371 rttvar 66 >> 94.584414259: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 8192 >> 94.584414333: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 94.584417424: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 94.585024738: out: flags 11, bytes 32 >> 95.999473091: timer-pop: retransmit cidx 52 sidx 54 >> 96.088472363: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 96.088472647: rcv_stat: rto 2752 srtt 80 mrtt-us 39371 rttvar 66 >> 96.088472758: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 8192 >> 96.088472842: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 96.088475171: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 96.126505063: out: flags 11, bytes 32 >> 98.861064635: timer-pop: retransmit cidx 52 sidx 54 >> 98.898809592: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 98.898809726: rcv_stat: rto 5504 srtt 80 mrtt-us 39371 rttvar 66 >> 98.898809768: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 8192 >> 98.898809828: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 98.898811259: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 98.899350729: out: flags 11, bytes 32 >> 104.181435961: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 104.181436051: rcv_stat: rto 11008 srtt 80 mrtt-us 39371 rttvar 66 >> 104.181436102: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 55986 >> 104.181436602: ack-rx: 1 snd_una 1574 snd_wnd 64128 cwnd 2234 >> inflight 0 >> 104.181437352: state: fin-wait-2 >> 104.181438642: in: order len 39 written 39 rcv_nxt 274 rcv_wnd(o) >> 8192 >> 104.181439438: fin-rx: snd_nxt 1574 rcv_nxt 274 >> 104.181439499: state: time-wait >> >> which, along with some added debug print, indicates >> that http_static_server_rx_callback() on a connection that was already >> closed and for which http_static_server_session_disconnect_callback() was >> already called. >> Looks like some of the enqueued data is pushed >> via session_main_flush_enqueue_events() after the disconnect callback. 
At >> this point, I'm not sure how that should be fixed, but probably it should >> be as invoking rx callback after disconnect one doesn't seem to be the >> expected behaviour. >> >> I also noticed another problem with uninitialized elog track with TCP >> debugging enabled for which I submitted a fix (Gerrit worked for me this >> time), basically that TCP_EVT (TCP_EVT_SYN_RCVD, child, 1); that was moved >> in one of previous fixes must be moved even further up, as in case of >> failed session_stream_accept() another elog crash will happen >> in tcp_connection_cleanup() if the elog track is not initialized. >> The patch is here: https://gerrit.fd.io/r/c/vpp/+/28074 >> >> As of the timer crash in http_static in case of the release build, I'm a >> bit unsure but it looks like http_static_server_session_timer_stop() is not >> idempotent, as it doesn't reset the timer handle. If you think this is a >> correct guess, I can re-verify this patch and submit it. >> >> diff --git a/src/plugins/http_static/static_server.c >> b/src/plugins/http_static/static_server.c >> index 45d8731af..7772a9ec0 100644 >> --- a/src/plugins/http_static/static_server.c >> +++ b/src/plugins/http_static/static_server.c >> @@ -124,6 +124,7 @@ http_static_server_session_timer_stop (http_session_t >> * hs) >> return; >> clib_spinlock_lock (&http_static_server_main.tw_lock); >> tw_timer_stop_2t_1w_2048sl (&http_static_server_main.tw, >> hs->timer_handle); >> + hs->timer_handle = ~0; >> clib_spinlock_unlock (&http_static_server_main.tw_lock); >> } >> >> >> >> On Sat, Jul 25, 2020 at 12:13 AM Florin Coras <fcoras.li...@gmail.com> >> wrote: >> >>> Great! Thanks for confirming! >>> >>> Let me know how it goes with the static http server. >>> >>> Cheers, >>> Florin >>> >>> On Jul 24, 2020, at 2:00 PM, Ivan Shvedunov <ivan...@gmail.com> wrote: >>> >>> Hi Florin, >>> I re-verified the patches and the modified patch doesn't crash either, >>> so I think it's safe to merge it. >>> Thanks! >>> >>> I will try to see what is the remaining problem with http_static >>> >>> On Fri, Jul 24, 2020 at 8:15 PM Florin Coras <fcoras.li...@gmail.com> >>> wrote: >>> >>>> Hi Ivan, >>>> >>>> Adding Vanessa to see if she can help with the account issues. >>>> >>>> Thanks a lot for the patches! Pushed them here [1] and [2]. I took the >>>> liberty of slightly changing [2], so if you get a chance, do try it out >>>> again. >>>> >>>> Finally, the static http server still needs fixes. Most probably it >>>> mishandles cleanups and corrupts the sessions/connections under heavy load. >>>> >>>> Regards, >>>> Florin >>>> >>>> [1] https://gerrit.fd.io/r/c/vpp/+/28063 >>>> [2] https://gerrit.fd.io/r/c/vpp/+/28064 >>>> >>>> >>>> On Jul 24, 2020, at 9:27 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: >>>> >>>> I did a bit more debugging and found an issue that was causing >>>> invalid TCP connection lookups. >>>> Basically, if session_connected_callback was failing for an app (in >>>> case of proxy, e.g. b/c the other corresponding connection got closed), it >>>> was leaving an invalid entry in the session lookup table. >>>> Another less serious issue that I found while debugging this issue is >>>> that when TCP elog is enabled in tcp_debug.h, the TCP stack crashes upon >>>> accepting connections because of misplaced TCP_EVT. >>>> After fixing these issues, my test no longer crashes with the proxy >>>> example. That being said, http_static plugin still crashes with the timer >>>> error. 
>>>> >>>> I tried to submit both patches to fd.io Gerrit, but as I didn't submit >>>> any CRs to this Gerrit before, I ran into a problem with my old email in >>>> Linux Foundation account settings. I've fixed that and then updated the >>>> email address in fd.io Gerrit too, but still getting the same error: >>>> remote: ERROR: commit e98791a: email address ivan...@gmail.com is not >>>> registered in your account, and you lack 'forge committer' permission. >>>> >>>> Because of these issues I'm attaching both patches here for now. >>>> >>>> On Thu, Jul 23, 2020 at 6:47 PM Florin Coras <fcoras.li...@gmail.com> >>>> wrote: >>>> >>>>> Ah, I didn’t try running test.sh 80. The only difference in how I’m >>>>> running the test is that I start vpp outside of start.sh straight from >>>>> binaries. >>>>> >>>>> Regards, >>>>> Florin >>>>> >>>>> On Jul 23, 2020, at 8:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: >>>>> >>>>> Well, I always run the same test, the difference being only >>>>> "test.sh 80" for http_static (it's configured to be listening on that >>>>> port) >>>>> or just "test.sh" for the proxy. As far as I understand, you run the >>>>> tests without using the containers, does that include setting up netem >>>>> like >>>>> this [1] ? >>>>> >>>>> [1] >>>>> https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35 >>>>> >>>>> On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi Ivan, >>>>>> >>>>>> Updated [1] but I’m not seeing [3] after several test iterations. >>>>>> >>>>>> Probably the static server needs the same treatment as the proxy. Are >>>>>> you running a slightly different test? All of the builtin apps have the >>>>>> potential to crash vpp or leave the host stack in an unwanted state since >>>>>> they run inline. >>>>>> >>>>>> Either way, to solve this, first step would be to get rid of error >>>>>> like, “no http session for thread 0 session_index x”. Will eventually try >>>>>> to look into it if nobody beats me to it. >>>>>> >>>>>> Regards, >>>>>> Florin >>>>>> >>>>>> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> http_static produces some errors: >>>>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http >>>>>> session for thread 0 session_index 4124 >>>>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http >>>>>> session for thread 0 session_index 4124 >>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp >>>>>> error state CLOSE_WAIT flags 0x02 SYN >>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp >>>>>> error state CLOSE_WAIT flags 0x02 SYN >>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp >>>>>> error state CLOSE_WAIT flags 0x02 SYN >>>>>> >>>>>> also with multiple different CP connection states related to >>>>>> connections being closed and receiving SYN / SYN+ACK. >>>>>> The release build crashes (did already happen before, so it's >>>>>> unrelated to any of the fixes [1]): >>>>>> >>>>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response >>>>>> header! >>>>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response >>>>>> header! >>>>>> >>>>>> Program received signal SIGSEGV, Segmentation fault. >>>>>> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, >>>>>> elt=<optimized out>) >>>>>> at /src/vpp/src/vppinfra/tw_timer_template.c:154 >>>>>> 154 /src/vpp/src/vppinfra/tw_timer_template.c: No such file or >>>>>> directory. 
>>>>>> #0 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, >>>>>> elt=<optimized out>) at >>>>>> /src/vpp/src/vppinfra/tw_timer_template.c:154 >>>>>> #1 tw_timer_stop_2t_1w_2048sl ( >>>>>> tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306) >>>>>> at /src/vpp/src/vppinfra/tw_timer_template.c:374 >>>>>> ---Type <return> to continue, or q <return> to quit--- >>>>>> #2 0x00007fffb076146f in http_static_server_session_timer_stop >>>>>> (hs=<optimized out>) >>>>>> at /src/vpp/src/plugins/http_static/static_server.c:126 >>>>>> #3 http_static_server_rx_tx_callback (s=0x7fffb5e13a40, >>>>>> cf=CALLED_FROM_RX) >>>>>> at /src/vpp/src/plugins/http_static/static_server.c:1026 >>>>>> #4 0x00007fffb0760eb8 in http_static_server_rx_callback ( >>>>>> s=0x7fffb0967728 <http_static_server_main+288>) >>>>>> at /src/vpp/src/plugins/http_static/static_server.c:1037 >>>>>> #5 0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized >>>>>> out>, s=0x7fffb5e13a40) >>>>>> at /src/vpp/src/vnet/session/application_worker.c:485 >>>>>> #6 app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) >>>>>> at /src/vpp/src/vnet/session/application_worker.c:691 >>>>>> #7 0x00007ffff7713d9a in session_enqueue_notify_inline >>>>>> (s=0x7fffb5e13a40) >>>>>> at /src/vpp/src/vnet/session/session.c:632 >>>>>> #8 0x00007ffff7713fd1 in session_main_flush_enqueue_events >>>>>> (transport_proto=<optimized out>, >>>>>> thread_index=0) at /src/vpp/src/vnet/session/session.c:736 >>>>>> #9 0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, >>>>>> node=<optimized out>, frame=<optimized out>, is_ip4=1) at >>>>>> /src/vpp/src/vnet/tcp/tcp_input.c:1558 >>>>>> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, node=<optimized out>, >>>>>> from_frame=0x7fffb5458480) at >>>>>> /src/vpp/src/vnet/tcp/tcp_input.c:1573 >>>>>> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, node=0x7fffb4baf400, >>>>>> type=VLIB_NODE_TYPE_INTERNAL, >>>>>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>, >>>>>> last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194 >>>>>> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, >>>>>> pending_frame_index=<optimized out>, >>>>>> last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353 >>>>>> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at >>>>>> /src/vpp/src/vlib/main.c:1848 >>>>>> #14 vlib_main_loop (vm=<optimized out>) at >>>>>> /src/vpp/src/vlib/main.c:1976 >>>>>> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, input=0x7fffb4762fb0) >>>>>> at /src/vpp/src/vlib/main.c:2222 >>>>>> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at >>>>>> /src/vpp/src/vlib/unix/main.c:660 >>>>>> #17 0x00007ffff4fa9ec4 in clib_calljmp () from >>>>>> /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09 >>>>>> #18 0x00007fffffffd8b0 in ?? () >>>>>> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, >>>>>> argv=<optimized out>) >>>>>> at /src/vpp/src/vlib/unix/main.c:733 >>>>>> >>>>>> [1] >>>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log >>>>>> >>>>>> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io >>>>>> <ivan4th=gmail....@lists.fd.io> wrote: >>>>>> >>>>>>> Hi, >>>>>>> I've found a problem with the timer fix and commented in Gerrit [1] >>>>>>> accordingly. 
>>>>>>> Basically this change [2] makes the tcp_prepare_retransmit_segment() >>>>>>> issue go away for me. >>>>>>> >>>>>>> Concerning the proxy example, I can no longer see the SVM FIFO >>>>>>> crashes, but when using debug build, VPP crashes with this error (full >>>>>>> log >>>>>>> [3]) during my test: >>>>>>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 >>>>>>> (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], >>>>>>> tcp_buffer_hdr (b[1]))' fails >>>>>>> >>>>>>> When using release build, it produces a lot of messages like this >>>>>>> instead: >>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 >>>>>>> disp error state CLOSE_WAIT flags 0x02 SYN >>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp >>>>>>> error state FIN_WAIT_2 flags 0x12 SYN ACK >>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 >>>>>>> disp error state TIME_WAIT flags 0x12 SYN ACK >>>>>>> >>>>>>> and also >>>>>>> >>>>>>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection >>>>>>> 85557 failed! >>>>>>> >>>>>>> [1] >>>>>>> https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39 >>>>>>> [2] >>>>>>> https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca >>>>>>> [3] >>>>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71 >>>>>>> >>>>>>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce >>>>>>> something like my test there, thanks! >>>>>>> And waiting for Dave's input concerning the CSIT part, too, of >>>>>>> course. >>>>>>> >>>>>>> >>>>>>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Ivan, >>>>>>>> >>>>>>>> Thanks for the test. After modifying it a bit to run straight from >>>>>>>> binaries, I managed to repro the issue. As expected, the proxy is not >>>>>>>> cleaning up the sessions correctly (example apps do run out of sync >>>>>>>> ..). >>>>>>>> Here’s a quick patch that solves some of the obvious issues [1] (note >>>>>>>> that >>>>>>>> it’s chained with gerrit 27952). I didn’t do too much testing, so let >>>>>>>> me >>>>>>>> know if you hit some other problems. As far as I can tell, 27952 is >>>>>>>> needed. >>>>>>>> >>>>>>>> As for the CI, I guess there are two types of tests we might want >>>>>>>> (cc-ing Dave since he has experience with this): >>>>>>>> - functional test that could live as part of “make test” infra. The >>>>>>>> host stack already has some functional integration tests, i.e., the vcl >>>>>>>> tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We >>>>>>>> could >>>>>>>> do something similar for the proxy app, but the tests need to be >>>>>>>> lightweight as they’re run as part of the verify jobs >>>>>>>> - CSIT scale/performance tests. We could use something like your >>>>>>>> scripts to test the proxy but also ld_preload + nginx and other >>>>>>>> applications. Dave should have more opinions here :-) >>>>>>>> >>>>>>>> Regards, >>>>>>>> Florin >>>>>>>> >>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/28041 >>>>>>>> >>>>>>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Concerning the CI: I'd be glad to add that test to "make test", but >>>>>>>> not sure how to approach it. 
The test is not about containers but more >>>>>>>> about using network namespaces and some tools like wrk to create a lot >>>>>>>> of >>>>>>>> TCP connections to do some "stress testing" of VPP host stack (and as >>>>>>>> it >>>>>>>> was noted, it fails not only on the proxy example, but also on >>>>>>>> http_static >>>>>>>> plugin). It's probably doable w/o any external tooling at all, and even >>>>>>>> without the network namespaces either, using only VPP's own TCP stack, >>>>>>>> but >>>>>>>> that is probably rather hard. Could you suggest some ideas how it >>>>>>>> could be >>>>>>>> added to "make test"? Should I add a `test_....py` under `tests/` that >>>>>>>> creates host interfaces in VPP and uses these via OS networking >>>>>>>> instead of >>>>>>>> the packet generator? As far as I can see there's something like that >>>>>>>> in >>>>>>>> srv6-mobile plugin [1]. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125 >>>>>>>> >>>>>>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras < >>>>>>>> fcoras.li...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I missed the point about the CI in my other reply. If we can >>>>>>>>> somehow integrate some container based tests into the “make test” >>>>>>>>> infra, I >>>>>>>>> wouldn’t mind at all! :-) >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Florin >>>>>>>>> >>>>>>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> sadly the patch apparently didn't work. It should have worked but >>>>>>>>> for some reason it didn't ... >>>>>>>>> >>>>>>>>> On the bright side, I've made a test case [1] using fresh upstream >>>>>>>>> VPP code with no UPF that reproduces the issues I mentioned, >>>>>>>>> including both >>>>>>>>> timer and TCP retransmit one along with some other possible problems >>>>>>>>> using >>>>>>>>> http_static plugin and the proxy example, along with nginx (with >>>>>>>>> proxy) and >>>>>>>>> wrk. >>>>>>>>> >>>>>>>>> It is docker-based, but the main scripts (start.sh and test.sh) >>>>>>>>> can be used without Docker, too. >>>>>>>>> I've used our own Dockerfiles to build the images, but I'm not >>>>>>>>> sure if that makes any difference. >>>>>>>>> I've added some log files resulting from the runs that crashed in >>>>>>>>> different places. For me, the tests crash on each run, but in >>>>>>>>> different >>>>>>>>> places. >>>>>>>>> >>>>>>>>> The TCP retransmit problem happens with http_static when using the >>>>>>>>> debug build. When using release build, some unrelated crash >>>>>>>>> in timer_remove() happens instead. >>>>>>>>> The SVM FIFO crash happens when using the proxy. It can happen >>>>>>>>> with both release and debug builds. >>>>>>>>> >>>>>>>>> Please see the repo [1] for details and crash logs. >>>>>>>>> >>>>>>>>> [1] https://github.com/ivan4th/vpp-tcp-test >>>>>>>>> >>>>>>>>> P.S. As the tests do expose some problems with VPP host stack and >>>>>>>>> some of the VPP plugins/examples, maybe we should consider adding >>>>>>>>> them to >>>>>>>>> the VPP CI, too? >>>>>>>>> >>>>>>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras < >>>>>>>>> fcoras.li...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Ivan, >>>>>>>>>> >>>>>>>>>> Thanks for the detailed report! >>>>>>>>>> >>>>>>>>>> I assume this is a situation where most of the connections time >>>>>>>>>> out and the rate limiting we apply on the pending timer queue delays >>>>>>>>>> handling for long enough to be in a situation like the one you >>>>>>>>>> described. 
>>>>>>>>>> Here’s a draft patch that starts tracking pending timers [1]. Let me >>>>>>>>>> know >>>>>>>>>> if it solves the first problem. >>>>>>>>>> >>>>>>>>>> Regarding the second, it looks like the first chunk in the fifo >>>>>>>>>> is not properly initialized/corrupted. It’s hard to tell what leads >>>>>>>>>> to that >>>>>>>>>> given that I haven’t seen this sort of issues even with larger >>>>>>>>>> number of >>>>>>>>>> connections. You could maybe try calling svm_fifo_is_sane() in the >>>>>>>>>> enqueue/dequeue functions, or after the proxy allocates/shares the >>>>>>>>>> fifos to >>>>>>>>>> catch the issue as early as possible. >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Florin >>>>>>>>>> >>>>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952 >>>>>>>>>> >>>>>>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> I'm working on the Travelping UPF project >>>>>>>>>> https://github.com/travelping/vpp >>>>>>>>>> <https://github.com/travelping/vpp>For variety of reasons, it's >>>>>>>>>> presently maintained as a fork of UPF that's rebased on top of >>>>>>>>>> upstream >>>>>>>>>> master from time to time, but really it's just a plugin. During 40K >>>>>>>>>> TCP >>>>>>>>>> connection test with netem, I found an issue with TCP timer race >>>>>>>>>> (timers >>>>>>>>>> firing after tcp_timer_reset() was called for them) which I tried to >>>>>>>>>> work >>>>>>>>>> around only to stumble into another crash, which I'm presently >>>>>>>>>> debugging >>>>>>>>>> (an SVM FIFO bug, possibly) but maybe some of you folks have some >>>>>>>>>> ideas >>>>>>>>>> what it could be. >>>>>>>>>> I've described my findings in this JIRA ticket: >>>>>>>>>> https://jira.fd.io/browse/VPP-1923 >>>>>>>>>> Although the last upstream commit UPF is presently based on >>>>>>>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) >>>>>>>>>> was some time ago, I believe the problems are still relevant as >>>>>>>>>> there were no changes in these parts of code in master since that >>>>>>>>>> commit. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>>>>> 0BC5 2807 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>>>> 0BC5 2807 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>>> 0BC5 2807 >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>> 0BC5 2807 >>>>>> >>>>>> >>>>>> >>>>> >>>>> -- >>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>> 0BC5 2807 >>>>> >>>>> >>>>> >>>> >>>> -- >>>> Ivan Shvedunov <ivan...@gmail.com> >>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 >>>> 2807 >>>> <0002-Fix-initializing-elog-track-for-accepted-TCP-connect.patch> >>>> <0001-Fix-bad-session-lookups-after-failed-session_connect.patch> >>>> >>>> >>>> >>> >>> -- >>> Ivan Shvedunov <ivan...@gmail.com> >>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 >>> 2807 >>> >>> >>> >> >> -- >> Ivan Shvedunov <ivan...@gmail.com> >> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 >> 2807 >> >> >> > > -- > Ivan Shvedunov <ivan...@gmail.com> > ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 > 2807 > > > -- Ivan Shvedunov <ivan...@gmail.com> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807