Hi Florin,
thanks, the fix has worked and http_static no longer crashes.

I still get a number of messages like this when using the release build:
/usr/bin/vpp[39]: state_sent_ok:954: BUG: couldn't send response header!
I'm not sure whether it's actually a bug or just the queue being full because of
the packet loss introduced by netem.

There are also numerous messages like
/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2817: tcp conn 13836 disp error
state CLOSE_WAIT flags 0x02 SYN
but I guess that's caused by a large number of connections from the same
client (hence port reuse) and, again, the packet loss.

I've spotted another crash in the proxy example for which I've submitted a
patch:
https://gerrit.fd.io/r/c/vpp/+/28085

One problem with this fix is that the proxy example is actually incomplete:
the client connection should be dropped if the proxy can't
establish the corresponding outgoing connection (e.g. timeout or RST because of
"connection refused"), whereas currently the client-side connection just
hangs instead. I'll take a stab at fixing that, as our proxy in the UPF also has
problems with handling similar situations and it would be easier for me to
fix it in this simpler example first. But, in any case, crashing VPP is
probably much worse than just letting the connection hang, so maybe we
should remove that ASSERT after all.
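To make the intended behaviour concrete, here is a minimal standalone sketch in plain C (not VPP code; `proxy_pair_t` and `proxy_connected_callback` are invented names for illustration) of the state handling the example is missing: when the active-open connect fails, the client-facing session must be closed too instead of being left hanging.

```c
#include <assert.h>

/* Toy model of one proxied connection pair (hypothetical, not the VPP API). */
typedef enum { PS_IDLE, PS_CONNECTING, PS_ESTABLISHED, PS_CLOSED } proxy_side_state_t;

typedef struct
{
  proxy_side_state_t client; /* passive, client-facing session */
  proxy_side_state_t server; /* active-open, server-facing session */
} proxy_pair_t;

/* Called when the outgoing connect completes; err != 0 models a timeout or
 * RST. On failure, the client side is closed as well, rather than leaving it
 * hanging as the current example does. */
static void
proxy_connected_callback (proxy_pair_t *p, int err)
{
  if (err)
    {
      p->server = PS_CLOSED;
      p->client = PS_CLOSED; /* drop the client connection too */
      return;
    }
  p->server = PS_ESTABLISHED;
}
```

This only models the bookkeeping; in the real example the close would go through the session-layer disconnect path rather than a state assignment.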


On Mon, Jul 27, 2020 at 10:59 PM Florin Coras <fcoras.li...@gmail.com>
wrote:

> Hi Ivan,
>
> Took a look at the static http server and, as far as I can tell, it has
> the same type of issue the proxy had, i.e., premature session
> cleanup/reuse. Does this solve the problem for you [1]?
>
> Also, merged your elog fix patch. Thanks!
>
> Regards,
> Florin
>
> [1]  https://gerrit.fd.io/r/c/vpp/+/28076
>
> On Jul 27, 2020, at 10:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> Hi.
> I've debugged the http server issue a bit more and here are my observations:
> if I add an ASSERT(0) in the place of "No http session for thread 0
> session_index 54",
> I get a stack trace along the lines of:
> Program received signal SIGABRT, Aborted.
> 0x00007ffff470bf47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x00007ffff470bf47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007ffff470d8b1 in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x0000000000407193 in os_panic () at /src/vpp/src/vpp/vnet/main.c:371
> #3  0x00007ffff55ec619 in debugger () at /src/vpp/src/vppinfra/error.c:84
> #4  0x00007ffff55ec397 in _clib_error (how_to_die=2, function_name=0x0,
> line_number=0,
>     fmt=0x7fffb0e1324b "%s:%d (%s) assertion `%s' fails")
>     at /src/vpp/src/vppinfra/error.c:143
> #5  0x00007fffb0e0cc3d in http_static_server_rx_tx_callback
> (s=0x7fffb66447c0,
>     cf=CALLED_FROM_RX) at
> /src/vpp/src/plugins/http_static/static_server.c:1030
> #6  0x00007fffb0e0b437 in http_static_server_rx_callback (s=0x7fffb66447c0)
>     at /src/vpp/src/plugins/http_static/static_server.c:1056
> #7  0x00007ffff745a94e in app_worker_builtin_rx (app_wrk=0x7fffb59c7e00,
>     s=0x7fffb66447c0) at /src/vpp/src/vnet/session/application_worker.c:485
> #8  0x00007ffff745c00e in app_send_io_evt_rx (app_wrk=0x7fffb59c7e00,
> s=0x7fffb66447c0)
>     at /src/vpp/src/vnet/session/application_worker.c:691
> #9  0x00007ffff745b34e in app_worker_lock_and_send_event
> (app=0x7fffb59c7e00,
>     s=0x7fffb66447c0, evt_type=0 '\000')
>     at /src/vpp/src/vnet/session/application_worker.c:779
> #10 0x00007ffff73eda5e in session_enqueue_notify_inline (s=0x7fffb66447c0)
>     at /src/vpp/src/vnet/session/session.c:632
> #11 0x00007ffff73eddbf in session_main_flush_enqueue_events
> (transport_proto=0 '\000',
>     thread_index=0) at /src/vpp/src/vnet/session/session.c:736
> #12 0x00007ffff680c2d4 in tcp46_rcv_process_inline (
>     vm=0x7ffff6477b80 <vlib_global_main>, node=0x7fffb522a680,
>     from_frame=0x7fffb5a05f80, is_ip4=1) at
> /src/vpp/src/vnet/tcp/tcp_input.c:2478
> #13 0x00007ffff680b5ea in tcp4_rcv_process_node_fn_hsw
>
> (line numbers may be slightly off due to debug prints).
> If I further dump the connection elog for such sessions, I get:
> /usr/bin/vpp[39]: http_static_server_rx_tx_callback:1028: ZZZZZ:
> connection elog:
>        92.494240039: close: cidx 52
>        92.494241379: fin-tx: snd_nxt 1573 rcv_nxt 235
>        92.494241431: state: fin-wait-1
>        92.495088424: out: flags 11, bytes 32
>        92.922939929: timer-pop: retransmit cidx 52 sidx 54
>        92.922978305: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961
> snd_wnd 64128
>        92.922978440: rcv_stat: rto 344 srtt 80 mrtt-us 39371 rttvar 66
>        92.922978491: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 262
>        92.922978711: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0
> flight 1
>        92.922980365: fin-tx: snd_nxt 1573 rcv_nxt 235
>        92.923492112: out: flags 11, bytes 32
>        93.711846499: timer-pop: retransmit cidx 52 sidx 54
>        93.735094287: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961
> snd_wnd 64128
>        93.735094392: rcv_stat: rto 688 srtt 80 mrtt-us 39371 rttvar 66
>        93.735094448: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 1
>        93.735094532: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0
> flight 1
>        93.735095835: fin-tx: snd_nxt 1573 rcv_nxt 235
>        93.735582660: out: flags 11, bytes 32
>        94.568121707: timer-pop: retransmit cidx 52 sidx 54
>        94.584413829: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961
> snd_wnd 64128
>        94.584414153: rcv_stat: rto 1376 srtt 80 mrtt-us 39371 rttvar 66
>        94.584414259: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 8192
>        94.584414333: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0
> flight 1
>        94.584417424: fin-tx: snd_nxt 1573 rcv_nxt 235
>        94.585024738: out: flags 11, bytes 32
>        95.999473091: timer-pop: retransmit cidx 52 sidx 54
>        96.088472363: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961
> snd_wnd 64128
>        96.088472647: rcv_stat: rto 2752 srtt 80 mrtt-us 39371 rttvar 66
>        96.088472758: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 8192
>        96.088472842: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0
> flight 1
>        96.088475171: fin-tx: snd_nxt 1573 rcv_nxt 235
>        96.126505063: out: flags 11, bytes 32
>        98.861064635: timer-pop: retransmit cidx 52 sidx 54
>        98.898809592: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961
> snd_wnd 64128
>        98.898809726: rcv_stat: rto 5504 srtt 80 mrtt-us 39371 rttvar 66
>        98.898809768: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 8192
>        98.898809828: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0
> flight 1
>        98.898811259: fin-tx: snd_nxt 1573 rcv_nxt 235
>        98.899350729: out: flags 11, bytes 32
>       104.181435961: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961
> snd_wnd 64128
>       104.181436051: rcv_stat: rto 11008 srtt 80 mrtt-us 39371 rttvar 66
>       104.181436102: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt
> 55986
>       104.181436602: ack-rx: 1 snd_una 1574 snd_wnd 64128 cwnd 2234
> inflight 0
>       104.181437352: state: fin-wait-2
>       104.181438642: in: order len 39 written 39 rcv_nxt 274 rcv_wnd(o)
> 8192
>       104.181439438: fin-rx: snd_nxt 1574 rcv_nxt 274
>       104.181439499: state: time-wait
>
> which, along with some added debug prints, indicates
> that http_static_server_rx_callback() is invoked on a connection that was
> already closed and for which http_static_server_session_disconnect_callback()
> was already called.
> It looks like some of the enqueued data is pushed
> via session_main_flush_enqueue_events() after the disconnect callback. At
> this point, I'm not sure how that should be fixed, but it probably should
> be, as invoking the rx callback after the disconnect callback doesn't seem
> to be the expected behaviour.
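For what it's worth, that failure mode is easy to model outside VPP. In this toy sketch (plain C; the table and function names are invented, not the plugin's real ones), the disconnect callback frees the per-session lookup entry, so a later rx event must treat a missing entry as an error instead of asserting:

```c
#include <assert.h>

#define MAX_SESSIONS 8

/* Toy per-thread lookup: transport session index -> http session.
 * A nonzero entry is a live http session; 0 means "no http session". */
static int http_session_by_index[MAX_SESSIONS];

static void
disconnect_callback (int session_index)
{
  /* Frees the http session; the lookup entry must go away with it. */
  http_session_by_index[session_index] = 0;
}

/* Returns 0 on success, -1 when the http session is already gone. The real
 * code reached this path when session_main_flush_enqueue_events() flushed
 * data after the disconnect callback had already run. */
static int
rx_callback (int session_index)
{
  if (!http_session_by_index[session_index])
    return -1; /* "No http session for thread 0 session_index N" */
  return 0;
}
```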
>
> I also noticed another problem, an uninitialized elog track when TCP
> debugging is enabled, for which I submitted a fix (Gerrit worked for me this
> time). Basically, the TCP_EVT (TCP_EVT_SYN_RCVD, child, 1); call that was
> moved in one of the previous fixes must be moved even further up: if
> session_stream_accept() fails, another elog crash will happen
> in tcp_connection_cleanup() because the elog track is not initialized.
> The patch is here: https://gerrit.fd.io/r/c/vpp/+/28074
>
> As for the timer crash in http_static with the release build, I'm a
> bit unsure, but it looks like http_static_server_session_timer_stop() is not
> idempotent, as it doesn't reset the timer handle. If you think this is a
> correct guess, I can re-verify this patch and submit it.
>
> diff --git a/src/plugins/http_static/static_server.c
> b/src/plugins/http_static/static_server.c
> index 45d8731af..7772a9ec0 100644
> --- a/src/plugins/http_static/static_server.c
> +++ b/src/plugins/http_static/static_server.c
> @@ -124,6 +124,7 @@ http_static_server_session_timer_stop (http_session_t
> * hs)
>      return;
>    clib_spinlock_lock (&http_static_server_main.tw_lock);
>    tw_timer_stop_2t_1w_2048sl (&http_static_server_main.tw,
> hs->timer_handle);
> +  hs->timer_handle = ~0;
>    clib_spinlock_unlock (&http_static_server_main.tw_lock);
>  }
>
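The suspected non-idempotency can be reproduced in a standalone model (plain C, invented names; the real function already has a handle guard like the one below): without resetting the handle after stopping, a second stop call passes the guard and hands a stale handle to the timer wheel.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_HANDLE_INVALID (~(uint32_t) 0)

/* Toy stand-in for http_session_t's timer bookkeeping. */
typedef struct
{
  uint32_t timer_handle; /* TIMER_HANDLE_INVALID when no timer is armed */
  int wheel_stops;       /* counts stops forwarded to the timer wheel */
} hs_model_t;

static void
timer_stop_model (hs_model_t *hs)
{
  if (hs->timer_handle == TIMER_HANDLE_INVALID)
    return;            /* already stopped: nothing to do */
  hs->wheel_stops++;   /* stands in for tw_timer_stop_2t_1w_2048sl() */
  hs->timer_handle = TIMER_HANDLE_INVALID; /* the one-line fix from the diff */
}
```

Without the last assignment, a second call would increment wheel_stops again with a stale handle, which is the double-stop the release-build crash suggests.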
>
>
> On Sat, Jul 25, 2020 at 12:13 AM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Great! Thanks for confirming!
>>
>> Let me know how it goes with the static http server.
>>
>> Cheers,
>> Florin
>>
>> On Jul 24, 2020, at 2:00 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Hi Florin,
>> I re-verified the patches and the modified patch doesn't crash either, so
>> I think it's safe to merge it.
>> Thanks!
>>
>> I will try to see what is the remaining problem with http_static
>>
>> On Fri, Jul 24, 2020 at 8:15 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>
>>> Adding Vanessa to see if she can help with the account issues.
>>>
>>> Thanks a lot for the patches! Pushed them here [1] and [2]. I took the
>>> liberty of slightly changing [2], so if you get a chance, do try it out
>>> again.
>>>
>>> Finally, the static http server still needs fixes. Most probably it
>>> mishandles cleanups and corrupts the sessions/connections under heavy load.
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/28063
>>> [2] https://gerrit.fd.io/r/c/vpp/+/28064
>>>
>>>
>>> On Jul 24, 2020, at 9:27 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>
>>> I did a bit more debugging and found an issue that was causing
>>> invalid TCP connection lookups.
>>> Basically, if session_connected_callback was failing for an app (in case
>>> of proxy, e.g. b/c the other corresponding connection got closed), it was
>>> leaving an invalid entry in the session lookup table.
>>> Another, less serious issue that I found while debugging is
>>> that when TCP elog is enabled in tcp_debug.h, the TCP stack crashes upon
>>> accepting connections because of a misplaced TCP_EVT.
>>> After fixing these issues, my test no longer crashes with the proxy
>>> example. That being said, http_static plugin still crashes with the timer
>>> error.
>>>
>>> I tried to submit both patches to fd.io Gerrit, but as I didn't submit
>>> any CRs to this Gerrit before, I ran into a problem with my old email in
>>> Linux Foundation account settings. I've fixed that and then updated the
>>> email address in fd.io Gerrit too, but I'm still getting the same error:
>>> remote: ERROR: commit e98791a: email address ivan...@gmail.com is not
>>> registered in your account, and you lack 'forge committer' permission.
>>>
>>> Because of these issues I'm attaching both patches here for now.
>>>
>>> On Thu, Jul 23, 2020 at 6:47 PM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Ah, I didn’t try running test.sh 80. The only difference in how I’m
>>>> running the test is that I start vpp outside of start.sh straight from
>>>> binaries.
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> On Jul 23, 2020, at 8:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>>
>>>> Well, I always run the same test, the difference being only
>>>> "test.sh 80" for http_static (it's configured to listen on that
>>>> port)
>>>> or just "test.sh" for the proxy. As far as I understand, you run the
>>>> tests without using the containers, does that include setting up netem like
>>>> this [1] ?
>>>>
>>>> [1]
>>>> https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35
>>>>
>>>> On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> Updated [1] but I’m not seeing [3] after several test iterations.
>>>>>
>>>>> Probably the static server needs the same treatment as the proxy. Are
>>>>> you running a slightly different test? All of the builtin apps have the
>>>>> potential to crash vpp or leave the host stack in an unwanted state since
>>>>> they run inline.
>>>>>
>>>>> Either way, to solve this, the first step would be to get rid of errors
>>>>> like “no http session for thread 0 session_index x”. I will eventually try
>>>>> to look into it if nobody beats me to it.
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>>>
>>>>> http_static produces some errors:
>>>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http
>>>>> session for thread 0 session_index 4124
>>>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http
>>>>> session for thread 0 session_index 4124
>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp
>>>>> error state CLOSE_WAIT flags 0x02 SYN
>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp
>>>>> error state CLOSE_WAIT flags 0x02 SYN
>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp
>>>>> error state CLOSE_WAIT flags 0x02 SYN
>>>>>
>>>>> along with similar messages for multiple different TCP connection states,
>>>>> related to connections being closed and receiving SYN / SYN+ACK.
>>>>> The release build crashes (this already happened before, so it's
>>>>> unrelated to any of the fixes [1]):
>>>>>
>>>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response
>>>>> header!
>>>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response
>>>>> header!
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828,
>>>>> elt=<optimized out>)
>>>>>     at /src/vpp/src/vppinfra/tw_timer_template.c:154
>>>>> 154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or
>>>>> directory.
>>>>> #0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828,
>>>>>     elt=<optimized out>) at
>>>>> /src/vpp/src/vppinfra/tw_timer_template.c:154
>>>>> #1  tw_timer_stop_2t_1w_2048sl (
>>>>>     tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306)
>>>>>     at /src/vpp/src/vppinfra/tw_timer_template.c:374
>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>> #2  0x00007fffb076146f in http_static_server_session_timer_stop
>>>>> (hs=<optimized out>)
>>>>>     at /src/vpp/src/plugins/http_static/static_server.c:126
>>>>> #3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40,
>>>>> cf=CALLED_FROM_RX)
>>>>>     at /src/vpp/src/plugins/http_static/static_server.c:1026
>>>>> #4  0x00007fffb0760eb8 in http_static_server_rx_callback (
>>>>>     s=0x7fffb0967728 <http_static_server_main+288>)
>>>>>     at /src/vpp/src/plugins/http_static/static_server.c:1037
>>>>> #5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized
>>>>> out>, s=0x7fffb5e13a40)
>>>>>     at /src/vpp/src/vnet/session/application_worker.c:485
>>>>> #6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40)
>>>>>     at /src/vpp/src/vnet/session/application_worker.c:691
>>>>> #7  0x00007ffff7713d9a in session_enqueue_notify_inline
>>>>> (s=0x7fffb5e13a40)
>>>>>     at /src/vpp/src/vnet/session/session.c:632
>>>>> #8  0x00007ffff7713fd1 in session_main_flush_enqueue_events
>>>>> (transport_proto=<optimized out>,
>>>>>     thread_index=0) at /src/vpp/src/vnet/session/session.c:736
>>>>> #9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0
>>>>> <vlib_global_main>,
>>>>>     node=<optimized out>, frame=<optimized out>, is_ip4=1) at
>>>>> /src/vpp/src/vnet/tcp/tcp_input.c:1558
>>>>> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0
>>>>> <vlib_global_main>, node=<optimized out>,
>>>>>     from_frame=0x7fffb5458480) at
>>>>> /src/vpp/src/vnet/tcp/tcp_input.c:1573
>>>>> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0
>>>>> <vlib_global_main>, node=0x7fffb4baf400,
>>>>>     type=VLIB_NODE_TYPE_INTERNAL,
>>>>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>,
>>>>>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
>>>>> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>,
>>>>> pending_frame_index=<optimized out>,
>>>>>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
>>>>> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at
>>>>> /src/vpp/src/vlib/main.c:1848
>>>>> #14 vlib_main_loop (vm=<optimized out>) at
>>>>> /src/vpp/src/vlib/main.c:1976
>>>>> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0
>>>>> <vlib_global_main>, input=0x7fffb4762fb0)
>>>>>     at /src/vpp/src/vlib/main.c:2222
>>>>> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at
>>>>> /src/vpp/src/vlib/unix/main.c:660
>>>>> #17 0x00007ffff4fa9ec4 in clib_calljmp () from
>>>>> /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
>>>>> #18 0x00007fffffffd8b0 in ?? ()
>>>>> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>,
>>>>> argv=<optimized out>)
>>>>>     at /src/vpp/src/vlib/unix/main.c:733
>>>>>
>>>>> [1]
>>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log
>>>>>
>>>>> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io
>>>>> <ivan4th=gmail....@lists.fd.io> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I've found a problem with the timer fix and commented in Gerrit [1]
>>>>>> accordingly.
>>>>>> Basically this change [2] makes the tcp_prepare_retransmit_segment()
>>>>>> issue go away for me.
>>>>>>
>>>>>> Concerning the proxy example, I can no longer see the SVM FIFO
>>>>>> crashes, but when using debug build, VPP crashes with this error (full 
>>>>>> log
>>>>>> [3]) during my test:
>>>>>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857
>>>>>> (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1],
>>>>>> tcp_buffer_hdr (b[1]))' fails
>>>>>>
>>>>>> When using release build, it produces a lot of messages like this
>>>>>> instead:
>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp
>>>>>> error state CLOSE_WAIT flags 0x02 SYN
>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp
>>>>>> error state FIN_WAIT_2 flags 0x12 SYN ACK
>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp
>>>>>> error state TIME_WAIT flags 0x12 SYN ACK
>>>>>>
>>>>>> and also
>>>>>>
>>>>>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection
>>>>>> 85557 failed!
>>>>>>
>>>>>> [1]
>>>>>> https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
>>>>>> [2]
>>>>>> https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
>>>>>> [3]
>>>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>>>>>>
>>>>>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce
>>>>>> something like my test there, thanks!
>>>>>> And waiting for Dave's input concerning the CSIT part, too, of course.
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ivan,
>>>>>>>
>>>>>>> Thanks for the test. After modifying it a bit to run straight from
>>>>>>> binaries, I managed to repro the issue. As expected, the proxy is not
>>>>>>> cleaning up the sessions correctly (example apps do run out of sync ..).
>>>>>>> Here’s a quick patch that solves some of the obvious issues [1] (note 
>>>>>>> that
>>>>>>> it’s chained with gerrit 27952). I didn’t do too much testing, so let me
>>>>>>> know if you hit some other problems. As far as I can tell, 27952 is 
>>>>>>> needed.
>>>>>>>
>>>>>>> As for the CI, I guess there are two types of tests we might want
>>>>>>> (cc-ing Dave since he has experience with this):
>>>>>>> - functional test that could live as part of “make test” infra. The
>>>>>>> host stack already has some functional integration tests, i.e., the vcl
>>>>>>> tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We 
>>>>>>> could
>>>>>>> do something   similar for the proxy app, but the tests need to be
>>>>>>> lightweight as they’re run as part of the verify jobs
>>>>>>> - CSIT scale/performance tests. We could use something like your
>>>>>>> scripts to test the proxy but also ld_preload + nginx and other
>>>>>>> applications. Dave should have more opinions here :-)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Florin
>>>>>>>
>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>>>>>>>
>>>>>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Concerning the CI: I'd be glad to add that test to "make test", but
>>>>>>> not sure how to approach it. The test is not about containers but more
>>>>>>> about using network namespaces and some tools like wrk to create a lot 
>>>>>>> of
>>>>>>> TCP connections to do some "stress testing" of VPP host stack (and as it
>>>>>>> was noted, it fails not only on the proxy example, but also on 
>>>>>>> http_static
>>>>>>> plugin). It's probably doable without any external tooling at all, and even
>>>>>>> without the network namespaces, using only VPP's own TCP stack,
>>>>>>> but
>>>>>>> that would be rather hard. Could you suggest some ideas for how it could
>>>>>>> be
>>>>>>> added to "make test"? Should I add a `test_....py` under `tests/` that
>>>>>>> creates host interfaces in VPP and uses these via OS networking instead 
>>>>>>> of
>>>>>>> the packet generator? As far as I can see there's something like that in
>>>>>>> srv6-mobile plugin [1].
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>>>>>>>
>>>>>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I missed the point about the CI in my other reply. If we can
>>>>>>>> somehow integrate some container based tests into the “make test” 
>>>>>>>> infra, I
>>>>>>>> wouldn’t mind at all! :-)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Florin
>>>>>>>>
>>>>>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> sadly the patch apparently didn't work. It should have worked but
>>>>>>>> for some reason it didn't ...
>>>>>>>>
>>>>>>>> On the bright side, I've made a test case [1] using fresh upstream
>>>>>>>> VPP code with no UPF that reproduces the issues I mentioned, including 
>>>>>>>> both
>>>>>>>> timer and TCP retransmit one along with some other possible problems 
>>>>>>>> using
>>>>>>>> http_static plugin and the proxy example, along with nginx (with 
>>>>>>>> proxy) and
>>>>>>>> wrk.
>>>>>>>>
>>>>>>>> It is docker-based, but the main scripts (start.sh and test.sh) can
>>>>>>>> be used without Docker, too.
>>>>>>>> I've used our own Dockerfiles to build the images, but I'm not sure
>>>>>>>> if that makes any difference.
>>>>>>>> I've added some log files resulting from the runs that crashed in
>>>>>>>> different places. For me, the tests crash on each run, but in different
>>>>>>>> places.
>>>>>>>>
>>>>>>>> The TCP retransmit problem happens with http_static when using the
>>>>>>>> debug build. When using release build, some unrelated crash
>>>>>>>> in timer_remove() happens instead.
>>>>>>>> The SVM FIFO crash happens when using the proxy. It can happen with
>>>>>>>> both release and debug builds.
>>>>>>>>
>>>>>>>> Please see the repo [1] for details and crash logs.
>>>>>>>>
>>>>>>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>>>>>>
>>>>>>>> P.S. As the tests do expose some problems with VPP host stack and
>>>>>>>> some of the VPP plugins/examples, maybe we should consider adding them 
>>>>>>>> to
>>>>>>>> the VPP CI, too?
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <
>>>>>>>> fcoras.li...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Ivan,
>>>>>>>>>
>>>>>>>>> Thanks for the detailed report!
>>>>>>>>>
>>>>>>>>> I assume this is a situation where most of the connections time
>>>>>>>>> out and the rate limiting we apply on the pending timer queue delays
>>>>>>>>> handling for long enough to be in a situation like the one you 
>>>>>>>>> described.
>>>>>>>>> Here’s a draft patch that starts tracking pending timers [1]. Let me 
>>>>>>>>> know
>>>>>>>>> if it solves the first problem.
>>>>>>>>>
>>>>>>>>> Regarding the second, it looks like the first chunk in the fifo is
>>>>>>>>> either not properly initialized or corrupted. It’s hard to tell what
>>>>>>>>> leads to that, given that I haven’t seen this sort of issue even with
>>>>>>>>> larger numbers of
>>>>>>>>> connections. You could maybe try calling svm_fifo_is_sane() in the
>>>>>>>>> enqueue/dequeue functions, or after the proxy allocates/shares the 
>>>>>>>>> fifos to
>>>>>>>>> catch the issue as early as possible.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Florin
>>>>>>>>>
>>>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>>>>>>>
>>>>>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>   Hi,
>>>>>>>>>   I'm working on the Travelping UPF project
>>>>>>>>> https://github.com/travelping/vpp
>>>>>>>>> For a variety of reasons, it's
>>>>>>>>> presently maintained as a fork of VPP that's rebased on top of
>>>>>>>>> upstream
>>>>>>>>> master from time to time, but really it's just a plugin. During a 40K
>>>>>>>>> TCP
>>>>>>>>> connection test with netem, I found an issue with a TCP timer race
>>>>>>>>> (timers
>>>>>>>>> firing after tcp_timer_reset() was called for them) which I tried to 
>>>>>>>>> work
>>>>>>>>> around only to stumble into another crash, which I'm presently 
>>>>>>>>> debugging
>>>>>>>>> (an SVM FIFO bug, possibly) but maybe some of you folks have some 
>>>>>>>>> ideas
>>>>>>>>> what it could be.
>>>>>>>>>   I've described my findings in this JIRA ticket:
>>>>>>>>> https://jira.fd.io/browse/VPP-1923
>>>>>>>>>   Although the last upstream commit UPF is presently based on
>>>>>>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1)
>>>>>>>>> is from some time ago, I believe the problems are still relevant, as
>>>>>>>>> there have been no changes in these parts of the code in master since
>>>>>>>>> that commit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ivan Shvedunov <ivan...@gmail.com>
>>>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8
>>>>>>>> 0BC5 2807
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> 
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> <0002-Fix-initializing-elog-track-for-accepted-TCP-connect.patch>
>>> <0001-Fix-bad-session-lookups-after-failed-session_connect.patch>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>

-- 
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
View/Reply Online (#17099): https://lists.fd.io/g/vpp-dev/message/17099