Hi Ivan, 

I updated [1], but I'm not seeing the crash from [3] after several test iterations.

The static server probably needs the same treatment as the proxy. Are you
running a slightly different test? All of the built-in apps have the potential
to crash vpp or leave the host stack in an unwanted state, since they run
inline.
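
For context, "inline" means the app's callbacks execute directly on VPP's
workers, inside the session layer's dispatch path. A rough sketch of the shape
(session_cb_vft_t and builtin_app_rx_callback are the real names; the rest is
illustrative and registration is elided):

  /* Runs straight from the session layer on a VPP worker, cf.
   * app_worker_builtin_rx in the backtrace below. There is no process
   * boundary, so any bug here crashes vpp itself or corrupts host
   * stack state. */
  static int
  my_builtin_rx_callback (session_t *s)
  {
    /* read the request from s->rx_fifo, write the reply to s->tx_fifo */
    return 0;
  }

  static session_cb_vft_t my_cb_vft = {
    .builtin_app_rx_callback = my_builtin_rx_callback,
    /* accept/disconnect/reset callbacks omitted */
  };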

Either way, to solve this, the first step would be to get rid of errors like
"no http session for thread 0 session_index x". I will eventually try to look
into it if nobody beats me to it.
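
For reference, that error is the rx/tx callback noticing a stale session
index, i.e. a lookup guard roughly like the sketch below (hypothetical helper;
the per-thread sessions pool layout is an assumption, not the actual
static_server.c code). The real fix is to clean up sessions and pending events
in the right order so the guard never trips:

  /* Sketch: resolve a session index defensively so the caller can bail
   * out (and log) instead of touching a freed pool element. */
  static http_session_t *
  http_session_get_if_valid (u32 thread_index, u32 hs_index)
  {
    http_static_server_main_t *hsm = &http_static_server_main;
    if (pool_is_free_index (hsm->sessions[thread_index], hs_index))
      return 0;
    return pool_elt_at_index (hsm->sessions[thread_index], hs_index);
  }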

Regards,
Florin

> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
> 
> http_static produces some errors:
> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp error state CLOSE_WAIT flags 0x02 SYN
> 
> along with multiple different TCP connection state errors related to connections being closed and receiving SYN / SYN+ACK.
> The release build crashes (this already happened before, so it's unrelated to any of the fixes) [1]:
> 
> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>)
>     at /src/vpp/src/vppinfra/tw_timer_template.c:154
> 154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or directory.
> #0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
> #1  tw_timer_stop_2t_1w_2048sl (tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306) at /src/vpp/src/vppinfra/tw_timer_template.c:374
> #2  0x00007fffb076146f in http_static_server_session_timer_stop (hs=<optimized out>) at /src/vpp/src/plugins/http_static/static_server.c:126
> #3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40, cf=CALLED_FROM_RX) at /src/vpp/src/plugins/http_static/static_server.c:1026
> #4  0x00007fffb0760eb8 in http_static_server_rx_callback (s=0x7fffb0967728 <http_static_server_main+288>) at /src/vpp/src/plugins/http_static/static_server.c:1037
> #5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/application_worker.c:485
> #6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/application_worker.c:691
> #7  0x00007ffff7713d9a in session_enqueue_notify_inline (s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/session.c:632
> #8  0x00007ffff7713fd1 in session_main_flush_enqueue_events (transport_proto=<optimized out>, thread_index=0) at /src/vpp/src/vnet/session/session.c:736
> #9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, frame=<optimized out>, is_ip4=1) at /src/vpp/src/vnet/tcp/tcp_input.c:1558
> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, from_frame=0x7fffb5458480) at /src/vpp/src/vnet/tcp/tcp_input.c:1573
> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=0x7fffb4baf400, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>, last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, pending_frame_index=<optimized out>, last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at /src/vpp/src/vlib/main.c:1848
> #14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 <vlib_global_main>, input=0x7fffb4762fb0) at /src/vpp/src/vlib/main.c:2222
> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at /src/vpp/src/vlib/unix/main.c:660
> #17 0x00007ffff4fa9ec4 in clib_calljmp () from /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
> #18 0x00007fffffffd8b0 in ?? ()
> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, argv=<optimized out>) at /src/vpp/src/vlib/unix/main.c:733
> 
> [1] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log
> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io <ivan4th=gmail....@lists.fd.io> wrote:
> Hi,
> I've found a problem with the timer fix and commented in Gerrit [1] 
> accordingly.
> Basically this change [2] makes the tcp_prepare_retransmit_segment() issue go 
> away for me.
> 
> Concerning the proxy example, I can no longer see the SVM FIFO crashes, but when using the debug build, VPP crashes with this error (full log [3]) during my test:
> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], tcp_buffer_hdr (b[1]))' fails
> 
> When using the release build, it produces a lot of messages like this instead:
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error state FIN_WAIT_2 flags 0x12 SYN ACK
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error state TIME_WAIT flags 0x12 SYN ACK
> 
> and also
> 
> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557 failed!
> 
> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
> [2] https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
> [3] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
> 
> I will look into src/vcl/test/test_vcl.py to see if I can reproduce something like my test there, thanks!
> And I'm waiting for Dave's input concerning the CSIT part, too, of course.
> 
> 
> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Ivan, 
> 
> Thanks for the test. After modifying it a bit to run straight from binaries, I managed to reproduce the issue. As expected, the proxy is not cleaning up the sessions correctly (example apps do fall out of sync...). Here's a quick patch that solves some of the obvious issues [1] (note that it's chained with gerrit 27952). I didn't do too much testing, so let me know if you hit some other problems. As far as I can tell, 27952 is needed.
> 
> As for the CI, I guess there are two types of tests we might want (cc-ing Dave since he has experience with this):
> - functional tests that could live as part of the "make test" infra. The host stack already has some functional integration tests, i.e., the vcl tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could do something similar for the proxy app, but the tests need to be lightweight as they're run as part of the verify jobs
> - CSIT scale/performance tests. We could use something like your scripts to test the proxy but also ld_preload + nginx and other applications. Dave should have more opinions here :-)
> 
> Regards, 
> Florin 
> 
> [1] https://gerrit.fd.io/r/c/vpp/+/28041
> 
>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>> 
>> Concerning the CI: I'd be glad to add that test to "make test", but I'm not sure how to approach it. The test is not about containers but more about using network namespaces and tools like wrk to create a lot of TCP connections for some "stress testing" of the VPP host stack (and, as noted, it fails not only on the proxy example but also on the http_static plugin). It's probably doable without any external tooling at all, and even without the network namespaces, using only VPP's own TCP stack, but that is probably rather hard. Could you suggest some ideas on how it could be added to "make test"? Should I add a `test_....py` under `tests/` that creates host interfaces in VPP and uses them via OS networking instead of the packet generator? As far as I can see, there's something like that in the srv6-mobile plugin [1].
>> 
>> [1] https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>> I missed the point about the CI in my other reply. If we can somehow integrate some container-based tests into the "make test" infra, I wouldn't mind at all! :-)
>> 
>> Regards,
>> Florin
>> 
>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>> 
>>> Hi,
>>> sadly, the patch apparently didn't work; it should have, but for some reason it didn't...
>>> 
>>> On the bright side, I've made a test case [1] using fresh upstream VPP code with no UPF that reproduces the issues I mentioned, including both the timer and the TCP retransmit ones, along with some other possible problems, exercising the http_static plugin and the proxy example together with nginx (with the proxy) and wrk.
>>> 
>>> It is Docker-based, but the main scripts (start.sh and test.sh) can be used without Docker, too.
>>> I've used our own Dockerfiles to build the images, but I'm not sure if that makes any difference.
>>> I've added some log files resulting from runs that crashed in different places; for me, the tests crash on every run, but not always in the same place.
>>> 
>>> The TCP retransmit problem happens with http_static when using the debug build. When using the release build, some unrelated crash in timer_remove() happens instead.
>>> The SVM FIFO crash happens when using the proxy. It can happen with both release and debug builds.
>>> 
>>> Please see the repo [1] for details and crash logs.
>>> 
>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>> 
>>> P.S. As the tests do expose some problems with the VPP host stack and some of the VPP plugins/examples, maybe we should consider adding them to the VPP CI, too?
>>> 
>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>> Hi Ivan, 
>>> 
>>> Thanks for the detailed report!
>>> 
>>> I assume this is a situation where most of the connections time out and the rate limiting we apply on the pending timer queue delays handling for long enough to end up in a situation like the one you described. Here's a draft patch that starts tracking pending timers [1]. Let me know if it solves the first problem.
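>>>
>>> Very roughly, the idea is to remember when a timer has popped but its handler hasn't run yet, so a reset in between can cancel the pending handler. A simplified sketch with made-up names (the actual patch differs):
>>>
>>>   #define TIMER_HANDLE_INVALID ((u32) ~0)
>>>
>>>   typedef struct
>>>   {
>>>     u32 handle;  /* slot in the timer wheel, or TIMER_HANDLE_INVALID */
>>>     u8 pending;  /* popped and queued for dispatch, handler not yet run */
>>>   } conn_timer_t;
>>>
>>>   /* Wheel expiry: the handle leaves the wheel, the handler is queued. */
>>>   static void
>>>   conn_timer_popped (conn_timer_t *t)
>>>   {
>>>     t->handle = TIMER_HANDLE_INVALID;
>>>     t->pending = 1;
>>>   }
>>>
>>>   /* Reset while a pop is queued: clearing 'pending' cancels the handler. */
>>>   static void
>>>   conn_timer_reset (conn_timer_t *t)
>>>   {
>>>     t->handle = TIMER_HANDLE_INVALID;
>>>     t->pending = 0;
>>>   }
>>>
>>>   /* Deferred dispatch: skip handlers whose timer was reset meanwhile. */
>>>   static void
>>>   conn_timer_dispatch (conn_timer_t *t)
>>>   {
>>>     if (!t->pending)
>>>       return;
>>>     t->pending = 0;
>>>     /* ... run the actual timeout handler ... */
>>>   }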
>>> 
>>> Regarding the second, it looks like the first chunk in the fifo is not properly initialized or is corrupted. It's hard to tell what leads to that, given that I haven't seen this sort of issue even with larger numbers of connections. You could maybe try calling svm_fifo_is_sane() in the enqueue/dequeue functions, or after the proxy allocates/shares the fifos, to catch the issue as early as possible.
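>>>
>>> E.g., something along these lines (a sketch; proxy_enqueue_checked is a made-up wrapper and the useful call sites are up to you):
>>>
>>>   /* Debug builds abort at the first sign of fifo corruption instead
>>>    * of crashing later in an unrelated place. */
>>>   static int
>>>   proxy_enqueue_checked (svm_fifo_t *f, u32 len, u8 *data)
>>>   {
>>>     int rv = svm_fifo_enqueue (f, len, data);
>>>     ASSERT (svm_fifo_is_sane (f));
>>>     return rv;
>>>   }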
>>> 
>>> Regards, 
>>> Florin
>>> 
>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>> 
>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>> 
>>>> Hi,
>>>> I'm working on the Travelping UPF project: https://github.com/travelping/vpp. For a variety of reasons, it's presently maintained as a fork of VPP that's rebased on top of upstream master from time to time, but really it's just a plugin. During a 40K TCP connection test with netem, I found an issue with a TCP timer race (timers firing after tcp_timer_reset() was called for them), which I tried to work around, only to stumble into another crash that I'm presently debugging (an SVM FIFO bug, possibly); maybe some of you folks have ideas about what it could be.
>>>> I've described my findings in this JIRA ticket: https://jira.fd.io/browse/VPP-1923
>>>> Although the last upstream commit UPF is presently based on (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) dates back a while, I believe the problems are still relevant, as there have been no changes to these parts of the code in master since that commit.
>>> 
>>> 
>>> 
>>> -- 
>>> Ivan Shvedunov <ivan...@gmail.com>
>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
>> 
>> 
>> 
>> -- 
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
> 
> 
> 
> -- 
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
> 
> 
> 
> -- 
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
