Ah, I didn’t try running “test.sh 80”. The only difference in how I’m running the 
test is that I start vpp outside of start.sh, straight from the binaries. 

Regards,
Florin

> On Jul 23, 2020, at 8:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
> 
> Well, I always run the same test, the only difference being
> "test.sh 80" for http_static (it's configured to listen on that port)
> or just "test.sh" for the proxy. As far as I understand, you run the tests 
> without using the containers; does that include setting up netem like this [1]?
> 
> [1] https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35
> On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Ivan, 
> 
> Updated [1] but I’m not seeing [3] after several test iterations. 
> 
> Probably the static server needs the same treatment as the proxy. Are you 
> running a slightly different test? All of the builtin apps have the potential 
> to crash vpp or leave the host stack in an unwanted state since they run 
> inline. 
> 
> Either way, to solve this, the first step would be to get rid of errors like “no 
> http session for thread 0 session_index x”. I’ll eventually try to look into 
> it if nobody beats me to it. 
> 
> Regards,
> Florin
> 
>> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>> 
>> http_static produces some errors:
>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session 
>> for thread 0 session_index 4124
>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session 
>> for thread 0 session_index 4124
>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error 
>> state CLOSE_WAIT flags 0x02 SYN
>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error 
>> state CLOSE_WAIT flags 0x02 SYN
>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp error 
>> state CLOSE_WAIT flags 0x02 SYN
>> 
>> There are also similar errors for multiple different TCP connection states, 
>> related to connections being closed and then receiving SYN / SYN+ACK.
>> The release build crashes [1] (this already happened before, so it's unrelated 
>> to any of the fixes):
>> 
>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>)
>>     at /src/vpp/src/vppinfra/tw_timer_template.c:154
>> 154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or directory.
>> #0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828,
>>     elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
>> #1  tw_timer_stop_2t_1w_2048sl (
>>     tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306)
>>     at /src/vpp/src/vppinfra/tw_timer_template.c:374
>> #2  0x00007fffb076146f in http_static_server_session_timer_stop 
>> (hs=<optimized out>)
>>     at /src/vpp/src/plugins/http_static/static_server.c:126
>> #3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40, cf=CALLED_FROM_RX)
>>     at /src/vpp/src/plugins/http_static/static_server.c:1026
>> #4  0x00007fffb0760eb8 in http_static_server_rx_callback (
>>     s=0x7fffb0967728 <http_static_server_main+288>)
>>     at /src/vpp/src/plugins/http_static/static_server.c:1037
>> #5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized out>, 
>> s=0x7fffb5e13a40)
>>     at /src/vpp/src/vnet/session/application_worker.c:485
>> #6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40)
>>     at /src/vpp/src/vnet/session/application_worker.c:691
>> #7  0x00007ffff7713d9a in session_enqueue_notify_inline (s=0x7fffb5e13a40)
>>     at /src/vpp/src/vnet/session/session.c:632
>> #8  0x00007ffff7713fd1 in session_main_flush_enqueue_events 
>> (transport_proto=<optimized out>,
>>     thread_index=0) at /src/vpp/src/vnet/session/session.c:736
>> #9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 
>> <vlib_global_main>,
>>     node=<optimized out>, frame=<optimized out>, is_ip4=1) at 
>> /src/vpp/src/vnet/tcp/tcp_input.c:1558
>> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>, 
>> node=<optimized out>,
>>     from_frame=0x7fffb5458480) at /src/vpp/src/vnet/tcp/tcp_input.c:1573
>> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 
>> <vlib_global_main>, node=0x7fffb4baf400,
>>     type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, 
>> frame=<optimized out>,
>>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
>> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, 
>> pending_frame_index=<optimized out>,
>>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
>> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at 
>> /src/vpp/src/vlib/main.c:1848
>> #14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
>> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 <vlib_global_main>, 
>> input=0x7fffb4762fb0)
>>     at /src/vpp/src/vlib/main.c:2222
>> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at 
>> /src/vpp/src/vlib/unix/main.c:660
>> #17 0x00007ffff4fa9ec4 in clib_calljmp () from 
>> /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
>> #18 0x00007fffffffd8b0 in ?? ()
>> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, 
>> argv=<optimized out>)
>>     at /src/vpp/src/vlib/unix/main.c:733
>> 
>> [1] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log
>> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io 
>> <ivan4th=gmail....@lists.fd.io> wrote:
>> Hi,
>> I've found a problem with the timer fix and commented in Gerrit [1] 
>> accordingly.
>> Basically this change [2] makes the tcp_prepare_retransmit_segment() issue 
>> go away for me.
>> 
>> Concerning the proxy example, I can no longer see the SVM FIFO crashes, but 
>> when using the debug build, VPP crashes with this error (full log [3]) during 
>> my test:
>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 
>> (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], 
>> tcp_buffer_hdr (b[1]))' fails
>> 
>> When using the release build, it produces a lot of messages like this instead:
>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error 
>> state CLOSE_WAIT flags 0x02 SYN
>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error 
>> state FIN_WAIT_2 flags 0x12 SYN ACK
>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error 
>> state TIME_WAIT flags 0x12 SYN ACK
>> 
>> and also
>> 
>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557 
>> failed!
>> 
>> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
>> [2] https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
>> [3] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>> 
>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce 
>> something like my test there, thanks!
>> And I'm waiting for Dave's input concerning the CSIT part, too, of course.
>> 
>> 
>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>> Hi Ivan, 
>> 
>> Thanks for the test. After modifying it a bit to run straight from the 
>> binaries, I managed to repro the issue. As expected, the proxy is not cleaning 
>> up the sessions correctly (the example apps do tend to fall out of sync with 
>> the host stack). Here’s a quick patch that solves some of the obvious issues 
>> [1] (note that it’s chained with gerrit 27952). I didn’t do too much testing, 
>> so let me know if you hit some other problems. As far as I can tell, 27952 is 
>> needed. 
>> 
>> As for the CI, I guess there are two types of tests we might want (cc-ing 
>> Dave since he has experience with this):
>> - functional tests that could live as part of the “make test” infra. The host 
>> stack already has some functional integration tests, e.g., the vcl tests in 
>> src/vcl/test/test_vcl.py (quic, tls and tcp also have some). We could do 
>> something similar for the proxy app, but the tests need to be lightweight, 
>> as they’re run as part of the verify jobs.
>> - CSIT scale/performance tests. We could use something like your scripts to 
>> test the proxy, but also ld_preload + nginx and other applications. Dave 
>> should have more opinions here :-)
>> 
>> Regards, 
>> Florin 
>> 
>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>> 
>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>> 
>>> Concerning the CI: I'd be glad to add that test to "make test", but I'm not 
>>> sure how to approach it. The test is not about containers but rather about 
>>> using network namespaces and tools like wrk to create a lot of TCP 
>>> connections and "stress test" the VPP host stack (and, as was noted, it 
>>> fails not only on the proxy example but also on the http_static plugin). 
>>> It's probably doable without any external tooling at all, and even without 
>>> the network namespaces, using only VPP's own TCP stack, but that is probably 
>>> rather hard. Could you suggest how it could be added to "make test"? Should 
>>> I add a `test_....py` under `tests/` that creates host interfaces in VPP and 
>>> uses them via OS networking instead of the packet generator? As far as I can 
>>> see, there's something like that in the srv6-mobile plugin [1].
>>> 
>>> [1] https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>> I missed the point about the CI in my other reply. If we can somehow 
>>> integrate some container-based tests into the “make test” infra, I wouldn’t 
>>> mind at all! :-)
>>> 
>>> Regards,
>>> Florin
>>> 
>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>> 
>>>> Hi,
>>>> sadly, the patch apparently didn't work. It should have, but for some 
>>>> reason it didn't ...
>>>> 
>>>> On the bright side, I've made a test case [1] using fresh upstream VPP 
>>>> code with no UPF that reproduces the issues I mentioned, including both 
>>>> the timer and the TCP retransmit ones, along with some other possible 
>>>> problems, using the http_static plugin and the proxy example together 
>>>> with nginx (for the proxy case) and wrk.
>>>> 
>>>> It is Docker-based, but the main scripts (start.sh and test.sh) can be 
>>>> used without Docker, too.
>>>> I've used our own Dockerfiles to build the images, but I'm not sure if 
>>>> that makes any difference. 
>>>> I've added some log files from the crashed runs. For me, the tests crash 
>>>> on every run, but in different places.
>>>> 
>>>> The TCP retransmit problem happens with http_static when using the debug 
>>>> build. When using the release build, an unrelated crash in timer_remove() 
>>>> happens instead.
>>>> The SVM FIFO crash happens when using the proxy. It can happen with both 
>>>> release and debug builds.
>>>> 
>>>> Please see the repo [1] for details and crash logs.
>>>> 
>>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>> 
>>>> P.S. As the tests do expose some problems with the VPP host stack and some 
>>>> of the VPP plugins/examples, maybe we should consider adding them to the 
>>>> VPP CI, too?
>>>> 
>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>> Hi Ivan, 
>>>> 
>>>> Thanks for the detailed report!
>>>> 
>>>> I assume this is a situation where most of the connections time out, and 
>>>> the rate limiting we apply on the pending timer queue delays handling 
>>>> long enough to end up in the state you described. Here’s a draft patch 
>>>> that starts tracking pending timers [1]. Let me know if it solves the 
>>>> first problem. 
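>>>> 
>>>> To make the race concrete, here is a small self-contained model of the 
>>>> pattern (purely illustrative -- the names are made up, and this is neither 
>>>> the VPP timer wheel code nor exactly what the patch does): expired handles 
>>>> are queued first and their callbacks run later, so a stop that does not 
>>>> track the pending state cannot prevent a deferred callback from firing for 
>>>> a timer its owner already reset.
>>>> 
>>>> /* timer_race_model.c: minimal model of "timer fires after it was stopped".
>>>>  * Names and structure are illustrative, not taken from VPP. */
>>>> #include <stdio.h>
>>>> 
>>>> #define N_TIMERS 4
>>>> enum { T_FREE, T_ARMED, T_PENDING, T_STOPPED };
>>>> 
>>>> static int state[N_TIMERS];
>>>> static int pending[N_TIMERS], n_pending;
>>>> 
>>>> /* the wheel moves an expired handle onto a pending list; the callback
>>>>  * itself runs later (possibly rate limited across main-loop iterations) */
>>>> static void
>>>> timer_expire (int h)
>>>> {
>>>>   state[h] = T_PENDING;
>>>>   pending[n_pending++] = h;
>>>> }
>>>> 
>>>> /* the owner cancels the timer; recording the PENDING -> STOPPED transition
>>>>  * here is what lets the dispatcher below skip the stale handle */
>>>> static void
>>>> timer_stop (int h)
>>>> {
>>>>   state[h] = T_STOPPED;
>>>> }
>>>> 
>>>> /* deferred dispatch of expired timers */
>>>> static void
>>>> dispatch_pending (void)
>>>> {
>>>>   for (int i = 0; i < n_pending; i++)
>>>>     {
>>>>       int h = pending[i];
>>>>       if (state[h] != T_PENDING)
>>>>         {
>>>>           printf ("handle %d was stopped after expiry, skipping\n", h);
>>>>           state[h] = T_FREE;
>>>>           continue;
>>>>         }
>>>>       printf ("running callback for handle %d\n", h);
>>>>       state[h] = T_FREE;
>>>>     }
>>>>   n_pending = 0;
>>>> }
>>>> 
>>>> int
>>>> main (void)
>>>> {
>>>>   state[1] = T_ARMED;
>>>>   timer_expire (1);     /* timer fires, queued for deferred handling     */
>>>>   timer_stop (1);       /* connection resets the timer in the meantime   */
>>>>   dispatch_pending ();  /* without the state check this would run the
>>>>                            callback for a timer its owner already stopped */
>>>>   return 0;
>>>> }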
>>>> 
>>>> Regarding the second, it looks like the first chunk in the fifo is not 
>>>> properly initialized or is corrupted. It’s hard to tell what leads to that, 
>>>> given that I haven’t seen this sort of issue even with larger numbers of 
>>>> connections. You could maybe try calling svm_fifo_is_sane() in the 
>>>> enqueue/dequeue functions, or after the proxy allocates/shares the fifos, 
>>>> to catch the issue as early as possible. 
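>>>> 
>>>> For example, something of this shape right after the proxy sets up a 
>>>> session's fifos (just a sketch: the helper name and call site are made up, 
>>>> it assumes the usual session_t rx_fifo/tx_fifo fields, it only builds 
>>>> inside the VPP tree, and ASSERT is typically compiled in only for debug 
>>>> images):
>>>> 
>>>> #include <vnet/session/session.h>
>>>> #include <svm/svm_fifo.h>
>>>> 
>>>> /* illustrative helper: validate both fifos as soon as a session has them,
>>>>  * so corruption is caught close to where it is introduced rather than in
>>>>  * a much later enqueue/dequeue */
>>>> static inline void
>>>> proxy_check_fifos (session_t * s)
>>>> {
>>>>   if (s->rx_fifo)
>>>>     ASSERT (svm_fifo_is_sane (s->rx_fifo));
>>>>   if (s->tx_fifo)
>>>>     ASSERT (svm_fifo_is_sane (s->tx_fifo));
>>>> }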
>>>> 
>>>> Regards, 
>>>> Florin
>>>> 
>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>> 
>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>>> 
>>>>>   Hi,
>>>>>   I'm working on the Travelping UPF project 
>>>>> (https://github.com/travelping/vpp). For a variety of reasons, it's 
>>>>> presently maintained as a fork of VPP that's rebased on top of upstream 
>>>>> master from time to time, but really it's just a plugin. During a 40K TCP 
>>>>> connection test with netem, I found an issue with a TCP timer race (timers 
>>>>> firing after tcp_timer_reset() was called for them), which I tried to work 
>>>>> around, only to stumble into another crash, which I'm presently debugging 
>>>>> (possibly an SVM FIFO bug), but maybe some of you folks have some ideas 
>>>>> what it could be.
>>>>>   I've described my findings in this JIRA ticket: 
>>>>> https://jira.fd.io/browse/VPP-1923
>>>>>   Although the upstream commit UPF is presently based on 
>>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) is from some time ago, I 
>>>>> believe the problems are still relevant, as there have been no changes 
>>>>> to these parts of the code in master since that commit. 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Ivan Shvedunov <ivan...@gmail.com>
>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 
>>>> 2807
>>> 
>>> 
>>> 
>>> -- 
>>> Ivan Shvedunov <ivan...@gmail.com>
>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
>> 
>> 
>> 
>> -- 
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
>> 
>> 
>> 
>> -- 
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
> 
> 
> 
> -- 
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
