http_static produces some errors:

/usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
/usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
/usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
/usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
/usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp error state CLOSE_WAIT flags 0x02 SYN

Similar dispatch errors show up for multiple different TCP connection states, all related to connections being closed while receiving SYN / SYN+ACK.
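For context, the "No http session" line corresponds to a defensive lookup in the rx/tx callback: by the time the event is handled, the http session pool slot for the opaque session index has already been freed. A minimal sketch of that lookup pattern follows; the http_session_t layout and the http_sessions_by_thread pool are invented stand-ins for the plugin's real state, and only the vppinfra pool macros are the actual API:

```c
#include <vppinfra/pool.h>
#include <vppinfra/error.h>

/* Invented session type; the real one lives in
   src/plugins/http_static/static_server.c. */
typedef struct
{
  u32 session_index;
  u32 timer_handle;
} http_session_t;

/* Invented per-thread session pool standing in for the plugin's state. */
static http_session_t **http_sessions_by_thread;

/* Check the pool slot before dereferencing it.  If the slot was already
   reclaimed (e.g. the session was closed on another path), log and bail
   out instead of touching stale memory; this is the kind of check that
   emits the "No http session ..." warnings above. */
static http_session_t *
http_session_get_safe (u32 thread_index, u32 hs_index)
{
  http_session_t *pool = http_sessions_by_thread[thread_index];
  if (pool_is_free_index (pool, hs_index))
    {
      clib_warning ("No http session for thread %u session_index %u",
                    thread_index, hs_index);
      return 0;
    }
  return pool_elt_at_index (pool, hs_index);
}
```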
The release build crashes (this had already happened before, so it's unrelated to any of the fixes [1]):

/usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
/usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
154	/src/vpp/src/vppinfra/tw_timer_template.c: No such file or directory.
#0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
#1  tw_timer_stop_2t_1w_2048sl (tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306) at /src/vpp/src/vppinfra/tw_timer_template.c:374
#2  0x00007fffb076146f in http_static_server_session_timer_stop (hs=<optimized out>) at /src/vpp/src/plugins/http_static/static_server.c:126
#3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40, cf=CALLED_FROM_RX) at /src/vpp/src/plugins/http_static/static_server.c:1026
#4  0x00007fffb0760eb8 in http_static_server_rx_callback (s=0x7fffb0967728 <http_static_server_main+288>) at /src/vpp/src/plugins/http_static/static_server.c:1037
#5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/application_worker.c:485
#6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/application_worker.c:691
#7  0x00007ffff7713d9a in session_enqueue_notify_inline (s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/session.c:632
#8  0x00007ffff7713fd1 in session_main_flush_enqueue_events (transport_proto=<optimized out>, thread_index=0) at /src/vpp/src/vnet/session/session.c:736
#9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, frame=<optimized out>, is_ip4=1) at /src/vpp/src/vnet/tcp/tcp_input.c:1558
#10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, from_frame=0x7fffb5458480) at /src/vpp/src/vnet/tcp/tcp_input.c:1573
#11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=0x7fffb4baf400, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>, last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
#12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, pending_frame_index=<optimized out>, last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
#13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at /src/vpp/src/vlib/main.c:1848
#14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
#15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 <vlib_global_main>, input=0x7fffb4762fb0) at /src/vpp/src/vlib/main.c:2222
#16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at /src/vpp/src/vlib/unix/main.c:660
#17 0x00007ffff4fa9ec4 in clib_calljmp () from /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
#18 0x00007fffffffd8b0 in ?? ()
#19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, argv=<optimized out>) at /src/vpp/src/vlib/unix/main.c:733

[1] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log
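The crash itself is tw_timer_stop_2t_1w_2048sl() being handed a handle the wheel no longer owns, which makes timer_remove() walk freed pool memory. A common guard, sketched below with invented names (hs_session_t, hs_timer_wheel; the plugin's real wheel is embedded in http_static_server_main), is to store at most one live handle per session and invalidate it on every stop or expiry:

```c
#include <vppinfra/tw_timer_2t_1w_2048sl.h>

#define HS_TIMER_HANDLE_INVALID ((u32) ~0)

/* Invented session state; only the stored timer handle matters here. */
typedef struct
{
  u32 timer_handle;
} hs_session_t;

/* Invented wheel instance standing in for the plugin's embedded wheel. */
static tw_timer_wheel_2t_1w_2048sl_t hs_timer_wheel;

/* Stop the session timer at most once.  Passing an already-stopped (or
   already-expired and recycled) handle to tw_timer_stop corrupts the
   wheel's internal element pool, which matches the timer_remove()
   SIGSEGV in the backtrace above. */
static void
hs_session_timer_stop_safe (hs_session_t *hs)
{
  if (hs->timer_handle == HS_TIMER_HANDLE_INVALID)
    return;
  tw_timer_stop_2t_1w_2048sl (&hs_timer_wheel, hs->timer_handle);
  hs->timer_handle = HS_TIMER_HANDLE_INVALID;
}
```

The expiry callback needs the same discipline: mark the stored handle invalid before running the timeout handler, so that a stop issued from (or racing with) the handler becomes a no-op.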
On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io <ivan4th=gmail....@lists.fd.io> wrote:

> Hi,
> I've found a problem with the timer fix and commented in Gerrit [1] accordingly.
> Basically, this change [2] makes the tcp_prepare_retransmit_segment() issue go away for me.
>
> Concerning the proxy example, I can no longer see the SVM FIFO crashes, but when using the debug build, VPP crashes with this error (full log [3]) during my test:
> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], tcp_buffer_hdr (b[1]))' fails
>
> When using the release build, it produces a lot of messages like this instead:
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error state FIN_WAIT_2 flags 0x12 SYN ACK
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error state TIME_WAIT flags 0x12 SYN ACK
>
> and also
>
> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557 failed!
>
> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
> [2] https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
> [3] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>
> I will look into src/vcl/test/test_vcl.py to see if I can reproduce something like my test there, thanks!
> And I'm waiting for Dave's input concerning the CSIT part, too, of course.
>
>
> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi Ivan,
>>
>> Thanks for the test. After modifying it a bit to run straight from binaries, I managed to repro the issue. As expected, the proxy is not cleaning up the sessions correctly (example apps do run out of sync ...). Here’s a quick patch that solves some of the obvious issues [1] (note that it’s chained with gerrit 27952). I didn’t do too much testing, so let me know if you hit some other problems. As far as I can tell, 27952 is needed.
>>
>> As for the CI, I guess there are two types of tests we might want (cc-ing Dave since he has experience with this):
>> - functional tests that could live as part of the “make test” infra. The host stack already has some functional integration tests, i.e., the vcl tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could do something similar for the proxy app, but the tests need to be lightweight as they’re run as part of the verify jobs
>> - CSIT scale/performance tests. We could use something like your scripts to test the proxy but also ld_preload + nginx and other applications. Dave should have more opinions here :-)
>>
>> Regards,
>> Florin
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>>
>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Concerning the CI: I'd be glad to add that test to "make test", but I'm not sure how to approach it.
>> The test is not about containers but more about using network namespaces and some tools like wrk to create a lot of TCP connections to do some "stress testing" of the VPP host stack (and, as was noted, it fails not only on the proxy example, but also on the http_static plugin). It's probably doable w/o any external tooling at all, and even without the network namespaces, using only VPP's own TCP stack, but that is probably rather hard. Could you suggest some ideas on how it could be added to "make test"? Should I add a `test_....py` under `tests/` that creates host interfaces in VPP and uses them via OS networking instead of the packet generator? As far as I can see, there's something like that in the srv6-mobile plugin [1].
>>
>> [1] https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>>
>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> I missed the point about the CI in my other reply. If we can somehow integrate some container-based tests into the “make test” infra, I wouldn’t mind at all! :-)
>>>
>>> Regards,
>>> Florin
>>>
>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>
>>> Hi,
>>> sadly, the patch apparently didn't work. It should have worked, but for some reason it didn't ...
>>>
>>> On the bright side, I've made a test case [1] using fresh upstream VPP code with no UPF that reproduces the issues I mentioned, including both the timer and the TCP retransmit ones, along with some other possible problems in the http_static plugin and the proxy example, using nginx (with the proxy) and wrk.
>>>
>>> It is docker-based, but the main scripts (start.sh and test.sh) can be used without Docker, too.
>>> I've used our own Dockerfiles to build the images, but I'm not sure if that makes any difference.
>>> I've added some log files resulting from the runs that crashed in different places. For me, the tests crash on each run, but in different places.
>>>
>>> The TCP retransmit problem happens with http_static when using the debug build. When using the release build, an unrelated crash in timer_remove() happens instead.
>>> The SVM FIFO crash happens when using the proxy. It can happen with both release and debug builds.
>>>
>>> Please see the repo [1] for details and crash logs.
>>>
>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>
>>> P.S. As the tests do expose some problems with the VPP host stack and some of the VPP plugins/examples, maybe we should consider adding them to the VPP CI, too?
>>>
>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi Ivan,
>>>>
>>>> Thanks for the detailed report!
>>>>
>>>> I assume this is a situation where most of the connections time out and the rate limiting we apply on the pending timer queue delays handling for long enough to be in a situation like the one you described. Here’s a draft patch that starts tracking pending timers [1]. Let me know if it solves the first problem.
>>>>
>>>> Regarding the second, it looks like the first chunk in the fifo is either not properly initialized or corrupted. It’s hard to tell what leads to that, given that I haven’t seen this sort of issue even with larger numbers of connections. You could maybe try calling svm_fifo_is_sane() in the enqueue/dequeue functions, or after the proxy allocates/shares the fifos, to catch the issue as early as possible.
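For anyone wanting to try that suggestion, the wiring could look roughly like the sketch below. proxy_rx_enqueue is a hypothetical wrapper, not an existing VPP function; only svm_fifo_is_sane() and svm_fifo_enqueue() are the real svm fifo API:

```c
#include <svm/svm_fifo.h>

/* Hypothetical wrapper for the proxy's enqueue path: assert fifo sanity
   on entry and exit so a corrupted chunk list is caught at the offending
   call site rather than much later in an unrelated dequeue. */
static inline int
proxy_rx_enqueue (svm_fifo_t *f, u32 len, u8 *data)
{
  ASSERT (svm_fifo_is_sane (f));
  int rv = svm_fifo_enqueue (f, len, data);
  ASSERT (svm_fifo_is_sane (f));
  return rv;
}
```

Note that ASSERT compiles out unless CLIB_DEBUG is set, so this only catches anything in debug builds.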
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>>
>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>>
>>>> Hi,
>>>> I'm working on the Travelping UPF project https://github.com/travelping/vpp. For a variety of reasons, UPF is presently maintained as a fork of VPP that's rebased on top of upstream master from time to time, but really it's just a plugin. During a 40K TCP connection test with netem, I found an issue with a TCP timer race (timers firing after tcp_timer_reset() was called for them), which I tried to work around, only to stumble into another crash that I'm presently debugging (an SVM FIFO bug, possibly); maybe some of you folks have some ideas what it could be.
>>>> I've described my findings in this JIRA ticket: https://jira.fd.io/browse/VPP-1923
>>>> Although the last upstream commit UPF is presently based on (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) dates back a while, I believe the problems are still relevant, as there have been no changes in these parts of the code in master since that commit.
>>>>
>>>>
>>>
>>> --
>>> Ivan Shvedunov <ivan...@gmail.com>
>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
>>>
>>>
>>
>> --
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
>>
>>
>
> --
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
>

--
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
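A footnote on the timer race Ivan describes in his first message above (timers firing after tcp_timer_reset() was called for them): the draft patch Florin links (gerrit 27952) addresses it by tracking pending timers. The sketch below renders the general idea under invented names (conn_t, the handle markers, dispatch_expired); it is not the actual contents of that patch. Because expired handles are queued and rate-limited before their handlers run, a timer slot must distinguish "live in the wheel" from "expired but not yet handled", and a reset in that window must only invalidate the slot, never call tw_timer_stop():

```c
#include <vppinfra/types.h>

/* Invented markers: a slot holds a live wheel handle, INVALID (no
   timer), or PENDING (expired by the wheel, handler not yet run). */
#define TIMER_HANDLE_INVALID ((u32) ~0)
#define TIMER_HANDLE_PENDING ((u32) ~0 - 1)

typedef struct
{
  u32 timer_handle; /* invented per-connection timer slot */
} conn_t;

/* Wheel expiry callback: the wheel has already freed the handle, but
   the handler is deferred, so record that state on the connection. */
static void
on_timer_expired (conn_t *c)
{
  c->timer_handle = TIMER_HANDLE_PENDING;
}

/* Reset: only a live handle may be passed to tw_timer_stop(); a
   PENDING slot is just invalidated so the dispatcher skips it. */
static void
timer_reset (conn_t *c)
{
  if (c->timer_handle < TIMER_HANDLE_PENDING)
    {
      /* live handle: tw_timer_stop (&wheel, c->timer_handle); */
    }
  c->timer_handle = TIMER_HANDLE_INVALID;
}

/* Deferred dispatch: fire the handler only if the timer was not reset
   or restarted between expiry and now. */
static void
dispatch_expired (conn_t *c)
{
  if (c->timer_handle != TIMER_HANDLE_PENDING)
    return; /* reset in the meantime: drop the stale event */
  c->timer_handle = TIMER_HANDLE_INVALID;
  /* ... run the actual timeout handler here ... */
}
```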