Hi Florin,

I re-verified the patches and the modified patch doesn't crash either, so I think it's safe to merge it. Thanks!
I will try to see what the remaining problem with http_static is.

On Fri, Jul 24, 2020 at 8:15 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Ivan,
>
> Adding Vanessa to see if she can help with the account issues.
>
> Thanks a lot for the patches! Pushed them here [1] and [2]. I took the liberty of slightly changing [2], so if you get a chance, do try it out again.
>
> Finally, the static http server still needs fixes. Most probably it mishandles cleanups and corrupts the sessions/connections under heavy load.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/28063
> [2] https://gerrit.fd.io/r/c/vpp/+/28064
>
> On Jul 24, 2020, at 9:27 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> I did a bit more debugging and found an issue that was causing invalid TCP connection lookups. Basically, if session_connected_callback failed for an app (in the case of the proxy, e.g. because the other corresponding connection got closed), it left an invalid entry in the session lookup table.
>
> Another, less serious, problem I found while debugging is that when TCP elog is enabled in tcp_debug.h, the TCP stack crashes upon accepting connections because of a misplaced TCP_EVT. After fixing these issues, my test no longer crashes with the proxy example. That being said, the http_static plugin still crashes with the timer error.
>
> I tried to submit both patches to fd.io Gerrit, but as I hadn't submitted any CRs to this Gerrit before, I ran into a problem with the old email in my Linux Foundation account settings. I've fixed that and then updated the email address in fd.io Gerrit too, but I'm still getting the same error:
>
> remote: ERROR: commit e98791a: email address ivan...@gmail.com is not registered in your account, and you lack 'forge committer' permission.
>
> Because of these issues I'm attaching both patches here for now.
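For illustration, here is a minimal, runnable Python sketch of the failure mode described above. This is not VPP code and all names are invented; it only models the invariant that a failed "connected" callback must remove the lookup-table entry along with the session, otherwise a later lookup on the same 5-tuple returns a stale index.

# Toy model of a session layer keyed by 5-tuple. Illustrates why a failed
# "connected" callback must also clean up the lookup table entry.
sessions = {}        # session_index -> session dict
lookup_table = {}    # 5-tuple -> session_index
next_index = 0

def session_alloc(five_tuple):
    global next_index
    idx = next_index
    next_index += 1
    sessions[idx] = {"five_tuple": five_tuple, "state": "connecting"}
    lookup_table[five_tuple] = idx
    return idx

def connected_callback_fails(idx):
    # Simulates the app (e.g. a proxy) rejecting the connection because
    # its peer connection is already gone.
    return True

def handle_connected(idx, buggy=True):
    if connected_callback_fails(idx):
        five_tuple = sessions[idx]["five_tuple"]
        del sessions[idx]                   # the session is freed...
        if not buggy:
            del lookup_table[five_tuple]    # ...and the lookup entry with it

five_tuple = ("10.0.0.1", 1234, "10.0.0.2", 80, "tcp")
idx = session_alloc(five_tuple)
handle_connected(idx, buggy=True)

# A late packet for the same 5-tuple now resolves to a session index that
# no longer exists (or may have been reused for another connection).
stale = lookup_table.get(five_tuple)
print("lookup returned", stale, "but session exists:", stale in sessions)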
> On Thu, Jul 23, 2020 at 6:47 PM Florin Coras <fcoras.li...@gmail.com> wrote:

>> Ah, I didn't try running test.sh 80. The only difference in how I'm running the test is that I start vpp outside of start.sh, straight from binaries.
>>
>> Regards,
>> Florin

>> On Jul 23, 2020, at 8:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:

>> Well, I always run the same test, the difference being only "test.sh 80" for http_static (it's configured to listen on that port) or just "test.sh" for the proxy. As far as I understand, you run the tests without using the containers; does that include setting up netem like this [1]?
>>
>> [1] https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35

>> On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com> wrote:

>>> Hi Ivan,
>>>
>>> Updated [1] but I'm not seeing [3] after several test iterations.
>>>
>>> Probably the static server needs the same treatment as the proxy. Are you running a slightly different test? All of the builtin apps have the potential to crash vpp or leave the host stack in an unwanted state since they run inline.
>>>
>>> Either way, to solve this, the first step would be to get rid of errors like "no http session for thread 0 session_index x". I will eventually try to look into it if nobody beats me to it.
>>>
>>> Regards,
>>> Florin

>>> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:

>>> http_static produces some errors:
>>>
>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp error state CLOSE_WAIT flags 0x02 SYN
>>>
>>> along with similar errors for multiple different TCP connection states related to connections being closed and receiving SYN / SYN+ACK.
>>>
>>> The release build crashes [1] (this already happened before, so it's unrelated to any of the fixes):
>>>
>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
>>> 154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or directory.
>>> #0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
>>> #1  tw_timer_stop_2t_1w_2048sl (tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306) at /src/vpp/src/vppinfra/tw_timer_template.c:374
>>> #2  0x00007fffb076146f in http_static_server_session_timer_stop (hs=<optimized out>) at /src/vpp/src/plugins/http_static/static_server.c:126
>>> #3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40, cf=CALLED_FROM_RX) at /src/vpp/src/plugins/http_static/static_server.c:1026
>>> #4  0x00007fffb0760eb8 in http_static_server_rx_callback (s=0x7fffb0967728 <http_static_server_main+288>) at /src/vpp/src/plugins/http_static/static_server.c:1037
>>> #5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/application_worker.c:485
>>> #6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/application_worker.c:691
>>> #7  0x00007ffff7713d9a in session_enqueue_notify_inline (s=0x7fffb5e13a40) at /src/vpp/src/vnet/session/session.c:632
>>> #8  0x00007ffff7713fd1 in session_main_flush_enqueue_events (transport_proto=<optimized out>, thread_index=0) at /src/vpp/src/vnet/session/session.c:736
>>> #9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, frame=<optimized out>, is_ip4=1) at /src/vpp/src/vnet/tcp/tcp_input.c:1558
>>> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, from_frame=0x7fffb5458480) at /src/vpp/src/vnet/tcp/tcp_input.c:1573
>>> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=0x7fffb4baf400, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>, last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
>>> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, pending_frame_index=<optimized out>, last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
>>> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at /src/vpp/src/vlib/main.c:1848
>>> #14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
>>> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 <vlib_global_main>, input=0x7fffb4762fb0) at /src/vpp/src/vlib/main.c:2222
>>> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at /src/vpp/src/vlib/unix/main.c:660
>>> #17 0x00007ffff4fa9ec4 in clib_calljmp () from /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
>>> #18 0x00007fffffffd8b0 in ?? ()
>>> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, argv=<optimized out>) at /src/vpp/src/vlib/unix/main.c:733
>>>
>>> [1] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log

>>> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io <ivan4th=gmail....@lists.fd.io> wrote:

>>>> Hi,
>>>>
>>>> I've found a problem with the timer fix and commented in Gerrit [1] accordingly. Basically, this change [2] makes the tcp_prepare_retransmit_segment() issue go away for me.
>>>>
>>>> Concerning the proxy example, I can no longer see the SVM FIFO crashes, but when using the debug build, VPP crashes with this error (full log [3]) during my test:
>>>>
>>>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], tcp_buffer_hdr (b[1]))' fails
>>>>
>>>> When using the release build, it produces a lot of messages like this instead:
>>>>
>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error state CLOSE_WAIT flags 0x02 SYN
>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error state FIN_WAIT_2 flags 0x12 SYN ACK
>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error state TIME_WAIT flags 0x12 SYN ACK
>>>>
>>>> and also
>>>>
>>>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557 failed!
>>>>
>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
>>>> [2] https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
>>>> [3] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>>>>
>>>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce something like my test there, thanks! And I'm waiting for Dave's input concerning the CSIT part, too, of course.

>>>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> wrote:

>>>>> Hi Ivan,
>>>>>
>>>>> Thanks for the test. After modifying it a bit to run straight from binaries, I managed to repro the issue. As expected, the proxy is not cleaning up the sessions correctly (example apps do run out of sync ..). Here's a quick patch that solves some of the obvious issues [1] (note that it's chained with gerrit 27952). I didn't do too much testing, so let me know if you hit some other problems. As far as I can tell, 27952 is needed.
>>>>>
>>>>> As for the CI, I guess there are two types of tests we might want (cc-ing Dave since he has experience with this):
>>>>> - functional tests that could live as part of the "make test" infra. The host stack already has some functional integration tests, i.e. the vcl tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could do something similar for the proxy app, but the tests need to be lightweight as they're run as part of the verify jobs
>>>>> - CSIT scale/performance tests. We could use something like your scripts to test the proxy but also ld_preload + nginx and other applications. Dave should have more opinions here :-)
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>>>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:

>>>>> Concerning the CI: I'd be glad to add that test to "make test", but I'm not sure how to approach it. The test is not about containers but more about using network namespaces and tools like wrk to create a lot of TCP connections in order to do some "stress testing" of the VPP host stack (and, as was noted, it fails not only on the proxy example, but also on the http_static plugin). It's probably doable without any external tooling at all, and even without the network namespaces, using only VPP's own TCP stack, but that is probably rather hard.
>>>>>
>>>>> Could you suggest some ideas for how it could be added to "make test"? Should I add a `test_....py` under `tests/` that creates host interfaces in VPP and uses these via OS networking instead of the packet generator? As far as I can see, there's something like that in the srv6-mobile plugin [1].
>>>>>
>>>>> [1] https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
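For illustration, a rough Python sketch of what such a test file could look like. It is not a working test from the tree: it assumes the conventions of the existing "make test" cases (VppTestCase, VppTestRunner, vapi.cli) and that iproute2 and wrk are available on the host; the class name, veth names, namespace name, and the exact http_static CLI arguments are all made up and would need adjusting.

import subprocess
import unittest

from framework import VppTestCase, VppTestRunner  # VPP "make test" framework


class TestHostStackStress(VppTestCase):
    """Hypothetical stress test: wrk in a netns against a VPP host interface."""

    @classmethod
    def setUpClass(cls):
        super(TestHostStackStress, cls).setUpClass()
        # veth pair: one end stays in a client namespace, the other is
        # handed to VPP as an af_packet host interface.
        subprocess.check_call("ip netns add vpp_client", shell=True)
        subprocess.check_call("ip link add vppout type veth peer name vppin", shell=True)
        subprocess.check_call("ip link set vppin netns vpp_client", shell=True)
        subprocess.check_call("ip link set vppout up", shell=True)
        subprocess.check_call(
            "ip netns exec vpp_client ip addr add 10.10.0.2/24 dev vppin", shell=True)
        subprocess.check_call("ip netns exec vpp_client ip link set vppin up", shell=True)

    @classmethod
    def tearDownClass(cls):
        subprocess.call("ip netns del vpp_client", shell=True)
        subprocess.call("ip link del vppout", shell=True)
        super(TestHostStackStress, cls).tearDownClass()

    def test_http_static_stress(self):
        with open("/tmp/index.html", "w") as f:
            f.write("hello from http_static")
        # Attach the veth to VPP and start the static HTTP server.
        self.vapi.cli("create host-interface name vppout")
        self.vapi.cli("set interface ip address host-vppout 10.10.0.1/24")
        self.vapi.cli("set interface state host-vppout up")
        self.vapi.cli("http static server www-root /tmp uri tcp://0.0.0.0/80")
        # Hammer the server from the namespace; the point is only that VPP
        # survives, the framework's crash detection does the rest.
        subprocess.check_call(
            "ip netns exec vpp_client wrk -c 1000 -t 4 -d 10s http://10.10.0.1/index.html",
            shell=True)
        self.logger.info(self.vapi.cli("show session verbose"))


if __name__ == '__main__':
    unittest.main(testRunner=VppTestRunner)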
>>>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com> wrote:

>>>>>> I missed the point about the CI in my other reply. If we can somehow integrate some container-based tests into the "make test" infra, I wouldn't mind at all! :-)
>>>>>>
>>>>>> Regards,
>>>>>> Florin

>>>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:

>>>>>> Hi,
>>>>>> sadly the patch apparently didn't work. It should have worked, but for some reason it didn't ...
>>>>>>
>>>>>> On the bright side, I've made a test case [1] using fresh upstream VPP code with no UPF that reproduces the issues I mentioned, including both the timer and the TCP retransmit ones, along with some other possible problems, using the http_static plugin and the proxy example together with nginx (behind the proxy) and wrk.
>>>>>>
>>>>>> It is docker-based, but the main scripts (start.sh and test.sh) can be used without Docker, too. I've used our own Dockerfiles to build the images, but I'm not sure if that makes any difference. I've added some log files from runs that crashed in different places; for me, the tests crash on each run, but not always in the same place.
>>>>>>
>>>>>> The TCP retransmit problem happens with http_static when using the debug build. When using the release build, an unrelated crash in timer_remove() happens instead. The SVM FIFO crash happens when using the proxy; it can happen with both release and debug builds.
>>>>>>
>>>>>> Please see the repo [1] for details and crash logs.
>>>>>>
>>>>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>>>>
>>>>>> P.S. As the tests do expose some problems with the VPP host stack and some of the VPP plugins/examples, maybe we should consider adding them to the VPP CI, too?

>>>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:

>>>>>>> Hi Ivan,
>>>>>>>
>>>>>>> Thanks for the detailed report!
>>>>>>>
>>>>>>> I assume this is a situation where most of the connections time out and the rate limiting we apply on the pending timer queue delays handling for long enough to end up in a situation like the one you described. Here's a draft patch that starts tracking pending timers [1]. Let me know if it solves the first problem.
>>>>>>>
>>>>>>> Regarding the second, it looks like the first chunk in the fifo is not properly initialized or is corrupted. It's hard to tell what leads to that, given that I haven't seen this sort of issue even with larger numbers of connections. You could maybe try calling svm_fifo_is_sane() in the enqueue/dequeue functions, or after the proxy allocates/shares the fifos, to catch the issue as early as possible.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Florin
>>>>>>>
>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
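For illustration, a minimal, runnable Python sketch of the timer race being discussed. This is not VPP code and the names are invented; it only models the pattern: expired handles are collected into a pending batch, and if a handle is stopped (and its slot reused) before the batch is dispatched, the callback fires against the wrong object unless pending timers are tracked and invalidated on stop.

# Toy timer wheel that batches expired handles before running callbacks,
# mimicking a rate-limited pending queue.
class ToyTimerWheel:
    def __init__(self, track_pending=False):
        self.timers = {}           # handle -> (expiry_tick, user_object)
        self.pending = []          # handles collected but not yet dispatched
        self.track_pending = track_pending
        self.next_handle = 0

    def start(self, expiry_tick, user_object):
        handle = self.next_handle
        self.next_handle += 1
        self.timers[handle] = (expiry_tick, user_object)
        return handle

    def stop(self, handle):
        self.timers.pop(handle, None)
        if self.track_pending:
            # The fix: a stopped timer must also be removed from the
            # already-collected pending batch.
            self.pending = [h for h in self.pending if h != handle]

    def collect_expired(self, now):
        self.pending.extend(h for h, (t, _) in self.timers.items() if t <= now)

    def dispatch_pending(self):
        for handle in self.pending:
            if handle in self.timers:
                _, user_object = self.timers.pop(handle)
                print("  timer fired for", user_object)
        self.pending.clear()


for track in (False, True):
    print("track_pending =", track)
    wheel = ToyTimerWheel(track_pending=track)
    h = wheel.start(expiry_tick=1, user_object="connection A")
    wheel.collect_expired(now=1)   # handle expires and is queued as pending
    wheel.stop(h)                  # connection A is closed, timer "reset"
    # Simulate handle reuse by another connection before dispatch; without
    # tracking, the stale pending handle now fires against connection B.
    wheel.timers[h] = (1, "connection B (reused slot)")
    wheel.dispatch_pending()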
>>>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:

>>>>>>> Hi,
>>>>>>> I'm working on the Travelping UPF project: https://github.com/travelping/vpp
>>>>>>> For a variety of reasons, it's presently maintained as a fork of VPP that's rebased on top of upstream master from time to time, but really it's just a plugin. During a 40K TCP connection test with netem, I found an issue with a TCP timer race (timers firing after tcp_timer_reset() was called for them), which I tried to work around, only to stumble into another crash, which I'm presently debugging (an SVM FIFO bug, possibly), but maybe some of you folks have ideas what it could be.
>>>>>>>
>>>>>>> I've described my findings in this JIRA ticket: https://jira.fd.io/browse/VPP-1923
>>>>>>>
>>>>>>> Although the upstream commit UPF is presently based on (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) is from some time ago, I believe the problems are still relevant, as there have been no changes in these parts of the code in master since that commit.

> <0002-Fix-initializing-elog-track-for-accepted-TCP-connect.patch>
> <0001-Fix-bad-session-lookups-after-failed-session_connect.patch>

--
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807