Hi,

I've found a problem with the timer fix and commented in Gerrit [1] accordingly. Basically, this change [2] makes the tcp_prepare_retransmit_segment() issue go away for me.
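The timer race discussed in this thread (a timer whose expiration has already been queued for deferred, rate-limited handling still "fires" after tcp_timer_reset() was called for it) can be illustrated with a minimal, self-contained C sketch. It shows one generic guard: tracking pending expirations per connection and clearing that state on reset, so a stale queued expiration becomes a no-op. All names here (conn_t, timer_expire(), timer_reset(), process_pending()) are hypothetical; this is not VPP's tcp_timer.h code and not the change in Gerrit 27952 or in [2].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timer ids, loosely modeled on per-connection TCP timers. */
enum { TIMER_RETRANSMIT = 0, TIMER_WAITCLOSE = 1 };

typedef struct
{
  uint32_t index;
  uint8_t armed;   /* bit t set: timer t is armed in the timer wheel */
  uint8_t pending; /* bit t set: timer t expired, handler not run yet */
  int retransmits; /* how often the retransmit handler actually ran */
} conn_t;

/* Expirations are queued and drained later with a budget (rate limit). */
#define QUEUE_MAX 64
typedef struct
{
  uint32_t conn;
  int timer;
} expiry_t;
static expiry_t queue[QUEUE_MAX];
static unsigned q_head, q_tail;

static void
timer_start (conn_t *c, int t)
{
  c->armed |= (uint8_t) (1u << t);
}

/* Called when the wheel fires: mark the timer pending, defer the handler. */
static void
timer_expire (conn_t *c, int t)
{
  if (!(c->armed & (1u << t)))
    return;
  assert (q_tail - q_head < QUEUE_MAX); /* sketch: no overflow handling */
  c->armed &= (uint8_t) ~(1u << t);
  c->pending |= (uint8_t) (1u << t);
  queue[q_tail++ % QUEUE_MAX] = (expiry_t){ c->index, t };
}

/* Reset clears the pending bit too, so an already-queued expiration
 * turns into a no-op instead of running after the reset. */
static void
timer_reset (conn_t *c, int t)
{
  c->armed &= (uint8_t) ~(1u << t);
  c->pending &= (uint8_t) ~(1u << t);
}

/* Drain at most 'budget' queued expirations; returns handlers actually run. */
static int
process_pending (conn_t *conns, int budget)
{
  int handled = 0;
  while (q_head != q_tail && budget-- > 0)
    {
      expiry_t e = queue[q_head++ % QUEUE_MAX];
      conn_t *c = &conns[e.conn];
      if (!(c->pending & (1u << e.timer)))
        continue; /* stale: the timer was reset after it expired */
      c->pending &= (uint8_t) ~(1u << e.timer);
      if (e.timer == TIMER_RETRANSMIT)
        c->retransmits++;
      handled++;
    }
  return handled;
}
```

For example, if two connections' retransmit timers expire and connection 1 is then reset (say, an ACK arrived) before the rate-limited drain reaches its queued entry, process_pending() runs the handler for connection 0 only. Without the pending bit, the handler for connection 1 would run against a connection whose timer had already been reset, which is the kind of situation described in the original report.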
Concerning the proxy example, I can no longer see the SVM FIFO crashes, but when using the debug build, VPP crashes with this error (full log [3]) during my test:

/usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], tcp_buffer_hdr (b[1]))' fails

When using the release build, it produces a lot of messages like this instead:

/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error state CLOSE_WAIT flags 0x02 SYN
/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error state FIN_WAIT_2 flags 0x12 SYN ACK
/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error state TIME_WAIT flags 0x12 SYN ACK

and also

/usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557 failed!

[1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
[2] https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
[3] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71

I will look into src/vcl/test/test_vcl.py to see if I can reproduce something like my test there, thanks! And I'm waiting for Dave's input concerning the CSIT part too, of course.

On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Ivan,
>
> Thanks for the test. After modifying it a bit to run straight from
> binaries, I managed to repro the issue. As expected, the proxy is not
> cleaning up the sessions correctly (example apps do run out of sync ..).
> Here’s a quick patch that solves some of the obvious issues [1] (note that
> it’s chained with gerrit 27952). I didn’t do too much testing, so let me
> know if you hit some other problems. As far as I can tell, 27952 is needed.
>
> As for the CI, I guess there are two types of tests we might want (cc-ing
> Dave since he has experience with this):
> - functional test that could live as part of “make test” infra.
> The host stack already has some functional integration tests, i.e., the vcl tests in
> src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could do
> something similar for the proxy app, but the tests need to be lightweight
> as they’re run as part of the verify jobs
> - CSIT scale/performance tests. We could use something like your scripts
> to test the proxy but also ld_preload + nginx and other applications. Dave
> should have more opinions here :-)
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>
> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> Concerning the CI: I'd be glad to add that test to "make test", but I'm not
> sure how to approach it. The test is not about containers but more about
> using network namespaces and some tools like wrk to create a lot of TCP
> connections to do some "stress testing" of the VPP host stack (and as was
> noted, it fails not only on the proxy example, but also on the http_static
> plugin). It's probably doable w/o any external tooling at all, and even
> without the network namespaces, using only VPP's own TCP stack, but
> that is probably rather hard. Could you suggest some ideas on how it could be
> added to "make test"? Should I add a `test_....py` under `tests/` that
> creates host interfaces in VPP and uses these via OS networking instead of
> the packet generator? As far as I can see, there's something like that in
> the srv6-mobile plugin [1].
>
> [1] https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>
> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> I missed the point about the CI in my other reply. If we can somehow
>> integrate some container based tests into the “make test” infra, I wouldn’t
>> mind at all! :-)
>>
>> Regards,
>> Florin
>>
>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Hi,
>> sadly the patch apparently didn't work.
>> It should have worked, but for some reason it didn't ...
>>
>> On the bright side, I've made a test case [1] using fresh upstream VPP
>> code with no UPF that reproduces the issues I mentioned, including both
>> the timer and the TCP retransmit ones, as well as some other possible
>> problems, using the http_static plugin and the proxy example together
>> with nginx (with proxy) and wrk.
>>
>> It is Docker-based, but the main scripts (start.sh and test.sh) can be
>> used without Docker, too.
>> I've used our own Dockerfiles to build the images, but I'm not sure if
>> that makes any difference.
>> I've added some log files resulting from the runs that crashed in
>> different places. For me, the tests crash on each run, but in different
>> places.
>>
>> The TCP retransmit problem happens with http_static when using the debug
>> build. When using the release build, an unrelated crash in timer_remove()
>> happens instead.
>> The SVM FIFO crash happens when using the proxy. It can happen with both
>> release and debug builds.
>>
>> Please see the repo [1] for details and crash logs.
>>
>> [1] https://github.com/ivan4th/vpp-tcp-test
>>
>> P.S. As the tests do expose some problems with the VPP host stack and some
>> of the VPP plugins/examples, maybe we should consider adding them to the
>> VPP CI, too?
>>
>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>
>>> Thanks for the detailed report!
>>>
>>> I assume this is a situation where most of the connections time out and
>>> the rate limiting we apply on the pending timer queue delays handling for
>>> long enough to be in a situation like the one you described. Here’s a draft
>>> patch that starts tracking pending timers [1]. Let me know if it solves the
>>> first problem.
>>>
>>> Regarding the second, it looks like the first chunk in the fifo is either
>>> not properly initialized or corrupted.
>>> It’s hard to tell what leads to that, given
>>> that I haven’t seen this sort of issue even with a larger number of
>>> connections. You could maybe try calling svm_fifo_is_sane() in the
>>> enqueue/dequeue functions, or after the proxy allocates/shares the fifos, to
>>> catch the issue as early as possible.
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>
>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>
>>> Hi,
>>> I'm working on the Travelping UPF project
>>> (https://github.com/travelping/vpp). For a variety of reasons, it's
>>> presently maintained as a fork of VPP that's rebased on top of upstream
>>> master from time to time, but really it's just a plugin. During a 40K TCP
>>> connection test with netem, I found an issue with a TCP timer race (timers
>>> firing after tcp_timer_reset() was called for them), which I tried to work
>>> around, only to stumble into another crash, which I'm presently debugging
>>> (an SVM FIFO bug, possibly), but maybe some of you folks have some ideas
>>> what it could be.
>>> I've described my findings in this JIRA ticket:
>>> https://jira.fd.io/browse/VPP-1923
>>> Although the last upstream commit UPF is presently based on
>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1)
>>> was some time ago, I believe the problems are still relevant, as there
>>> have been no changes in these parts of the code in master since that commit.
>>
>> --
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
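Florin's suggestion of calling svm_fifo_is_sane() in the enqueue/dequeue paths amounts to checking the FIFO's invariants on entry to and exit from every operation, so that corruption is caught close to where it happens rather than at a later crash. A minimal sketch of that pattern, assuming a hypothetical single-chunk ring buffer: fifo_t, fifo_is_sane(), fifo_enqueue(), fifo_dequeue() and FIFO_CHECK() are illustrative names only, and VPP's svm_fifo_t is chunk-based with different fields and checks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-chunk byte FIFO; VPP's svm_fifo_t is chunk-based. */
typedef struct
{
  uint32_t size; /* capacity in bytes; must be a power of two */
  uint32_t head; /* consumer cursor, only ever incremented */
  uint32_t tail; /* producer cursor, only ever incremented */
  uint8_t *data;
} fifo_t;

/* Returns 1 if the FIFO's invariants hold, 0 otherwise. */
static int
fifo_is_sane (const fifo_t *f)
{
  if (!f || !f->data || f->size == 0)
    return 0;
  /* A power-of-two size keeps (cursor % size) consistent when the
   * 32-bit cursors wrap around. */
  if (f->size & (f->size - 1))
    return 0;
  /* Unsigned subtraction is wrap-safe: at most 'size' bytes queued. */
  if (f->tail - f->head > f->size)
    return 0;
  return 1;
}

/* Checked on entry and exit of every operation; compiled out by -DNDEBUG. */
#define FIFO_CHECK(f) assert (fifo_is_sane (f))

static uint32_t
fifo_max_dequeue (const fifo_t *f)
{
  return f->tail - f->head;
}

/* Copies up to 'len' bytes in; returns the number actually enqueued. */
static uint32_t
fifo_enqueue (fifo_t *f, const uint8_t *src, uint32_t len)
{
  FIFO_CHECK (f);
  uint32_t space = f->size - fifo_max_dequeue (f);
  if (len > space)
    len = space;
  for (uint32_t i = 0; i < len; i++)
    f->data[(f->tail + i) % f->size] = src[i];
  f->tail += len;
  FIFO_CHECK (f);
  return len;
}

/* Copies up to 'len' bytes out; returns the number actually dequeued. */
static uint32_t
fifo_dequeue (fifo_t *f, uint8_t *dst, uint32_t len)
{
  FIFO_CHECK (f);
  uint32_t avail = fifo_max_dequeue (f);
  if (len > avail)
    len = avail;
  for (uint32_t i = 0; i < len; i++)
    dst[i] = f->data[(f->head + i) % f->size];
  f->head += len;
  FIFO_CHECK (f);
  return len;
}
```

Because FIFO_CHECK() is built on assert(), it compiles away with -DNDEBUG, so checks like these cost nothing in a release build while a debug build stops at the first operation that sees a broken FIFO.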
View/Reply Online (#17056): https://lists.fd.io/g/vpp-dev/message/17056