Hi Ivan,

Thanks for the test. After modifying it a bit to run straight from binaries, I managed to repro the issue. As expected, the proxy is not cleaning up the sessions correctly (example apps do run out of sync ...). Here’s a quick patch that solves some of the obvious issues [1] (note that it’s chained with gerrit 27952). I didn’t do too much testing, so let me know if you hit some other problems. As far as I can tell, 27952 is needed.

As for the CI, I guess there are two types of tests we might want (cc-ing Dave since he has experience with this):

- functional tests that could live as part of the “make test” infra. The host stack already has some functional integration tests, i.e., the vcl tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could do something similar for the proxy app, but the tests need to be lightweight as they’re run as part of the verify jobs
- CSIT scale/performance tests. We could use something like your scripts to test the proxy but also ld_preload + nginx and other applications.

Dave should have more opinions here :-)

Regards,
Florin

[1] https://gerrit.fd.io/r/c/vpp/+/28041

> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> Concerning the CI: I'd be glad to add that test to "make test", but not sure
> how to approach it. The test is not about containers but more about using
> network namespaces and some tools like wrk to create a lot of TCP connections
> to do some "stress testing" of VPP host stack (and as it was noted, it fails
> not only on the proxy example, but also on http_static plugin). It's probably
> doable w/o any external tooling at all, and even without the network
> namespaces either, using only VPP's own TCP stack, but that is probably
> rather hard. Could you suggest some ideas how it could be added to "make
> test"? Should I add a `test_....py` under `tests/` that creates host
> interfaces in VPP and uses these via OS networking instead of the packet
> generator? As far as I can see there's something like that in srv6-mobile
> plugin [1].
>
> [1]
> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>
> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com> wrote:
> I missed the point about the CI in my other reply. If we can somehow
> integrate some container based tests into the “make test” infra, I wouldn’t
> mind at all! :-)
>
> Regards,
> Florin
>
>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Hi,
>> sadly the patch apparently didn't work. It should have worked but for some
>> reason it didn't ...
>>
>> On the bright side, I've made a test case [1] using fresh upstream VPP code
>> with no UPF that reproduces the issues I mentioned, including both timer and
>> TCP retransmit one along with some other possible problems using http_static
>> plugin and the proxy example, along with nginx (with proxy) and wrk.
>>
>> It is docker-based, but the main scripts (start.sh and test.sh) can be used
>> without Docker, too.
>> I've used our own Dockerfiles to build the images, but I'm not sure if that
>> makes any difference.
>> I've added some log files resulting from the runs that crashed in different
>> places. For me, the tests crash on each run, but in different places.
>>
>> The TCP retransmit problem happens with http_static when using the debug
>> build. When using release build, some unrelated crash in timer_remove()
>> happens instead.
>> The SVM FIFO crash happens when using the proxy. It can happen with both
>> release and debug builds.
>>
>> Please see the repo [1] for details and crash logs.
>>
>> [1] https://github.com/ivan4th/vpp-tcp-test
>>
>> P.S. As the tests do expose some problems with VPP host stack and some of
>> the VPP plugins/examples, maybe we should consider adding them to the VPP
>> CI, too?
>>
>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>> Hi Ivan,
>>
>> Thanks for the detailed report!
>>
>> I assume this is a situation where most of the connections time out and the
>> rate limiting we apply on the pending timer queue delays handling for long
>> enough to be in a situation like the one you described. Here’s a draft patch
>> that starts tracking pending timers [1]. Let me know if it solves the first
>> problem.
>>
>> Regarding the second, it looks like the first chunk in the fifo is not
>> properly initialized/corrupted. It’s hard to tell what leads to that given
>> that I haven’t seen this sort of issue even with a larger number of
>> connections. You could maybe try calling svm_fifo_is_sane() in the
>> enqueue/dequeue functions, or after the proxy allocates/shares the fifos to
>> catch the issue as early as possible.
>>
>> Regards,
>> Florin
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>
>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>
>>> Hi,
>>> I'm working on the Travelping UPF project
>>> https://github.com/travelping/vpp
>>> For a variety of reasons, it's presently maintained as a fork of UPF that's
>>> rebased on top of upstream master from time to time, but really it's just a
>>> plugin. During 40K TCP connection test with netem, I found an issue with
>>> TCP timer race (timers firing after tcp_timer_reset() was called for them)
>>> which I tried to work around only to stumble into another crash, which I'm
>>> presently debugging (an SVM FIFO bug, possibly) but maybe some of you folks
>>> have some ideas what it could be.
>>> I've described my findings in this JIRA ticket:
>>> https://jira.fd.io/browse/VPP-1923
>>> Although the last upstream commit UPF is presently based on
>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) was some time ago, I believe
>>> the problems are still relevant as there were no changes in these parts of
>>> code in master since that commit.
>>
>> --
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
>
> --
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807
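
As a rough illustration of the "create a lot of TCP connections" style of stress test discussed in this thread, here is a minimal sketch that uses only Python's standard library. It is an assumption-laden stand-in: it does not involve VPP, the "make test" framework, network namespaces, or wrk, and the local echo server merely takes the place of the application under test (e.g. the proxy or http_static). All names (handle_echo, one_connection, stress, run_stress) are hypothetical.

```python
import asyncio


async def handle_echo(reader, writer):
    """Echo everything received back to the client until it closes."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def one_connection(host, port, payload):
    """Open one TCP connection, send the payload and verify the echo."""
    reader, writer = await asyncio.open_connection(host, port)
    try:
        writer.write(payload)
        await writer.drain()
        echoed = await reader.readexactly(len(payload))
        return echoed == payload
    finally:
        writer.close()
        await writer.wait_closed()


async def stress(n_connections=200, payload=b"ping" * 64, concurrency=50):
    """Run n_connections short-lived connections, at most `concurrency` at once.

    Returns the number of connections whose echo matched the payload.
    """
    # Port 0 lets the OS pick a free port; the echo server is only a
    # placeholder for the real application under test.
    server = await asyncio.start_server(handle_echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    limit = asyncio.Semaphore(concurrency)

    async def limited():
        async with limit:
            return await one_connection("127.0.0.1", port, payload)

    async with server:
        results = await asyncio.gather(
            *(limited() for _ in range(n_connections))
        )
    return sum(results)


def run_stress(n_connections=200):
    """Synchronous entry point, convenient to call from a test case."""
    return asyncio.run(stress(n_connections))
```

A test case could then simply assert that run_stress(n) returns n. Keeping the connection count and concurrency small is one way to keep such a check lightweight enough for verify jobs, leaving larger scale runs to a separate performance setup.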