Hi, sadly the patch apparently didn't work. It should have worked but for some reason it didn't ...
On the bright side, I've made a test case [1] using fresh upstream VPP
code with no UPF that reproduces the issues I mentioned, including both
the timer and the TCP retransmit ones, along with some other possible
problems. The test case uses the http_static plugin and the proxy
example, together with nginx (with proxy) and wrk.

It is Docker-based, but the main scripts (start.sh and test.sh) can be
used without Docker, too. I've used our own Dockerfiles to build the
images, but I'm not sure if that makes any difference.

I've added some log files resulting from the runs that crashed in
different places. For me, the tests crash on each run, but in different
places.

The TCP retransmit problem happens with http_static when using the debug
build. When using the release build, an unrelated crash in timer_remove()
happens instead.

The SVM FIFO crash happens when using the proxy. It can happen with both
release and debug builds.

Please see the repo [1] for details and crash logs.

[1] https://github.com/ivan4th/vpp-tcp-test

P.S. As the tests do expose some problems with the VPP host stack and some
of the VPP plugins/examples, maybe we should consider adding them to the
VPP CI, too?

On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Ivan,
>
> Thanks for the detailed report!
>
> I assume this is a situation where most of the connections time out and
> the rate limiting we apply on the pending timer queue delays handling for
> long enough to be in a situation like the one you described. Here’s a draft
> patch that starts tracking pending timers [1]. Let me know if it solves the
> first problem.
>
> Regarding the second, it looks like the first chunk in the fifo is not
> properly initialized/corrupted. It’s hard to tell what leads to that given
> that I haven’t seen this sort of issues even with larger number of
> connections.
> You could maybe try calling svm_fifo_is_sane() in the
> enqueue/dequeue functions, or after the proxy allocates/shares the fifos to
> catch the issue as early as possible.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>
> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>
> Hi,
> I'm working on the Travelping UPF project
> https://github.com/travelping/vpp
> For variety of reasons, it's presently maintained as a fork of UPF that's
> rebased on top of upstream master from time to time, but really it's just a
> plugin. During 40K TCP connection test with netem, I found an issue with
> TCP timer race (timers firing after tcp_timer_reset() was called for them)
> which I tried to work around only to stumble into another crash, which I'm
> presently debugging (an SVM FIFO bug, possibly) but maybe some of you folks
> have some ideas what it could be.
> I've described my findings in this JIRA ticket:
> https://jira.fd.io/browse/VPP-1923
> Although the last upstream commit UPF is presently based on
> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1)
> was some time ago, I believe the problems are still relevant as there
> were no changes in these parts of code in master since that commit.

--
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807