Hi Ivan, 

Thanks for the test. After modifying it a bit to run straight from the binaries, I 
managed to reproduce the issue. As expected, the proxy is not cleaning up sessions 
correctly (the example apps do tend to fall out of sync ...). Here’s a quick patch 
that solves some of the obvious issues [1] (note that it’s chained with gerrit 
27952). I didn’t do much testing, so let me know if you hit other problems. As far 
as I can tell, 27952 is needed. 

As for the CI, I guess there are two types of tests we might want (cc-ing Dave 
since he has experience with this):
- a functional test that could live as part of the “make test” infra. The host 
stack already has some functional integration tests, e.g., the vcl tests in 
src/vcl/test/test_vcl.py (quic, tls and tcp have some as well). We could do 
something similar for the proxy app, but the tests would need to be lightweight 
since they’re run as part of the verify jobs (a rough sketch follows the list)
- CSIT scale/performance tests. We could use something like your scripts to 
test the proxy but also ld_preload + nginx and other applications. Dave should 
have more opinions here :-)
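
To make the first option a bit more concrete, here’s a rough sketch of what such a 
test could look like. This is purely illustrative: the class name, CLI strings and 
URIs below are my placeholders, not an actual proxy test, and they’d need to match 
whatever the proxy app really exposes.

    import unittest

    from framework import VppTestCase, VppTestRunner


    class TestProxyApp(VppTestCase):
        """Proxy application functional test (illustrative skeleton only)."""

        def test_proxy_basic(self):
            """Start the proxy and push a little traffic through it."""
            # Enable the session layer and start the proxy app; the CLI below
            # is a placeholder, check the proxy code for the real syntax.
            self.vapi.cli("session enable")
            self.vapi.cli("test proxy server server-uri tcp://0.0.0.0/1234 "
                          "client-uri tcp://10.0.0.2/5678")
            # Traffic generation would go here, e.g. a small VCL client/server
            # pair as in test_vcl.py, kept light so verify jobs stay fast.


    if __name__ == '__main__':
        unittest.main(testRunner=VppTestRunner)

Something along those lines would get picked up automatically by “make test”, so 
the main work is keeping the traffic part small and deterministic.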

Regards, 
Florin 

[1] https://gerrit.fd.io/r/c/vpp/+/28041

> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
> 
> Concerning the CI: I'd be glad to add that test to "make test", but I'm not 
> sure how to approach it. The test is not really about containers; it's more 
> about using network namespaces and tools like wrk to create a lot of TCP 
> connections and "stress test" the VPP host stack (and, as noted, it fails not 
> only on the proxy example but also on the http_static plugin). It's probably 
> doable without any external tooling, and perhaps even without the network 
> namespaces, using only VPP's own TCP stack, but that is probably rather hard. 
> Could you suggest some ideas on how it could be added to "make test"? Should I 
> add a `test_....py` under `tests/` that creates host interfaces in VPP and 
> uses them via OS networking instead of the packet generator? As far as I can 
> see, there's something like that in the srv6-mobile plugin [1].
> 
> [1] https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
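> 
> Just to illustrate the kind of thing I have in mind, a very rough sketch (the 
> interface names, addresses and CLI strings below are guesses on my part, and 
> the OS-side veth setup is omitted):
> 
>     import subprocess
> 
>     from framework import VppTestCase, VppTestRunner
> 
> 
>     class TestHostStackStress(VppTestCase):
>         """Host stack stress test driven via a host interface (sketch)."""
> 
>         def test_http_static_many_connections(self):
>             # Attach VPP to a pre-created veth pair so OS tools can reach
>             # it; names and addresses here are placeholders.
>             self.vapi.cli("create host-interface name vpp1out")
>             self.vapi.cli("set interface state host-vpp1out up")
>             self.vapi.cli("set interface ip address host-vpp1out 10.10.1.2/24")
>             self.vapi.cli("http static server www-root /tmp/www uri tcp://0.0.0.0/80")
>             # Open many connections from the OS side (wrk just as an example)
>             # and fail if VPP crashes or the connections stall.
>             subprocess.check_call(["wrk", "-c", "1000", "-d", "10s",
>                                    "http://10.10.1.2/index.html"])
> 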
> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com> wrote:
> I missed the point about the CI in my other reply. If we can somehow 
> integrate some container-based tests into the “make test” infra, I wouldn’t 
> mind at all! :-)
> 
> Regards,
> Florin
> 
>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>> 
>> Hi,
>> sadly, the patch apparently didn't work. It should have, but for some reason 
>> it didn't ...
>> 
>> On the bright side, I've made a test case [1] using fresh upstream VPP code 
>> with no UPF that reproduces the issues I mentioned (both the timer one and 
>> the TCP retransmit one), along with some other possible problems, using the 
>> http_static plugin and the proxy example, plus nginx (behind the proxy) and 
>> wrk.
>> 
>> It is docker-based, but the main scripts (start.sh and test.sh) can be used 
>> without Docker, too.
>> I've used our own Dockerfiles to build the images, but I'm not sure if that 
>> makes any difference. 
>> I've added log files from several runs; for me, the tests crash on every 
>> run, but in different places.
>> 
>> The TCP retransmit problem happens with http_static when using the debug 
>> build. With the release build, an unrelated crash in timer_remove() happens 
>> instead.
>> The SVM FIFO crash happens when using the proxy; it can happen with both 
>> release and debug builds.
>> 
>> Please see the repo [1] for details and crash logs.
>> 
>> [1] https://github.com/ivan4th/vpp-tcp-test
>> 
>> P.S. Since the tests do expose some problems with the VPP host stack and 
>> some of the VPP plugins/examples, maybe we should consider adding them to 
>> the VPP CI, too?
>> 
>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>> Hi Ivan, 
>> 
>> Thanks for the detailed report!
>> 
>> I assume this is a situation where most of the connections time out, and the 
>> rate limiting we apply to the pending timer queue delays handling long 
>> enough to end up in the situation you described. Here’s a draft patch that 
>> starts tracking pending timers [1]. Let me know if it solves the first 
>> problem. 
>> 
>> Regarding the second, it looks like the first chunk in the fifo is either 
>> not properly initialized or corrupted. It’s hard to tell what leads to that, 
>> given that I haven’t seen this sort of issue even with larger numbers of 
>> connections. You could maybe try calling svm_fifo_is_sane() in the 
>> enqueue/dequeue functions, or after the proxy allocates/shares the fifos, to 
>> catch the issue as early as possible. 
>> 
>> Regards, 
>> Florin
>> 
>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>> 
>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>> 
>>>   Hi,
>>>   I'm working on the Travelping UPF project 
>>> (https://github.com/travelping/vpp). For a variety of reasons, it's 
>>> presently maintained as a fork of VPP that's rebased on top of upstream 
>>> master from time to time, but really it's just a plugin. During a 40K TCP 
>>> connection test with netem, I found an issue with a TCP timer race (timers 
>>> firing after tcp_timer_reset() was called for them), which I tried to work 
>>> around, only to stumble into another crash that I'm presently debugging (an 
>>> SVM FIFO bug, possibly), but maybe some of you folks have ideas about what 
>>> it could be.
>>>   I've described my findings in this JIRA ticket: 
>>> https://jira.fd.io/browse/VPP-1923
>>>   Although the upstream commit UPF is presently based on 
>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) is somewhat old, I believe the 
>>> problems are still relevant, as there have been no changes to these parts 
>>> of the code in master since that commit. 
>> 
>> 
>> 
>> -- 
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
> 
> 
> 
> -- 
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807
