Hi,
I've found a problem with the timer fix and commented in Gerrit [1]
accordingly.
Basically this change [2] makes the tcp_prepare_retransmit_segment() issue
go away for me.

Concerning the proxy example, I can no longer see the SVM FIFO crashes, but
when using a debug build, VPP crashes with this error (full log [3]) during
my test:
/usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857
(tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1],
tcp_buffer_hdr (b[1]))' fails

When using a release build, it produces a lot of messages like this instead:
/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error
state CLOSE_WAIT flags 0x02 SYN
/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error
state FIN_WAIT_2 flags 0x12 SYN ACK
/usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error
state TIME_WAIT flags 0x12 SYN ACK

and also

/usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557
failed!

[1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
[2]
https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
[3]
https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71

I will look into src/vcl/test/test_vcl.py to see if I can reproduce
something like my test there, thanks!
I'm also waiting for Dave's input concerning the CSIT part, of course.
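
Just to make the "make test" idea a bit more concrete, here is a rough,
untested sketch of what such a test might look like under test/. The class
name, the veth/interface names (vpphost0/vpphost1) and the addresses are made
up for illustration, and the part that actually starts the proxy and drives
traffic (e.g. with wrk) is only indicated by a comment:

import subprocess
import unittest

from framework import VppTestCase, VppTestRunner


class TestHostStackStress(VppTestCase):
    """ Hypothetical host-stack stress test over a host interface """

    @classmethod
    def setUpClass(cls):
        super(TestHostStackStress, cls).setUpClass()
        # Create a veth pair in the kernel: one end stays on the Linux side,
        # the other will be grabbed by VPP as an af_packet host interface.
        subprocess.check_call(["ip", "link", "add", "vpphost0",
                               "type", "veth", "peer", "name", "vpphost1"])
        subprocess.check_call(["ip", "link", "set", "vpphost0", "up"])
        subprocess.check_call(["ip", "link", "set", "vpphost1", "up"])
        subprocess.check_call(["ip", "addr", "add", "10.10.1.1/24",
                               "dev", "vpphost0"])

    @classmethod
    def tearDownClass(cls):
        # Deleting one end of the veth pair removes both ends.
        subprocess.call(["ip", "link", "del", "vpphost0"])
        super(TestHostStackStress, cls).tearDownClass()

    def test_many_connections(self):
        # Attach the peer end to VPP and give it an address.
        self.vapi.cli("create host-interface name vpphost1")
        self.vapi.cli("set interface state host-vpphost1 up")
        self.vapi.cli("set interface ip address host-vpphost1 10.10.1.2/24")
        # Here the proxy / http_static app would be configured and traffic
        # driven from the Linux side (e.g. by running wrk against 10.10.1.2),
        # then the session/error counters checked for crashes or leaks.
        self.logger.info(self.vapi.cli("show session verbose"))


if __name__ == '__main__':
    unittest.main(testRunner=VppTestRunner)

The cleanup and any namespace handling would obviously need more care, and the
heavy traffic generation is probably what makes this too slow for the verify
jobs, but hopefully it shows the general shape.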


On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Ivan,
>
> Thanks for the test. After modifying it a bit to run straight from
> binaries, I managed to repro the issue. As expected, the proxy is not
> cleaning up the sessions correctly (example apps do run out of sync ..).
> Here’s a quick patch that solves some of the obvious issues [1] (note that
> it’s chained with gerrit 27952). I didn’t do too much testing, so let me
> know if you hit some other problems. As far as I can tell, 27952 is needed.
>
> As for the CI, I guess there are two types of tests we might want (cc-ing
> Dave since he has experience with this):
> - a functional test that could live as part of the “make test” infra. The
> host stack already has some functional integration tests, i.e., the vcl
> tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could
> do something similar for the proxy app, but the tests need to be lightweight
> as they’re run as part of the verify jobs.
> - CSIT scale/performance tests. We could use something like your scripts
> to test the proxy but also ld_preload + nginx and other applications. Dave
> should have more opinions here :-)
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>
> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> Concerning the CI: I'd be glad to add that test to "make test", but I'm not
> sure how to approach it. The test is not about containers but more about
> using network namespaces and some tools like wrk to create a lot of TCP
> connections to do some "stress testing" of the VPP host stack (and, as was
> noted, it fails not only on the proxy example, but also on the http_static
> plugin). It's probably doable without any external tooling at all, and even
> without network namespaces, using only VPP's own TCP stack, but that is
> probably rather hard. Could you suggest some ideas on how it could be
> added to "make test"? Should I add a `test_....py` under `tests/` that
> creates host interfaces in VPP and uses these via OS networking instead of
> the packet generator? As far as I can see there's something like that in
> the srv6-mobile plugin [1].
>
> [1]
> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>
> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> I missed the point about the CI in my other reply. If we can somehow
>> integrate some container-based tests into the “make test” infra, I wouldn’t
>> mind at all! :-)
>>
>> Regards,
>> Florin
>>
>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Hi,
>> Sadly, the patch apparently didn't work. It should have worked, but for
>> some reason it didn't ...
>>
>> On the bright side, I've made a test case [1] using fresh upstream VPP
>> code with no UPF that reproduces the issues I mentioned, including both the
>> timer and the TCP retransmit ones, along with some other possible problems,
>> using the http_static plugin and the proxy example (together with nginx and
>> wrk).
>>
>> It is Docker-based, but the main scripts (start.sh and test.sh) can be
>> used without Docker, too.
>> I've used our own Dockerfiles to build the images, but I'm not sure if
>> that makes any difference.
>> I've added some log files from runs that crashed. For me, the tests crash
>> on every run, but in different places.
>>
>> The TCP retransmit problem happens with http_static when using the debug
>> build. When using a release build, some unrelated crash in timer_remove()
>> happens instead.
>> The SVM FIFO crash happens when using the proxy. It can happen with both
>> release and debug builds.
>>
>> Please see the repo [1] for details and crash logs.
>>
>> [1] https://github.com/ivan4th/vpp-tcp-test
>>
>> P.S. As the tests do expose some problems with the VPP host stack and some
>> of the VPP plugins/examples, maybe we should consider adding them to the
>> VPP CI, too?
>>
>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>
>>> Thanks for the detailed report!
>>>
>>> I assume this is a situation where most of the connections time out and
>>> the rate limiting we apply on the pending timer queue delays handling for
>>> long enough to end up in the state you described. Here’s a draft
>>> patch that starts tracking pending timers [1]. Let me know if it solves the
>>> first problem.
>>>
>>> Regarding the second, it looks like the first chunk in the fifo is not
>>> properly initialized or is corrupted. It’s hard to tell what leads to that,
>>> given that I haven’t seen this sort of issue even with larger numbers of
>>> connections. You could maybe try calling svm_fifo_is_sane() in the
>>> enqueue/dequeue functions, or after the proxy allocates/shares the fifos,
>>> to catch the issue as early as possible.
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>
>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>
>>>   Hi,
>>>   I'm working on the Travelping UPF project
>>> https://github.com/travelping/vpp. For a variety of reasons, it's presently
>>> maintained as a fork of VPP that's rebased on top of upstream master from
>>> time to time, but really it's just a plugin. During a 40K TCP connection
>>> test with netem, I found an issue with a TCP timer race (timers firing
>>> after tcp_timer_reset() was called for them), which I tried to work around
>>> only to stumble into another crash, which I'm presently debugging (an SVM
>>> FIFO bug, possibly), but maybe some of you folks have some ideas about what
>>> it could be.
>>>   I've described my findings in this JIRA ticket:
>>> https://jira.fd.io/browse/VPP-1923
>>>   Although the last upstream commit that UPF is presently based on
>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) dates back some time, I believe
>>> the problems are still relevant, as there have been no changes in these
>>> parts of the code in master since that commit.
>>>
>>
>

-- 
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807