http_static produces some errors:
/usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
/usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session for thread 0 session_index 4124
/usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
/usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp error state CLOSE_WAIT flags 0x02 SYN
/usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp error state CLOSE_WAIT flags 0x02 SYN

also with various other TCP connection states related to connections
being closed and then receiving SYN / SYN+ACK.
The release build crashes (this already happened before, so it's unrelated to
any of the fixes [1]):

/usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
/usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>)
    at /src/vpp/src/vppinfra/tw_timer_template.c:154
154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or directory.
#0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized out>)
    at /src/vpp/src/vppinfra/tw_timer_template.c:154
#1  tw_timer_stop_2t_1w_2048sl (tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306)
    at /src/vpp/src/vppinfra/tw_timer_template.c:374
#2  0x00007fffb076146f in http_static_server_session_timer_stop (hs=<optimized out>)
    at /src/vpp/src/plugins/http_static/static_server.c:126
#3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40, cf=CALLED_FROM_RX)
    at /src/vpp/src/plugins/http_static/static_server.c:1026
#4  0x00007fffb0760eb8 in http_static_server_rx_callback (s=0x7fffb0967728 <http_static_server_main+288>)
    at /src/vpp/src/plugins/http_static/static_server.c:1037
#5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40)
    at /src/vpp/src/vnet/session/application_worker.c:485
#6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40)
    at /src/vpp/src/vnet/session/application_worker.c:691
#7  0x00007ffff7713d9a in session_enqueue_notify_inline (s=0x7fffb5e13a40)
    at /src/vpp/src/vnet/session/session.c:632
#8  0x00007ffff7713fd1 in session_main_flush_enqueue_events (transport_proto=<optimized out>, thread_index=0)
    at /src/vpp/src/vnet/session/session.c:736
#9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, frame=<optimized out>, is_ip4=1)
    at /src/vpp/src/vnet/tcp/tcp_input.c:1558
#10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=<optimized out>, from_frame=0x7fffb5458480)
    at /src/vpp/src/vnet/tcp/tcp_input.c:1573
#11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, node=0x7fffb4baf400, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>, last_time_stamp=<optimized out>)
    at /src/vpp/src/vlib/main.c:1194
#12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, pending_frame_index=<optimized out>, last_time_stamp=<optimized out>)
    at /src/vpp/src/vlib/main.c:1353
#13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1)
    at /src/vpp/src/vlib/main.c:1848
#14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
#15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 <vlib_global_main>, input=0x7fffb4762fb0)
    at /src/vpp/src/vlib/main.c:2222
#16 0x00007ffff5bc2816 in thread0 (arg=140737318340288)
    at /src/vpp/src/vlib/unix/main.c:660
#17 0x00007ffff4fa9ec4 in clib_calljmp ()
    from /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
#18 0x00007fffffffd8b0 in ?? ()
#19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, argv=<optimized out>)
    at /src/vpp/src/vlib/unix/main.c:733

[1] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log

On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io <ivan4th=gmail....@lists.fd.io> wrote:

> Hi,
> I've found a problem with the timer fix and commented in Gerrit [1]
> accordingly.
> Basically this change [2] makes the tcp_prepare_retransmit_segment() issue
> go away for me.
>
> Concerning the proxy example, I can no longer see the SVM FIFO crashes,
> but when using the debug build, VPP crashes with this error (full log [3])
> during my test:
> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], tcp_buffer_hdr (b[1]))' fails
>
> When using the release build, it produces a lot of messages like this instead:
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp error state FIN_WAIT_2 flags 0x12 SYN ACK
> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp error state TIME_WAIT flags 0x12 SYN ACK
>
> and also
>
> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557 failed!
>
> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
> [2] https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
> [3] https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>
> I will look into src/vcl/test/test_vcl.py to see if I can reproduce
> something like my test there, thanks!
> And I'm waiting for Dave's input concerning the CSIT part, too, of course.
>
>
> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Hi Ivan,
>>
>> Thanks for the test. After modifying it a bit to run straight from
>> binaries, I managed to repro the issue. As expected, the proxy is not
>> cleaning up the sessions correctly (example apps do run out of sync ..).
>> Here’s a quick patch that solves some of the obvious issues [1] (note that
>> it’s chained with gerrit 27952). I didn’t do too much testing, so let me
>> know if you hit some other problems. As far as I can tell, 27952 is needed.
>>
>> As for the CI, I guess there are two types of tests we might want (cc-ing
>> Dave since he has experience with this):
>> - functional test that could live as part of “make test” infra. The host
>> stack already has some functional integration tests, i.e., the vcl tests in
>> src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could do
>> something similar for the proxy app, but the tests need to be lightweight
>> as they’re run as part of the verify jobs
>> - CSIT scale/performance tests. We could use something like your scripts
>> to test the proxy but also ld_preload + nginx and other applications. Dave
>> should have more opinions here :-)
>>
>> Regards,
>> Florin
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>>
>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Concerning the CI: I'd be glad to add that test to "make test", but I'm
>> not sure how to approach it. The test is not about containers but more
>> about using network namespaces and tools like wrk to create a lot of TCP
>> connections to do some "stress testing" of the VPP host stack (and, as
>> noted, it fails not only on the proxy example but also on the http_static
>> plugin). It's probably doable without any external tooling at all, and
>> even without network namespaces, using only VPP's own TCP stack, but that
>> is probably rather hard. Could you suggest some ideas on how it could be
>> added to "make test"? Should I add a `test_....py` under `tests/` that
>> creates host interfaces in VPP and uses them via OS networking instead of
>> the packet generator? As far as I can see, there's something like that in
>> the srv6-mobile plugin [1].
>>
>> [1]
>> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>>
>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> I missed the point about the CI in my other reply. If we can somehow
>>> integrate some container-based tests into the “make test” infra, I wouldn’t
>>> mind at all! :-)
>>>
>>> Regards,
>>> Florin
>>>
>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>
>>> Hi,
>>> sadly the patch apparently didn't work. It should have worked but for
>>> some reason it didn't ...
>>>
>>> On the bright side, I've made a test case [1] using fresh upstream VPP
>>> code with no UPF that reproduces the issues I mentioned, including both
>>> the timer and the TCP retransmit ones, along with some other possible
>>> problems in the http_static plugin and the proxy example, using nginx
>>> (with the proxy) and wrk.
>>>
>>> It is docker-based, but the main scripts (start.sh and test.sh) can be
>>> used without Docker, too.
>>> I've used our own Dockerfiles to build the images, but I'm not sure if
>>> that makes any difference.
>>> I've added some log files resulting from the crashed runs. For me, the
>>> tests crash on every run, but in a different place each time.
>>>
>>> The TCP retransmit problem happens with http_static when using the debug
>>> build. When using the release build, an unrelated crash in timer_remove()
>>> happens instead.
>>> The SVM FIFO crash happens when using the proxy. It can happen with both
>>> release and debug builds.
>>>
>>> Please see the repo [1] for details and crash logs.
>>>
>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>
>>> P.S. As the tests do expose some problems with the VPP host stack and some
>>> of the VPP plugins/examples, maybe we should consider adding them to the
>>> VPP CI, too?
>>>
>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ivan,
>>>>
>>>> Thanks for the detailed report!
>>>>
>>>> I assume this is a situation where most of the connections time out and
>>>> the rate limiting we apply on the pending timer queue delays handling for
>>>> long enough to end up in the state you described. Here’s a draft patch
>>>> that starts tracking pending timers [1]. Let me know if it solves the
>>>> first problem.
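>>>>
>>>> In sketch form, the idea behind tracking pending timers (a minimal
>>>> standalone illustration with hypothetical names, not the actual 27952
>>>> patch): expirations get collected into a batch and dispatched later, so
>>>> a timer can "fire" after the session has already reset it unless the
>>>> dispatcher re-validates the handle the session recorded:
>>>>
>>>> #include <stdint.h>
>>>> #include <stdio.h>
>>>>
>>>> #define TIMER_HANDLE_INVALID ((uint32_t) ~0)
>>>>
>>>> typedef struct
>>>> {
>>>>   uint32_t timer_handle; /* handle of the armed timer, or invalid */
>>>> } session_t;
>>>>
>>>> static session_t sessions[1];
>>>>
>>>> /* Dispatch one expiration from a previously collected batch. */
>>>> static void
>>>> dispatch_expired (uint32_t si, uint32_t handle)
>>>> {
>>>>   session_t *s = &sessions[si];
>>>>   /* If the stored handle no longer matches, the timer was reset after
>>>>      this expiration was queued -- drop the stale callback. */
>>>>   if (s->timer_handle != handle)
>>>>     {
>>>>       printf ("session %u: stale expiration dropped\n", (unsigned) si);
>>>>       return;
>>>>     }
>>>>   s->timer_handle = TIMER_HANDLE_INVALID;
>>>>   printf ("session %u: timer expired, handling\n", (unsigned) si);
>>>> }
>>>>
>>>> int
>>>> main (void)
>>>> {
>>>>   sessions[0].timer_handle = 7306;                 /* timer armed */
>>>>   /* an expiration for handle 7306 is queued here ... */
>>>>   sessions[0].timer_handle = TIMER_HANDLE_INVALID; /* ... then reset */
>>>>   dispatch_expired (0, 7306);   /* the stale expiration is ignored */
>>>>   return 0;
>>>> }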
>>>>
>>>> Regarding the second, it looks like the first chunk in the fifo is not
>>>> properly initialized, or is corrupted. It’s hard to tell what leads to
>>>> that, given that I haven’t seen this sort of issue even with larger
>>>> numbers of connections. You could maybe try calling svm_fifo_is_sane() in
>>>> the enqueue/dequeue functions, or after the proxy allocates/shares the
>>>> fifos, to catch the issue as early as possible.
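>>>>
>>>> As a sketch of that pattern (a toy fifo standing in for svm_fifo; the
>>>> invariant shown is illustrative, not svm_fifo's actual one), asserting
>>>> sanity at the enqueue boundaries makes corruption trip an assert close
>>>> to its source instead of much later:
>>>>
>>>> #include <assert.h>
>>>> #include <stdint.h>
>>>>
>>>> typedef struct
>>>> {
>>>>   uint32_t head;     /* dequeue cursor */
>>>>   uint32_t tail;     /* enqueue cursor */
>>>>   uint32_t size;     /* ring capacity, power of two */
>>>>   uint8_t data[256];
>>>> } toy_fifo_t;
>>>>
>>>> /* Toy analogue of svm_fifo_is_sane (): check structural invariants. */
>>>> static int
>>>> toy_fifo_is_sane (const toy_fifo_t *f)
>>>> {
>>>>   return f->size == sizeof (f->data) && f->tail - f->head <= f->size;
>>>> }
>>>>
>>>> static void
>>>> toy_fifo_enqueue (toy_fifo_t *f, const uint8_t *buf, uint32_t len)
>>>> {
>>>>   assert (toy_fifo_is_sane (f)); /* catch corruption on entry */
>>>>   assert (len <= f->size - (f->tail - f->head));
>>>>   for (uint32_t i = 0; i < len; i++)
>>>>     f->data[(f->tail + i) & (f->size - 1)] = buf[i];
>>>>   f->tail += len;
>>>>   assert (toy_fifo_is_sane (f)); /* ... and before returning */
>>>> }
>>>>
>>>> int
>>>> main (void)
>>>> {
>>>>   toy_fifo_t f = { .head = 0, .tail = 0, .size = 256 };
>>>>   toy_fifo_enqueue (&f, (const uint8_t *) "hello", 5);
>>>>   return 0;
>>>> }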
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>>
>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>>
>>>>   Hi,
>>>>   I'm working on the Travelping UPF project:
>>>> https://github.com/travelping/vpp. For a variety of reasons, it's
>>>> presently maintained as a fork of VPP that's rebased on top of upstream
>>>> master from time to time, but really it's just a plugin. During a 40K TCP
>>>> connection test with netem, I found an issue with a TCP timer race
>>>> (timers firing after tcp_timer_reset() was called for them), which I
>>>> tried to work around, only to stumble into another crash that I'm
>>>> presently debugging (an SVM FIFO bug, possibly); maybe some of you folks
>>>> have ideas about what it could be.
>>>>   I've described my findings in this JIRA ticket:
>>>> https://jira.fd.io/browse/VPP-1923
>>>>   Although the last upstream commit UPF is presently based on
>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) is from some time ago, I
>>>> believe the problems are still relevant, as there have been no changes to
>>>> these parts of the code in master since that commit.
>>>>
>>>>
>>>>
>>>


-- 
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807