Well, I always run the same test, the only difference being
"test.sh 80" for http_static (it's configured to listen on that port)
or just "test.sh" for the proxy. As far as I understand, you run the tests
without using the containers; does that include setting up netem like this
[1]?

[1] https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35

On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Ivan,
>
> Updated [1] but I’m not seeing [3] after several test iterations.
>
> Probably the static server needs the same treatment as the proxy. Are you
> running a slightly different test? All of the builtin apps have the
> potential to crash vpp or leave the host stack in an unwanted state since
> they run inline.
>
> Either way, to solve this, first step would be to get rid of error like,
> “no http session for thread 0 session_index x”. Will eventually try to look
> into it if nobody beats me to it.
>
> Regards,
> Florin
>
> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> http_static produces some errors:
> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session
> for thread 0 session_index 4124
> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http session
> for thread 0 session_index 4124
> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp
> error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp
> error state CLOSE_WAIT flags 0x02 SYN
> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp
> error state CLOSE_WAIT flags 0x02 SYN
>
> also with multiple different TCP connection states, related to connections
> being closed receiving SYN / SYN+ACK.
> The release build crashes (this already happened before, so it's unrelated to
> any of the fixes [1]):
>
> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized
> out>)
>     at /src/vpp/src/vppinfra/tw_timer_template.c:154
> 154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or
> directory.
> #0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828,
>     elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
> #1  tw_timer_stop_2t_1w_2048sl (
>     tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306)
>     at /src/vpp/src/vppinfra/tw_timer_template.c:374
> #2  0x00007fffb076146f in http_static_server_session_timer_stop
> (hs=<optimized out>)
>     at /src/vpp/src/plugins/http_static/static_server.c:126
> #3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40, cf=CALLED_FROM_RX)
>     at /src/vpp/src/plugins/http_static/static_server.c:1026
> #4  0x00007fffb0760eb8 in http_static_server_rx_callback (
>     s=0x7fffb0967728 <http_static_server_main+288>)
>     at /src/vpp/src/plugins/http_static/static_server.c:1037
> #5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized out>,
> s=0x7fffb5e13a40)
>     at /src/vpp/src/vnet/session/application_worker.c:485
> #6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40)
>     at /src/vpp/src/vnet/session/application_worker.c:691
> #7  0x00007ffff7713d9a in session_enqueue_notify_inline (s=0x7fffb5e13a40)
>     at /src/vpp/src/vnet/session/session.c:632
> #8  0x00007ffff7713fd1 in session_main_flush_enqueue_events
> (transport_proto=<optimized out>,
>     thread_index=0) at /src/vpp/src/vnet/session/session.c:736
> #9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0
> <vlib_global_main>,
>     node=<optimized out>, frame=<optimized out>, is_ip4=1) at
> /src/vpp/src/vnet/tcp/tcp_input.c:1558
> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>,
> node=<optimized out>,
>     from_frame=0x7fffb5458480) at /src/vpp/src/vnet/tcp/tcp_input.c:1573
> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0
> <vlib_global_main>, node=0x7fffb4baf400,
>     type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING,
> frame=<optimized out>,
>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>,
> pending_frame_index=<optimized out>,
>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at
> /src/vpp/src/vlib/main.c:1848
> #14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 <vlib_global_main>,
> input=0x7fffb4762fb0)
>     at /src/vpp/src/vlib/main.c:2222
> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at
> /src/vpp/src/vlib/unix/main.c:660
> #17 0x00007ffff4fa9ec4 in clib_calljmp () from
> /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
> #18 0x00007fffffffd8b0 in ?? ()
> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>,
> argv=<optimized out>)
>     at /src/vpp/src/vlib/unix/main.c:733
>
> [1]
> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log
>
> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io <ivan4th=
> gmail....@lists.fd.io> wrote:
>
>> Hi,
>> I've found a problem with the timer fix and commented in Gerrit [1]
>> accordingly.
>> Basically this change [2] makes the tcp_prepare_retransmit_segment()
>> issue go away for me.
>>
>> Concerning the proxy example, I can no longer see the SVM FIFO crashes,
>> but when using the debug build, VPP crashes with this error (full log [3])
>> during my test:
>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857
>> (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1],
>> tcp_buffer_hdr (b[1]))' fails
>>
>> When using the release build, it produces a lot of messages like this instead:
>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp
>> error state CLOSE_WAIT flags 0x02 SYN
>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp
>> error state FIN_WAIT_2 flags 0x12 SYN ACK
>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp
>> error state TIME_WAIT flags 0x12 SYN ACK
>>
>> and also
>>
>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557
>> failed!
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
>> [2]
>> https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
>> [3]
>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>>
>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce
>> something like my test there, thanks!
>> And I'm waiting for Dave's input on the CSIT part too, of course.
>>
>>
>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>
>>> Thanks for the test. After modifying it a bit to run straight from
>>> binaries, I managed to repro the issue. As expected, the proxy is not
>>> cleaning up the sessions correctly (the example apps do tend to fall out of sync ...).
>>> Here’s a quick patch that solves some of the obvious issues [1] (note that
>>> it’s chained with gerrit 27952). I didn’t do too much testing, so let me
>>> know if you hit some other problems. As far as I can tell, 27952 is needed.
>>>
>>> As for the CI, I guess there are two types of tests we might want
>>> (cc-ing Dave since he has experience with this):
>>> - functional tests that could live as part of the “make test” infra. The
>>> host stack already has some functional integration tests, i.e., the vcl
>>> tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could
>>> do something similar for the proxy app, but the tests need to be
>>> lightweight as they’re run as part of the verify jobs.
>>> - CSIT scale/performance tests. We could use something like your scripts
>>> to test the proxy, but also ld_preload + nginx and other applications. Dave
>>> should have more opinions here :-)
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>>>
>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>
>>> Concerning the CI: I'd be glad to add that test to "make test", but I'm
>>> not sure how to approach it. The test is not about containers but more
>>> about using network namespaces and tools like wrk to create a lot of TCP
>>> connections and do some "stress testing" of the VPP host stack (and, as
>>> noted, it fails not only on the proxy example but also on the http_static
>>> plugin). It's probably doable without any external tooling at all, and even
>>> without the network namespaces, using only VPP's own TCP stack, but that is
>>> probably rather hard. Could you suggest how it could be added to "make
>>> test"? Should I add a `test_....py` under `tests/` that creates host
>>> interfaces in VPP and uses them via OS networking instead of the packet
>>> generator? As far as I can see, there's something like that in the
>>> srv6-mobile plugin [1].
>>>
>>> [1]
>>> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>>>
>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> I missed the point about the CI in my other reply. If we can somehow
>>>> integrate some container-based tests into the “make test” infra, I wouldn’t
>>>> mind at all! :-)
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>> Sadly, the patch apparently didn't work. It should have, but for some
>>>> reason it didn't...
>>>>
>>>> On the bright side, I've made a test case [1] using fresh upstream VPP
>>>> code with no UPF that reproduces the issues I mentioned, including both
>>>> the timer and the TCP retransmit ones, along with some other possible
>>>> problems in the http_static plugin and the proxy example, using nginx
>>>> (with the proxy) and wrk.
>>>>
>>>> It is Docker-based, but the main scripts (start.sh and test.sh) can be
>>>> used without Docker, too.
>>>> I've used our own Dockerfiles to build the images, but I'm not sure if
>>>> that makes any difference.
>>>> I've added some log files from the runs; for me, the tests crash on every
>>>> run, but in different places.
>>>>
>>>> The TCP retransmit problem happens with http_static when using the
>>>> debug build. When using the release build, an unrelated crash in
>>>> timer_remove() happens instead.
>>>> The SVM FIFO crash happens when using the proxy; it can happen with
>>>> both release and debug builds.
>>>>
>>>> Please see the repo [1] for details and crash logs.
>>>>
>>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>>
>>>> P.S. As the tests do expose some problems with the VPP host stack and
>>>> some of the VPP plugins/examples, maybe we should consider adding them to
>>>> the VPP CI, too?
>>>>
>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> Thanks for the detailed report!
>>>>>
>>>>> I assume this is a case where most of the connections time out and the
>>>>> rate limiting we apply on the pending timer queue delays handling long
>>>>> enough to end up in the situation you described. Here’s a draft patch
>>>>> that starts tracking pending timers [1]. Let me know if it solves the
>>>>> first problem.
>>>>>
>>>>> Regarding the second, it looks like the first chunk in the fifo is either
>>>>> not properly initialized or corrupted. It’s hard to tell what leads to
>>>>> that, given that I haven’t seen this sort of issue even with larger
>>>>> numbers of connections. You could maybe try calling svm_fifo_is_sane() in
>>>>> the enqueue/dequeue functions, or after the proxy allocates/shares the
>>>>> fifos, to catch the issue as early as possible.
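>>>>> For instance, something along these lines (just a sketch; the helper and
>>>>> its call sites are hypothetical, only svm_fifo_is_sane() itself is real):
>>>>>
>>>>> #include <svm/svm_fifo.h>
>>>>>
>>>>> /* Hypothetical helper: assert fifo sanity right after the proxy
>>>>>  * allocates or shares a pair of fifos, and again around enqueue/dequeue,
>>>>>  * so corruption is caught as close to its source as possible. */
>>>>> static void
>>>>> proxy_check_fifos (svm_fifo_t * rx_fifo, svm_fifo_t * tx_fifo)
>>>>> {
>>>>>   ASSERT (svm_fifo_is_sane (rx_fifo));
>>>>>   ASSERT (svm_fifo_is_sane (tx_fifo));
>>>>> }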
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>>>
>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>>>
>>>>>   Hi,
>>>>>   I'm working on the Travelping UPF project,
>>>>> https://github.com/travelping/vpp. For a variety of reasons, it's
>>>>> presently maintained as a fork of VPP that's rebased on top of upstream
>>>>> master from time to time, but really it's just a plugin. During a 40K TCP
>>>>> connection test with netem, I found an issue with a TCP timer race
>>>>> (timers firing after tcp_timer_reset() was called for them), which I
>>>>> tried to work around only to stumble into another crash that I'm
>>>>> presently debugging (an SVM FIFO bug, possibly), but maybe some of you
>>>>> folks have some ideas what it could be.
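>>>>>   To illustrate the shape of that race (purely illustrative C; none of
>>>>> these names are VPP's): the problem is an expiry callback running for a
>>>>> handle whose owner has already reset it, so the callback has to detect
>>>>> the stale handle and drop the event.
>>>>>
>>>>> #include <stdint.h>
>>>>>
>>>>> #define TIMER_HANDLE_INVALID ((uint32_t) ~0)
>>>>>
>>>>> typedef struct
>>>>> {
>>>>>   uint32_t timer_handle;	/* valid only while the timer is armed */
>>>>> } conn_t;
>>>>>
>>>>> /* Expiry callbacks are delivered in batches; by the time this runs, the
>>>>>  * owner may already have reset (or rearmed) the timer, so only act if
>>>>>  * the stored handle still matches the handle that fired. */
>>>>> static void
>>>>> timer_expired_cb (conn_t * c, uint32_t fired_handle)
>>>>> {
>>>>>   if (c->timer_handle != fired_handle)
>>>>>     return;			/* stale expiry: timer was reset/reused */
>>>>>   c->timer_handle = TIMER_HANDLE_INVALID;
>>>>>   /* ... handle the timeout ... */
>>>>> }
>>>>>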
>>>>>   I've described my findings in this JIRA ticket:
>>>>> https://jira.fd.io/browse/VPP-1923
>>>>>   Although the last upstream commit UPF is presently based on
>>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) dates from some time ago, I
>>>>> believe the problems are still relevant, as there have been no changes in
>>>>> these parts of the code in master since that commit.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ivan Shvedunov <ivan...@gmail.com>
>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
>>>> 2807
>>>>
>>>>
>>>>
>>>
>>> --
>>> Ivan Shvedunov <ivan...@gmail.com>
>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
>>> 2807
>>>
>>>
>>>
>>
>> --
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
>> 2807
>> 
>>
>
>
> --
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
> 2807
>
>
>

-- 
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807