Hi Florin,
I re-verified the patches and the modified patch doesn't crash either, so I
think it's safe to merge it.
Thanks!

I will try to see what the remaining problem with http_static is.

On Fri, Jul 24, 2020 at 8:15 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Ivan,
>
> Adding Vanessa to see if she can help with the account issues.
>
> Thanks a lot for the patches! Pushed them here [1] and [2]. I took the
> liberty of slightly changing [2], so if you get a chance, do try it out
> again.
>
> Finally, the static http server still needs fixes. Most probably it
> mishandles cleanups and corrupts the sessions/connections under heavy load.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/28063
> [2] https://gerrit.fd.io/r/c/vpp/+/28064
>
>
> On Jul 24, 2020, at 9:27 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>
> I did a bit more debugging and found an issue that was causing invalid TCP
> connection lookups.
> Basically, if session_connected_callback failed for an app (in the case of
> the proxy, e.g. because the other, corresponding connection got closed), it
> left an invalid entry in the session lookup table.
> Another, less serious issue I found while debugging is that when TCP elog is
> enabled in tcp_debug.h, the TCP stack crashes upon accepting connections
> because of a misplaced TCP_EVT.
> After fixing these issues, my test no longer crashes with the proxy
> example. That being said, http_static plugin still crashes with the timer
> error.
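>
> To illustrate the first fix, here is a rough sketch of the idea only (the
> actual change is in the attached patch; the helper names below are just
> assumptions):
>
>   /* error path in the connect-notify code: if the app's
>      session_connected_callback fails, drop the session's entry from the
>      lookup table before freeing it, so later TCP packets can't be matched
>      against a stale entry */
>   if (rv /* app rejected the connection */)
>     {
>       session_lookup_del_session (s);   /* assumed cleanup helper */
>       session_free (s);
>     }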
>
> I tried to submit both patches to fd.io Gerrit, but as I hadn't submitted
> any CRs to this Gerrit before, I ran into a problem with my old email in the
> Linux Foundation account settings. I've fixed that and then updated the email
> address in fd.io Gerrit too, but I'm still getting the same error:
> remote: ERROR: commit e98791a: email address ivan...@gmail.com is not
> registered in your account, and you lack 'forge committer' permission.
>
> Because of these issues I'm attaching both patches here for now.
>
> On Thu, Jul 23, 2020 at 6:47 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Ah, I didn’t try running test.sh 80. The only difference in how I’m
>> running the test is that I start vpp outside of start.sh straight from
>> binaries.
>>
>> Regards,
>> Florin
>>
>> On Jul 23, 2020, at 8:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>
>> Well, I always run the same test, the difference being only
>> "test.sh 80" for http_static (it's configured to listen on that port)
>> or just "test.sh" for the proxy. As far as I understand, you run the
>> tests without using the containers; does that include setting up netem like
>> this [1]?
>>
>> [1] https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35
>>
>> On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>
>>> Updated [1] but I’m not seeing [3] after several test iterations.
>>>
>>> Probably the static server needs the same treatment as the proxy. Are
>>> you running a slightly different test? All of the builtin apps have the
>>> potential to crash vpp or leave the host stack in an unwanted state since
>>> they run inline.
>>>
>>> Either way, to solve this, the first step would be to get rid of errors
>>> like “no http session for thread 0 session_index x”. I will eventually try
>>> to look into it if nobody beats me to it.
>>>
>>> Regards,
>>> Florin
>>>
>>> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>
>>> http_static produces some errors:
>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http
>>> session for thread 0 session_index 4124
>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http
>>> session for thread 0 session_index 4124
>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp
>>> error state CLOSE_WAIT flags 0x02 SYN
>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp
>>> error state CLOSE_WAIT flags 0x02 SYN
>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp
>>> error state CLOSE_WAIT flags 0x02 SYN
>>>
>>> also with multiple errors for different TCP connection states, related to
>>> connections being closed and then receiving SYN / SYN+ACK.
>>> The release build crashes [1] (this already happened before, so it's
>>> unrelated to any of the fixes):
>>>
>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response header!
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, elt=<optimized
>>> out>)
>>>     at /src/vpp/src/vppinfra/tw_timer_template.c:154
>>> 154     /src/vpp/src/vppinfra/tw_timer_template.c: No such file or
>>> directory.
>>> #0  0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828,
>>>     elt=<optimized out>) at /src/vpp/src/vppinfra/tw_timer_template.c:154
>>> #1  tw_timer_stop_2t_1w_2048sl (
>>>     tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306)
>>>     at /src/vpp/src/vppinfra/tw_timer_template.c:374
>>> #2  0x00007fffb076146f in http_static_server_session_timer_stop
>>> (hs=<optimized out>)
>>>     at /src/vpp/src/plugins/http_static/static_server.c:126
>>> #3  http_static_server_rx_tx_callback (s=0x7fffb5e13a40,
>>> cf=CALLED_FROM_RX)
>>>     at /src/vpp/src/plugins/http_static/static_server.c:1026
>>> #4  0x00007fffb0760eb8 in http_static_server_rx_callback (
>>>     s=0x7fffb0967728 <http_static_server_main+288>)
>>>     at /src/vpp/src/plugins/http_static/static_server.c:1037
>>> #5  0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized
>>> out>, s=0x7fffb5e13a40)
>>>     at /src/vpp/src/vnet/session/application_worker.c:485
>>> #6  app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40)
>>>     at /src/vpp/src/vnet/session/application_worker.c:691
>>> #7  0x00007ffff7713d9a in session_enqueue_notify_inline
>>> (s=0x7fffb5e13a40)
>>>     at /src/vpp/src/vnet/session/session.c:632
>>> #8  0x00007ffff7713fd1 in session_main_flush_enqueue_events
>>> (transport_proto=<optimized out>,
>>>     thread_index=0) at /src/vpp/src/vnet/session/session.c:736
>>> #9  0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0
>>> <vlib_global_main>,
>>>     node=<optimized out>, frame=<optimized out>, is_ip4=1) at
>>> /src/vpp/src/vnet/tcp/tcp_input.c:1558
>>> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 <vlib_global_main>,
>>> node=<optimized out>,
>>>     from_frame=0x7fffb5458480) at /src/vpp/src/vnet/tcp/tcp_input.c:1573
>>> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0
>>> <vlib_global_main>, node=0x7fffb4baf400,
>>>     type=VLIB_NODE_TYPE_INTERNAL,
>>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>,
>>>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194
>>> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>,
>>> pending_frame_index=<optimized out>,
>>>     last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353
>>> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at
>>> /src/vpp/src/vlib/main.c:1848
>>> #14 vlib_main_loop (vm=<optimized out>) at /src/vpp/src/vlib/main.c:1976
>>> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0
>>> <vlib_global_main>, input=0x7fffb4762fb0)
>>>     at /src/vpp/src/vlib/main.c:2222
>>> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at
>>> /src/vpp/src/vlib/unix/main.c:660
>>> #17 0x00007ffff4fa9ec4 in clib_calljmp () from
>>> /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09
>>> #18 0x00007fffffffd8b0 in ?? ()
>>> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>,
>>> argv=<optimized out>)
>>>     at /src/vpp/src/vlib/unix/main.c:733
>>>
>>> [1]
>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log
>>>
>>> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io <ivan4th=
>>> gmail....@lists.fd.io> wrote:
>>>
>>>> Hi,
>>>> I've found a problem with the timer fix and commented in Gerrit [1]
>>>> accordingly.
>>>> Basically this change [2] makes the tcp_prepare_retransmit_segment()
>>>> issue go away for me.
>>>>
>>>> Concerning the proxy example, I can no longer see the SVM FIFO crashes,
>>>> but when using a debug build, VPP crashes with this error (full log [3])
>>>> during my test:
>>>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857
>>>> (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1],
>>>> tcp_buffer_hdr (b[1]))' fails
>>>>
>>>> When using a release build, it produces a lot of messages like this
>>>> instead:
>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 disp
>>>> error state CLOSE_WAIT flags 0x02 SYN
>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp
>>>> error state FIN_WAIT_2 flags 0x12 SYN ACK
>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 disp
>>>> error state TIME_WAIT flags 0x12 SYN ACK
>>>>
>>>> and also
>>>>
>>>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection 85557
>>>> failed!
>>>>
>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39
>>>> [2]
>>>> https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca
>>>> [3]
>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71
>>>>
>>>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce
>>>> something like my test there, thanks!
>>>> And I'm waiting for Dave's input concerning the CSIT part, too, of course.
>>>>
>>>>
>>>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> Thanks for the test. After modifying it a bit to run straight from
>>>>> binaries, I managed to repro the issue. As expected, the proxy is not
>>>>> cleaning up the sessions correctly (the example apps do tend to fall out
>>>>> of sync). Here’s a quick patch that solves some of the obvious issues [1]
>>>>> (note that it’s chained with gerrit 27952). I didn’t do too much testing,
>>>>> so let me know if you hit some other problems. As far as I can tell, 27952
>>>>> is needed.
>>>>>
>>>>> As for the CI, I guess there are two types of tests we might want
>>>>> (cc-ing Dave since he has experience with this):
>>>>> - functional tests that could live as part of the “make test” infra. The
>>>>> host stack already has some functional integration tests, i.e., the vcl
>>>>> tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We could
>>>>> do something similar for the proxy app, but the tests need to be
>>>>> lightweight, as they’re run as part of the verify jobs
>>>>> - CSIT scale/performance tests. We could use something like your
>>>>> scripts to test the proxy, but also ld_preload + nginx and other
>>>>> applications. Dave should have more opinions here :-)
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/28041
>>>>>
>>>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> wrote:
>>>>>
>>>>> Concerning the CI: I'd be glad to add that test to "make test", but I'm
>>>>> not sure how to approach it. The test is not about containers, but more
>>>>> about using network namespaces and tools like wrk to create a lot of
>>>>> TCP connections and do some "stress testing" of the VPP host stack (and, as
>>>>> was noted, it fails not only on the proxy example, but also on the
>>>>> http_static plugin). It's probably doable without any external tooling at
>>>>> all, and even without the network namespaces, using only VPP's own TCP
>>>>> stack, but that is probably rather hard. Could you suggest how it could be
>>>>> added to "make test"? Should I add a `test_....py` under `tests/` that
>>>>> creates host interfaces in VPP and uses them via OS networking instead of
>>>>> the packet generator? As far as I can see, there's something like that in
>>>>> the srv6-mobile plugin [1].
>>>>>
>>>>> [1]
>>>>> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125
>>>>>
>>>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras <fcoras.li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I missed the point about the CI in my other reply. If we can somehow
>>>>>> integrate some container-based tests into the “make test” infra, I wouldn’t
>>>>>> wouldn’t
>>>>>> mind at all! :-)
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> sadly, the patch apparently didn't work. It should have worked, but for
>>>>>> some reason it didn't ...
>>>>>>
>>>>>> On the bright side, I've made a test case [1] using fresh upstream
>>>>>> VPP code with no UPF that reproduces the issues I mentioned, including
>>>>>> both the timer and the TCP retransmit one, along with some other possible
>>>>>> problems in the http_static plugin and the proxy example, using nginx
>>>>>> (with the proxy) and wrk.
>>>>>>
>>>>>> It is Docker-based, but the main scripts (start.sh and test.sh) can
>>>>>> be used without Docker, too.
>>>>>> I've used our own Dockerfiles to build the images, but I'm not sure
>>>>>> if that makes any difference.
>>>>>> I've added some log files from the runs that crashed. For me, the tests
>>>>>> crash on every run, but in different places.
>>>>>>
>>>>>> The TCP retransmit problem happens with http_static when using the
>>>>>> debug build. When using the release build, an unrelated crash
>>>>>> in timer_remove() happens instead.
>>>>>> The SVM FIFO crash happens when using the proxy, with both the release
>>>>>> and debug builds.
>>>>>>
>>>>>> Please see the repo [1] for details and crash logs.
>>>>>>
>>>>>> [1] https://github.com/ivan4th/vpp-tcp-test
>>>>>>
>>>>>> P.S. As the tests do expose some problems with the VPP host stack and
>>>>>> some of the VPP plugins/examples, maybe we should consider adding them
>>>>>> to the VPP CI, too?
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras <fcoras.li...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ivan,
>>>>>>>
>>>>>>> Thanks for the detailed report!
>>>>>>>
>>>>>>> I assume this is a situation where most of the connections time out
>>>>>>> and the rate limiting we apply to the pending timer queue delays handling
>>>>>>> long enough to end up in the state you described. Here’s a
>>>>>>> draft patch that starts tracking pending timers [1]. Let me know if it
>>>>>>> solves the first problem.
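>>>>>>>
>>>>>>> Roughly, the idea is something like this (only a sketch; the field and
>>>>>>> handler names here are made up, and the actual patch may do it
>>>>>>> differently):
>>>>>>>
>>>>>>>   /* when a timer pops, only run the handler if the connection still
>>>>>>>      considers it pending; a reset/stop that raced with the expiration
>>>>>>>      clears the bit, so the stale expiration is ignored */
>>>>>>>   if (!(tc->pending_timers & (1 << timer_id)))
>>>>>>>     return;                      /* timer was stopped after it popped */
>>>>>>>   tc->pending_timers &= ~(1 << timer_id);
>>>>>>>   timer_handlers[timer_id] (tc); /* hypothetical dispatch table */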
>>>>>>>
>>>>>>> Regarding the second, it looks like the first chunk in the fifo is
>>>>>>> either not properly initialized or corrupted. It’s hard to tell what leads
>>>>>>> to that, given that I haven’t seen this sort of issue even with a larger
>>>>>>> number of connections. You could maybe try calling svm_fifo_is_sane() in
>>>>>>> the enqueue/dequeue functions, or after the proxy allocates/shares the
>>>>>>> fifos, to catch the issue as early as possible.
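>>>>>>>
>>>>>>> For instance, a debug-only check along these lines (just a sketch using
>>>>>>> the existing svm_fifo_is_sane() helper, assuming a session s with its
>>>>>>> rx/tx fifos at hand; exact placement is up to you):
>>>>>>>
>>>>>>>   /* catch fifo corruption as close to the source as possible,
>>>>>>>      e.g. right after the proxy allocates or shares the fifos */
>>>>>>>   ASSERT (svm_fifo_is_sane (s->rx_fifo));
>>>>>>>   ASSERT (svm_fifo_is_sane (s->tx_fifo));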
>>>>>>>
>>>>>>> Regards,
>>>>>>> Florin
>>>>>>>
>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952
>>>>>>>
>>>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote:
>>>>>>>
>>>>>>>   Hi,
>>>>>>>   I'm working on the Travelping UPF project
>>>>>>> https://github.com/travelping/vpp. For a variety of reasons, it's
>>>>>>> presently maintained as a fork of VPP that's rebased on top of upstream
>>>>>>> master from time to time, but really it's just a plugin. During a 40K TCP
>>>>>>> connection test with netem, I found a TCP timer race (timers firing after
>>>>>>> tcp_timer_reset() was called for them), which I tried to work around, only
>>>>>>> to stumble into another crash that I'm presently debugging (an SVM FIFO
>>>>>>> bug, possibly), but maybe some of you folks have ideas about what it could
>>>>>>> be.
>>>>>>>   I've described my findings in this JIRA ticket:
>>>>>>> https://jira.fd.io/browse/VPP-1923
>>>>>>>   Although the last upstream commit UPF is presently based on
>>>>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) dates back some time, I believe
>>>>>>> the problems are still relevant, as there have been no changes in these
>>>>>>> parts of the code in master since that commit.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ivan Shvedunov <ivan...@gmail.com>
>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8
>>>>>> 0BC5 2807
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ivan Shvedunov <ivan...@gmail.com>
>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8
>>>>> 0BC5 2807
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ivan Shvedunov <ivan...@gmail.com>
>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
>>>> 2807
>>>> 
>>>>
>>>
>>>
>>> --
>>> Ivan Shvedunov <ivan...@gmail.com>
>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
>>> 2807
>>>
>>>
>>>
>>
>> --
>> Ivan Shvedunov <ivan...@gmail.com>
>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
>> 2807
>>
>>
>>
>
> --
> Ivan Shvedunov <ivan...@gmail.com>
> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5
> 2807
> <0002-Fix-initializing-elog-track-for-accepted-TCP-connect.patch>
> <0001-Fix-bad-session-lookups-after-failed-session_connect.patch>
>
>
>

-- 
Ivan Shvedunov <ivan...@gmail.com>
;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9  F7D0 613E C0F8 0BC5 2807