Hi Florin,

While trying to fix the proxy cleanup issue, I've spotted another problem in the TCP stack, namely RSTs being ignored in the SYN_SENT (half-open) connection state: https://gerrit.fd.io/r/c/vpp/+/28103
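For context, the behaviour I'd expect there is just the RFC 793 rule for a RST arriving in SYN-SENT: the reset is honoured only if its ACK actually acknowledges our SYN (ISS < SEG.ACK <= SND.NXT), in which case the half-open connection is torn down and the application gets "connection refused"; any other RST is simply dropped. To illustrate what I mean, here is a minimal self-contained sketch of that acceptability check (all type and helper names below are made up for illustration; this is not the actual VPP code, the patch above is):

/* Illustrative sketch of RFC 793 RST handling in SYN-SENT.
 * Not the actual VPP code: half_open_conn_t and the helpers below
 * are made-up names; the flag defines just mirror the usual TCP
 * header bits. */
#include <stdbool.h>
#include <stdint.h>

#define TCP_FLAG_RST 0x04
#define TCP_FLAG_ACK 0x10

typedef struct
{
  uint32_t iss;      /* initial send sequence number (our SYN) */
  uint32_t snd_nxt;  /* next sequence number we would send */
} half_open_conn_t;

/* modular 32-bit sequence number comparisons */
static bool
seq_gt (uint32_t a, uint32_t b)
{
  return (int32_t) (a - b) > 0;
}

static bool
seq_leq (uint32_t a, uint32_t b)
{
  return (int32_t) (a - b) <= 0;
}

/* Returns true if the RST should terminate the half-open connection
 * (application sees "connection refused"); false if the segment
 * should just be dropped. */
static bool
syn_sent_rst_acceptable (const half_open_conn_t * c, uint8_t flags,
                         uint32_t seg_ack)
{
  if (!(flags & TCP_FLAG_RST))
    return false;
  /* A RST in SYN-SENT is valid only if it ACKs our SYN:
   * ISS < SEG.ACK <= SND.NXT.  A RST without a matching ACK is ignored. */
  return (flags & TCP_FLAG_ACK)
    && seq_gt (seg_ack, c->iss) && seq_leq (seg_ack, c->snd_nxt);
}

For the proxy this matters because the active side needs to hear about the failure ("connection refused" or a timeout) so it can also tear down the paired client-side session instead of letting it hang, which is what the second patch below tries to address.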
The following fix for handling failed active connections in the proxy has worked for me, but please confirm that it's correct: https://gerrit.fd.io/r/c/vpp/+/28104

After https://gerrit.fd.io/r/c/vpp/+/28096 I no longer see any "disp error" diagnostics for CLOSE_WAIT, but I still see them for SYNs in the TIME_WAIT and FIN_WAIT_2 states. I didn't enable reorder in netem; the config I'm using now is:

  ip netns exec client tc qdisc add dev client root netem \
    loss 15% delay 100ms 300ms duplicate 10%

I tried to abandon my older proxy patch that just removes the ASSERT (0), but somehow I've lost the ability to log into fd.io Gerrit with my Linux Foundation login/password, although I can still submit patches and log into identity.linuxfoundation.org just fine. Cc'ing Vanessa again here; could you please help with this?

On Tue, Jul 28, 2020 at 9:27 PM Florin Coras <fcoras.li...@gmail.com> wrote: > Hi Ivan, > > Inline. > > On Jul 28, 2020, at 8:45 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: > > Hi Florin, > thanks, the fix has worked and http_static no longer crashes. > > > Perfect, thanks for confirming! > > > I still get a number of messages like this when using release build: > /usr/bin/vpp[39]: state_sent_ok:954: BUG: couldn't send response header! > Not sure if it's actually a bug or the queue being actually full because > of the packet loss being caused by netem. > > > Not sure. Either that or there’s some issue with the send logic in some > corner case. If you have hsm->debug_level > 0 you should get a warning when > the enqueue fails because there’s no space left for sending. Note that we > probably need some logic to retry sending if enqueue fails (fill in usual > disclaimer that these are example test apps, unfortunately). I can try to > provide some guidance if anybody feels like tackling that specific problem. > > > There are also numerous messages like > /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2817: tcp conn 13836 disp > error state CLOSE_WAIT flags 0x02 SYN > but I guess that's caused by a large number of connections from the same > client => port reuse and, again, the packet loss. > > > Are you also inducing reorder? > > It seems the client closes the connection cleanly (fin) and before that is > completed, which as a first step would require a close from the app and a > move to last-ack, it sends a syn. Alternatively, reordering makes it look > that way. Those warnings are not harmful per se, because tcp will just drop > the packets, but we could try processing the syns to maybe force resets. If > you have the time to do it, try this [1] > > > I've spotted another crash in the proxy example for which I've submitted a > patch: > https://gerrit.fd.io/r/c/vpp/+/28085 > > One problem with this fix is that the proxy example is actually > incomplete, as the client connection should be dropped in case if the proxy > can't establish the corresponding outgoing connection (e.g. timeout or RST > b/c of "connection refused"), whereas currently the client-side connection > just hangs instead. I'll take a stab at fixing that as our proxy in UPF > also has problems with handling similar situations and it would be easier > for me to fix it in this simpler example first. But, in any case, crashing > VPP is probably much worse than just letting the connection hang, so maybe > we should remove that ASSERT after all. > > > Exactly. The assert there is just a reminder to eventually do the proper > cleanup. Thanks for looking into it!
> > Regards, > Florin > > [1] https://gerrit.fd.io/r/c/vpp/+/28096 > > > > On Mon, Jul 27, 2020 at 10:59 PM Florin Coras <fcoras.li...@gmail.com> > wrote: > >> Hi Ivan, >> >> Took a look at the static http server and, as far as I can tell, it has >> the same type of issue the proxy had, i.e., premature session >> cleanup/reuse. Does this solve the problem for you [1]? >> >> Also, merged your elog fix patch. Thanks! >> >> Regards, >> Florin >> >> [1] https://gerrit.fd.io/r/c/vpp/+/28076 >> >> On Jul 27, 2020, at 10:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: >> >> Hi. >> I've debugged http server issue a bit more and here are my observations: >> if I add an ASSERT(0) in the place of "No http session for thread 0 >> session_index 54", >> I get stack trace along the lines of >> Program received signal SIGABRT, Aborted. >> 0x00007ffff470bf47 in raise () from /lib/x86_64-linux-gnu/libc.so.6 >> #0 0x00007ffff470bf47 in raise () from /lib/x86_64-linux-gnu/libc.so.6 >> #1 0x00007ffff470d8b1 in abort () from /lib/x86_64-linux-gnu/libc.so.6 >> #2 0x0000000000407193 in os_panic () at /src/vpp/src/vpp/vnet/main.c:371 >> #3 0x00007ffff55ec619 in debugger () at /src/vpp/src/vppinfra/error.c:84 >> #4 0x00007ffff55ec397 in _clib_error (how_to_die=2, function_name=0x0, >> line_number=0, >> fmt=0x7fffb0e1324b "%s:%d (%s) assertion `%s' fails") >> at /src/vpp/src/vppinfra/error.c:143 >> #5 0x00007fffb0e0cc3d in http_static_server_rx_tx_callback >> (s=0x7fffb66447c0, >> cf=CALLED_FROM_RX) at >> /src/vpp/src/plugins/http_static/static_server.c:1030 >> #6 0x00007fffb0e0b437 in http_static_server_rx_callback >> (s=0x7fffb66447c0) >> at /src/vpp/src/plugins/http_static/static_server.c:1056 >> #7 0x00007ffff745a94e in app_worker_builtin_rx (app_wrk=0x7fffb59c7e00, >> s=0x7fffb66447c0) at >> /src/vpp/src/vnet/session/application_worker.c:485 >> #8 0x00007ffff745c00e in app_send_io_evt_rx (app_wrk=0x7fffb59c7e00, >> s=0x7fffb66447c0) >> at /src/vpp/src/vnet/session/application_worker.c:691 >> #9 0x00007ffff745b34e in app_worker_lock_and_send_event >> (app=0x7fffb59c7e00, >> s=0x7fffb66447c0, evt_type=0 '\000') >> at /src/vpp/src/vnet/session/application_worker.c:779 >> #10 0x00007ffff73eda5e in session_enqueue_notify_inline (s=0x7fffb66447c0) >> at /src/vpp/src/vnet/session/session.c:632 >> #11 0x00007ffff73eddbf in session_main_flush_enqueue_events >> (transport_proto=0 '\000', >> thread_index=0) at /src/vpp/src/vnet/session/session.c:736 >> #12 0x00007ffff680c2d4 in tcp46_rcv_process_inline ( >> vm=0x7ffff6477b80 <vlib_global_main>, node=0x7fffb522a680, >> from_frame=0x7fffb5a05f80, is_ip4=1) at >> /src/vpp/src/vnet/tcp/tcp_input.c:2478 >> #13 0x00007ffff680b5ea in tcp4_rcv_process_node_fn_hsw >> >> (line numbers maybe slightly off due to debug print). 
>> If I further dump the connection elog for such sessions I get >> /usr/bin/vpp[39]: http_static_server_rx_tx_callback:1028: ZZZZZ: >> connection elog: >> 92.494240039: close: cidx 52 >> 92.494241379: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 92.494241431: state: fin-wait-1 >> 92.495088424: out: flags 11, bytes 32 >> 92.922939929: timer-pop: retransmit cidx 52 sidx 54 >> 92.922978305: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 92.922978440: rcv_stat: rto 344 srtt 80 mrtt-us 39371 rttvar 66 >> 92.922978491: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 262 >> 92.922978711: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 92.922980365: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 92.923492112: out: flags 11, bytes 32 >> 93.711846499: timer-pop: retransmit cidx 52 sidx 54 >> 93.735094287: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 93.735094392: rcv_stat: rto 688 srtt 80 mrtt-us 39371 rttvar 66 >> 93.735094448: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt 1 >> 93.735094532: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 93.735095835: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 93.735582660: out: flags 11, bytes 32 >> 94.568121707: timer-pop: retransmit cidx 52 sidx 54 >> 94.584413829: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 94.584414153: rcv_stat: rto 1376 srtt 80 mrtt-us 39371 rttvar 66 >> 94.584414259: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 8192 >> 94.584414333: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 94.584417424: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 94.585024738: out: flags 11, bytes 32 >> 95.999473091: timer-pop: retransmit cidx 52 sidx 54 >> 96.088472363: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 96.088472647: rcv_stat: rto 2752 srtt 80 mrtt-us 39371 rttvar 66 >> 96.088472758: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 8192 >> 96.088472842: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 96.088475171: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 96.126505063: out: flags 11, bytes 32 >> 98.861064635: timer-pop: retransmit cidx 52 sidx 54 >> 98.898809592: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 98.898809726: rcv_stat: rto 5504 srtt 80 mrtt-us 39371 rttvar 66 >> 98.898809768: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 8192 >> 98.898809828: cc: rxt-timeout snd_space 2233 snd_una 1573 out 0 >> flight 1 >> 98.898811259: fin-tx: snd_nxt 1573 rcv_nxt 235 >> 98.899350729: out: flags 11, bytes 32 >> 104.181435961: cc_stat: cwnd 2234 flight 1 space 1448 ssthresh 3961 >> snd_wnd 64128 >> 104.181436051: rcv_stat: rto 11008 srtt 80 mrtt-us 39371 rttvar 66 >> 104.181436102: snd_stat: cc_space 2233 sacked 0 lost 0 out 0 rxt >> 55986 >> 104.181436602: ack-rx: 1 snd_una 1574 snd_wnd 64128 cwnd 2234 >> inflight 0 >> 104.181437352: state: fin-wait-2 >> 104.181438642: in: order len 39 written 39 rcv_nxt 274 rcv_wnd(o) >> 8192 >> 104.181439438: fin-rx: snd_nxt 1574 rcv_nxt 274 >> 104.181439499: state: time-wait >> >> which, along with some added debug print, indicates >> that http_static_server_rx_callback() on a connection that was already >> closed and for which http_static_server_session_disconnect_callback() was >> already called. >> Looks like some of the enqueued data is pushed >> via session_main_flush_enqueue_events() after the disconnect callback. 
At >> this point, I'm not sure how that should be fixed, but probably it should >> be as invoking rx callback after disconnect one doesn't seem to be the >> expected behaviour. >> >> I also noticed another problem with uninitialized elog track with TCP >> debugging enabled for which I submitted a fix (Gerrit worked for me this >> time), basically that TCP_EVT (TCP_EVT_SYN_RCVD, child, 1); that was moved >> in one of previous fixes must be moved even further up, as in case of >> failed session_stream_accept() another elog crash will happen >> in tcp_connection_cleanup() if the elog track is not initialized. >> The patch is here: https://gerrit.fd.io/r/c/vpp/+/28074 >> >> As of the timer crash in http_static in case of the release build, I'm a >> bit unsure but it looks like http_static_server_session_timer_stop() is not >> idempotent, as it doesn't reset the timer handle. If you think this is a >> correct guess, I can re-verify this patch and submit it. >> >> diff --git a/src/plugins/http_static/static_server.c >> b/src/plugins/http_static/static_server.c >> index 45d8731af..7772a9ec0 100644 >> --- a/src/plugins/http_static/static_server.c >> +++ b/src/plugins/http_static/static_server.c >> @@ -124,6 +124,7 @@ http_static_server_session_timer_stop (http_session_t >> * hs) >> return; >> clib_spinlock_lock (&http_static_server_main.tw_lock); >> tw_timer_stop_2t_1w_2048sl (&http_static_server_main.tw, >> hs->timer_handle); >> + hs->timer_handle = ~0; >> clib_spinlock_unlock (&http_static_server_main.tw_lock); >> } >> >> >> >> On Sat, Jul 25, 2020 at 12:13 AM Florin Coras <fcoras.li...@gmail.com> >> wrote: >> >>> Great! Thanks for confirming! >>> >>> Let me know how it goes with the static http server. >>> >>> Cheers, >>> Florin >>> >>> On Jul 24, 2020, at 2:00 PM, Ivan Shvedunov <ivan...@gmail.com> wrote: >>> >>> Hi Florin, >>> I re-verified the patches and the modified patch doesn't crash either, >>> so I think it's safe to merge it. >>> Thanks! >>> >>> I will try to see what is the remaining problem with http_static >>> >>> On Fri, Jul 24, 2020 at 8:15 PM Florin Coras <fcoras.li...@gmail.com> >>> wrote: >>> >>>> Hi Ivan, >>>> >>>> Adding Vanessa to see if she can help with the account issues. >>>> >>>> Thanks a lot for the patches! Pushed them here [1] and [2]. I took the >>>> liberty of slightly changing [2], so if you get a chance, do try it out >>>> again. >>>> >>>> Finally, the static http server still needs fixes. Most probably it >>>> mishandles cleanups and corrupts the sessions/connections under heavy load. >>>> >>>> Regards, >>>> Florin >>>> >>>> [1] https://gerrit.fd.io/r/c/vpp/+/28063 >>>> [2] https://gerrit.fd.io/r/c/vpp/+/28064 >>>> >>>> >>>> On Jul 24, 2020, at 9:27 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: >>>> >>>> I did a bit more debugging and found an issue that was causing >>>> invalid TCP connection lookups. >>>> Basically, if session_connected_callback was failing for an app (in >>>> case of proxy, e.g. b/c the other corresponding connection got closed), it >>>> was leaving an invalid entry in the session lookup table. >>>> Another less serious issue that I found while debugging this issue is >>>> that when TCP elog is enabled in tcp_debug.h, the TCP stack crashes upon >>>> accepting connections because of misplaced TCP_EVT. >>>> After fixing these issues, my test no longer crashes with the proxy >>>> example. That being said, http_static plugin still crashes with the timer >>>> error. 
>>>> >>>> I tried to submit both patches to fd.io Gerrit, but as I didn't submit >>>> any CRs to this Gerrit before, I ran into a problem with my old email in >>>> Linux Foundation account settings. I've fixed that and then updated the >>>> email address in fd.io Gerrit too, but still getting the same error: >>>> remote: ERROR: commit e98791a: email address ivan...@gmail.com is not >>>> registered in your account, and you lack 'forge committer' permission. >>>> >>>> Because of these issues I'm attaching both patches here for now. >>>> >>>> On Thu, Jul 23, 2020 at 6:47 PM Florin Coras <fcoras.li...@gmail.com> >>>> wrote: >>>> >>>>> Ah, I didn’t try running test.sh 80. The only difference in how I’m >>>>> running the test is that I start vpp outside of start.sh straight from >>>>> binaries. >>>>> >>>>> Regards, >>>>> Florin >>>>> >>>>> On Jul 23, 2020, at 8:22 AM, Ivan Shvedunov <ivan...@gmail.com> wrote: >>>>> >>>>> Well, I always run the same test, the difference being only >>>>> "test.sh 80" for http_static (it's configured to be listening on that >>>>> port) >>>>> or just "test.sh" for the proxy. As far as I understand, you run the >>>>> tests without using the containers, does that include setting up netem >>>>> like >>>>> this [1] ? >>>>> >>>>> [1] >>>>> https://github.com/ivan4th/vpp-tcp-test/blob/a3b02ec/start.sh#L34-L35 >>>>> >>>>> On Thu, Jul 23, 2020 at 5:10 PM Florin Coras <fcoras.li...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi Ivan, >>>>>> >>>>>> Updated [1] but I’m not seeing [3] after several test iterations. >>>>>> >>>>>> Probably the static server needs the same treatment as the proxy. Are >>>>>> you running a slightly different test? All of the builtin apps have the >>>>>> potential to crash vpp or leave the host stack in an unwanted state since >>>>>> they run inline. >>>>>> >>>>>> Either way, to solve this, first step would be to get rid of error >>>>>> like, “no http session for thread 0 session_index x”. Will eventually try >>>>>> to look into it if nobody beats me to it. >>>>>> >>>>>> Regards, >>>>>> Florin >>>>>> >>>>>> On Jul 23, 2020, at 4:59 AM, Ivan Shvedunov <ivan...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> http_static produces some errors: >>>>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http >>>>>> session for thread 0 session_index 4124 >>>>>> /usr/bin/vpp[40]: http_static_server_rx_tx_callback:1010: No http >>>>>> session for thread 0 session_index 4124 >>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp >>>>>> error state CLOSE_WAIT flags 0x02 SYN >>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13658 disp >>>>>> error state CLOSE_WAIT flags 0x02 SYN >>>>>> /usr/bin/vpp[40]: tcp_input_dispatch_buffer:2812: tcp conn 13350 disp >>>>>> error state CLOSE_WAIT flags 0x02 SYN >>>>>> >>>>>> also with multiple different CP connection states related to >>>>>> connections being closed and receiving SYN / SYN+ACK. >>>>>> The release build crashes (did already happen before, so it's >>>>>> unrelated to any of the fixes [1]): >>>>>> >>>>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response >>>>>> header! >>>>>> /usr/bin/vpp[39]: state_sent_ok:973: BUG: couldn't send response >>>>>> header! >>>>>> >>>>>> Program received signal SIGSEGV, Segmentation fault. >>>>>> 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, >>>>>> elt=<optimized out>) >>>>>> at /src/vpp/src/vppinfra/tw_timer_template.c:154 >>>>>> 154 /src/vpp/src/vppinfra/tw_timer_template.c: No such file or >>>>>> directory. 
>>>>>> #0 0x00007ffff4fdcfb9 in timer_remove (pool=0x7fffb56b6828, >>>>>> elt=<optimized out>) at >>>>>> /src/vpp/src/vppinfra/tw_timer_template.c:154 >>>>>> #1 tw_timer_stop_2t_1w_2048sl ( >>>>>> tw=0x7fffb0967728 <http_static_server_main+288>, handle=7306) >>>>>> at /src/vpp/src/vppinfra/tw_timer_template.c:374 >>>>>> ---Type <return> to continue, or q <return> to quit--- >>>>>> #2 0x00007fffb076146f in http_static_server_session_timer_stop >>>>>> (hs=<optimized out>) >>>>>> at /src/vpp/src/plugins/http_static/static_server.c:126 >>>>>> #3 http_static_server_rx_tx_callback (s=0x7fffb5e13a40, >>>>>> cf=CALLED_FROM_RX) >>>>>> at /src/vpp/src/plugins/http_static/static_server.c:1026 >>>>>> #4 0x00007fffb0760eb8 in http_static_server_rx_callback ( >>>>>> s=0x7fffb0967728 <http_static_server_main+288>) >>>>>> at /src/vpp/src/plugins/http_static/static_server.c:1037 >>>>>> #5 0x00007ffff774a9de in app_worker_builtin_rx (app_wrk=<optimized >>>>>> out>, s=0x7fffb5e13a40) >>>>>> at /src/vpp/src/vnet/session/application_worker.c:485 >>>>>> #6 app_send_io_evt_rx (app_wrk=<optimized out>, s=0x7fffb5e13a40) >>>>>> at /src/vpp/src/vnet/session/application_worker.c:691 >>>>>> #7 0x00007ffff7713d9a in session_enqueue_notify_inline >>>>>> (s=0x7fffb5e13a40) >>>>>> at /src/vpp/src/vnet/session/session.c:632 >>>>>> #8 0x00007ffff7713fd1 in session_main_flush_enqueue_events >>>>>> (transport_proto=<optimized out>, >>>>>> thread_index=0) at /src/vpp/src/vnet/session/session.c:736 >>>>>> #9 0x00007ffff63960e9 in tcp46_established_inline (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, >>>>>> node=<optimized out>, frame=<optimized out>, is_ip4=1) at >>>>>> /src/vpp/src/vnet/tcp/tcp_input.c:1558 >>>>>> #10 tcp4_established_node_fn_hsw (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, node=<optimized out>, >>>>>> from_frame=0x7fffb5458480) at >>>>>> /src/vpp/src/vnet/tcp/tcp_input.c:1573 >>>>>> #11 0x00007ffff5b5f509 in dispatch_node (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, node=0x7fffb4baf400, >>>>>> type=VLIB_NODE_TYPE_INTERNAL, >>>>>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=<optimized out>, >>>>>> last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1194 >>>>>> #12 dispatch_pending_node (vm=0x7ffff5ddc6c0 <vlib_global_main>, >>>>>> pending_frame_index=<optimized out>, >>>>>> last_time_stamp=<optimized out>) at /src/vpp/src/vlib/main.c:1353 >>>>>> #13 vlib_main_or_worker_loop (vm=<optimized out>, is_main=1) at >>>>>> /src/vpp/src/vlib/main.c:1848 >>>>>> #14 vlib_main_loop (vm=<optimized out>) at >>>>>> /src/vpp/src/vlib/main.c:1976 >>>>>> #15 0x00007ffff5b5daf0 in vlib_main (vm=0x7ffff5ddc6c0 >>>>>> <vlib_global_main>, input=0x7fffb4762fb0) >>>>>> at /src/vpp/src/vlib/main.c:2222 >>>>>> #16 0x00007ffff5bc2816 in thread0 (arg=140737318340288) at >>>>>> /src/vpp/src/vlib/unix/main.c:660 >>>>>> #17 0x00007ffff4fa9ec4 in clib_calljmp () from >>>>>> /usr/lib/x86_64-linux-gnu/libvppinfra.so.20.09 >>>>>> #18 0x00007fffffffd8b0 in ?? () >>>>>> #19 0x00007ffff5bc27c8 in vlib_unix_main (argc=<optimized out>, >>>>>> argv=<optimized out>) >>>>>> at /src/vpp/src/vlib/unix/main.c:733 >>>>>> >>>>>> [1] >>>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-release-http_static-timer_remove.log >>>>>> >>>>>> On Thu, Jul 23, 2020 at 2:47 PM Ivan Shvedunov via lists.fd.io >>>>>> <ivan4th=gmail....@lists.fd.io> wrote: >>>>>> >>>>>>> Hi, >>>>>>> I've found a problem with the timer fix and commented in Gerrit [1] >>>>>>> accordingly. 
>>>>>>> Basically this change [2] makes the tcp_prepare_retransmit_segment() >>>>>>> issue go away for me. >>>>>>> >>>>>>> Concerning the proxy example, I can no longer see the SVM FIFO >>>>>>> crashes, but when using debug build, VPP crashes with this error (full >>>>>>> log >>>>>>> [3]) during my test: >>>>>>> /usr/bin/vpp[39]: /src/vpp/src/vnet/tcp/tcp_input.c:2857 >>>>>>> (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc1, b[1], >>>>>>> tcp_buffer_hdr (b[1]))' fails >>>>>>> >>>>>>> When using release build, it produces a lot of messages like this >>>>>>> instead: >>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 15168 >>>>>>> disp error state CLOSE_WAIT flags 0x02 SYN >>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 9417 disp >>>>>>> error state FIN_WAIT_2 flags 0x12 SYN ACK >>>>>>> /usr/bin/vpp[39]: tcp_input_dispatch_buffer:2812: tcp conn 10703 >>>>>>> disp error state TIME_WAIT flags 0x12 SYN ACK >>>>>>> >>>>>>> and also >>>>>>> >>>>>>> /usr/bin/vpp[39]: active_open_connected_callback:439: connection >>>>>>> 85557 failed! >>>>>>> >>>>>>> [1] >>>>>>> https://gerrit.fd.io/r/c/vpp/+/27952/4/src/vnet/tcp/tcp_timer.h#39 >>>>>>> [2] >>>>>>> https://github.com/travelping/vpp/commit/04512323f311ceebfda351672372033b567d37ca >>>>>>> [3] >>>>>>> https://github.com/ivan4th/vpp-tcp-test/blob/master/logs/crash-debug-proxy-tcp_lookup_is_valid.log#L71 >>>>>>> >>>>>>> I will look into src/vcl/test/test_vcl.py to see if I can reproduce >>>>>>> something like my test there, thanks! >>>>>>> And waiting for Dave's input concerning the CSIT part, too, of >>>>>>> course. >>>>>>> >>>>>>> >>>>>>> On Thu, Jul 23, 2020 at 5:22 AM Florin Coras <fcoras.li...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Ivan, >>>>>>>> >>>>>>>> Thanks for the test. After modifying it a bit to run straight from >>>>>>>> binaries, I managed to repro the issue. As expected, the proxy is not >>>>>>>> cleaning up the sessions correctly (example apps do run out of sync >>>>>>>> ..). >>>>>>>> Here’s a quick patch that solves some of the obvious issues [1] (note >>>>>>>> that >>>>>>>> it’s chained with gerrit 27952). I didn’t do too much testing, so let >>>>>>>> me >>>>>>>> know if you hit some other problems. As far as I can tell, 27952 is >>>>>>>> needed. >>>>>>>> >>>>>>>> As for the CI, I guess there are two types of tests we might want >>>>>>>> (cc-ing Dave since he has experience with this): >>>>>>>> - functional test that could live as part of “make test” infra. The >>>>>>>> host stack already has some functional integration tests, i.e., the vcl >>>>>>>> tests in src/vcl/test/test_vcl.py (quic, tls, tcp also have some). We >>>>>>>> could >>>>>>>> do something similar for the proxy app, but the tests need to be >>>>>>>> lightweight as they’re run as part of the verify jobs >>>>>>>> - CSIT scale/performance tests. We could use something like your >>>>>>>> scripts to test the proxy but also ld_preload + nginx and other >>>>>>>> applications. Dave should have more opinions here :-) >>>>>>>> >>>>>>>> Regards, >>>>>>>> Florin >>>>>>>> >>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/28041 >>>>>>>> >>>>>>>> On Jul 22, 2020, at 1:18 PM, Ivan Shvedunov <ivan...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Concerning the CI: I'd be glad to add that test to "make test", but >>>>>>>> not sure how to approach it. 
The test is not about containers but more >>>>>>>> about using network namespaces and some tools like wrk to create a lot >>>>>>>> of >>>>>>>> TCP connections to do some "stress testing" of VPP host stack (and as >>>>>>>> it >>>>>>>> was noted, it fails not only on the proxy example, but also on >>>>>>>> http_static >>>>>>>> plugin). It's probably doable w/o any external tooling at all, and even >>>>>>>> without the network namespaces either, using only VPP's own TCP stack, >>>>>>>> but >>>>>>>> that is probably rather hard. Could you suggest some ideas how it >>>>>>>> could be >>>>>>>> added to "make test"? Should I add a `test_....py` under `tests/` that >>>>>>>> creates host interfaces in VPP and uses these via OS networking >>>>>>>> instead of >>>>>>>> the packet generator? As far as I can see there's something like that >>>>>>>> in >>>>>>>> srv6-mobile plugin [1]. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://github.com/travelping/vpp/blob/feature/2005/upf/src/plugins/srv6-mobile/extra/runner.py#L125 >>>>>>>> >>>>>>>> On Wed, Jul 22, 2020 at 8:25 PM Florin Coras < >>>>>>>> fcoras.li...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I missed the point about the CI in my other reply. If we can >>>>>>>>> somehow integrate some container based tests into the “make test” >>>>>>>>> infra, I >>>>>>>>> wouldn’t mind at all! :-) >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Florin >>>>>>>>> >>>>>>>>> On Jul 22, 2020, at 4:17 AM, Ivan Shvedunov <ivan...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> sadly the patch apparently didn't work. It should have worked but >>>>>>>>> for some reason it didn't ... >>>>>>>>> >>>>>>>>> On the bright side, I've made a test case [1] using fresh upstream >>>>>>>>> VPP code with no UPF that reproduces the issues I mentioned, >>>>>>>>> including both >>>>>>>>> timer and TCP retransmit one along with some other possible problems >>>>>>>>> using >>>>>>>>> http_static plugin and the proxy example, along with nginx (with >>>>>>>>> proxy) and >>>>>>>>> wrk. >>>>>>>>> >>>>>>>>> It is docker-based, but the main scripts (start.sh and test.sh) >>>>>>>>> can be used without Docker, too. >>>>>>>>> I've used our own Dockerfiles to build the images, but I'm not >>>>>>>>> sure if that makes any difference. >>>>>>>>> I've added some log files resulting from the runs that crashed in >>>>>>>>> different places. For me, the tests crash on each run, but in >>>>>>>>> different >>>>>>>>> places. >>>>>>>>> >>>>>>>>> The TCP retransmit problem happens with http_static when using the >>>>>>>>> debug build. When using release build, some unrelated crash >>>>>>>>> in timer_remove() happens instead. >>>>>>>>> The SVM FIFO crash happens when using the proxy. It can happen >>>>>>>>> with both release and debug builds. >>>>>>>>> >>>>>>>>> Please see the repo [1] for details and crash logs. >>>>>>>>> >>>>>>>>> [1] https://github.com/ivan4th/vpp-tcp-test >>>>>>>>> >>>>>>>>> P.S. As the tests do expose some problems with VPP host stack and >>>>>>>>> some of the VPP plugins/examples, maybe we should consider adding >>>>>>>>> them to >>>>>>>>> the VPP CI, too? >>>>>>>>> >>>>>>>>> On Thu, Jul 16, 2020 at 8:33 PM Florin Coras < >>>>>>>>> fcoras.li...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Ivan, >>>>>>>>>> >>>>>>>>>> Thanks for the detailed report! >>>>>>>>>> >>>>>>>>>> I assume this is a situation where most of the connections time >>>>>>>>>> out and the rate limiting we apply on the pending timer queue delays >>>>>>>>>> handling for long enough to be in a situation like the one you >>>>>>>>>> described. 
>>>>>>>>>> Here’s a draft patch that starts tracking pending timers [1]. Let me >>>>>>>>>> know >>>>>>>>>> if it solves the first problem. >>>>>>>>>> >>>>>>>>>> Regarding the second, it looks like the first chunk in the fifo >>>>>>>>>> is not properly initialized/corrupted. It’s hard to tell what leads >>>>>>>>>> to that >>>>>>>>>> given that I haven’t seen this sort of issues even with larger >>>>>>>>>> number of >>>>>>>>>> connections. You could maybe try calling svm_fifo_is_sane() in the >>>>>>>>>> enqueue/dequeue functions, or after the proxy allocates/shares the >>>>>>>>>> fifos to >>>>>>>>>> catch the issue as early as possible. >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Florin >>>>>>>>>> >>>>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/27952 >>>>>>>>>> >>>>>>>>>> On Jul 16, 2020, at 2:03 AM, ivan...@gmail.com wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> I'm working on the Travelping UPF project >>>>>>>>>> https://github.com/travelping/vpp >>>>>>>>>> <https://github.com/travelping/vpp>For variety of reasons, it's >>>>>>>>>> presently maintained as a fork of UPF that's rebased on top of >>>>>>>>>> upstream >>>>>>>>>> master from time to time, but really it's just a plugin. During 40K >>>>>>>>>> TCP >>>>>>>>>> connection test with netem, I found an issue with TCP timer race >>>>>>>>>> (timers >>>>>>>>>> firing after tcp_timer_reset() was called for them) which I tried to >>>>>>>>>> work >>>>>>>>>> around only to stumble into another crash, which I'm presently >>>>>>>>>> debugging >>>>>>>>>> (an SVM FIFO bug, possibly) but maybe some of you folks have some >>>>>>>>>> ideas >>>>>>>>>> what it could be. >>>>>>>>>> I've described my findings in this JIRA ticket: >>>>>>>>>> https://jira.fd.io/browse/VPP-1923 >>>>>>>>>> Although the last upstream commit UPF is presently based on >>>>>>>>>> (afc233aa93c3f23b30b756cb4ae2967f968bbbb1) >>>>>>>>>> was some time ago, I believe the problems are still relevant as >>>>>>>>>> there were no changes in these parts of code in master since that >>>>>>>>>> commit. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>>>>> 0BC5 2807 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>>>> 0BC5 2807 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>>> 0BC5 2807 >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>>> 0BC5 2807 >>>>>> >>>>>> >>>>>> >>>>> >>>>> -- >>>>> Ivan Shvedunov <ivan...@gmail.com> >>>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 >>>>> 0BC5 2807 >>>>> >>>>> >>>>> >>>> >>>> -- >>>> Ivan Shvedunov <ivan...@gmail.com> >>>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 >>>> 2807 >>>> <0002-Fix-initializing-elog-track-for-accepted-TCP-connect.patch> >>>> <0001-Fix-bad-session-lookups-after-failed-session_connect.patch> >>>> >>>> >>>> >>> >>> -- >>> Ivan Shvedunov <ivan...@gmail.com> >>> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 >>> 2807 >>> >>> >>> >> >> -- >> Ivan Shvedunov <ivan...@gmail.com> >> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 >> 2807 >> >> >> > > -- > Ivan Shvedunov <ivan...@gmail.com> > ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 > 2807 > > > -- Ivan Shvedunov <ivan...@gmail.com> ;; My GPG fingerprint is: 2E61 0748 8E12 BB1A 5AB9 F7D0 613E C0F8 0BC5 2807