Hi vpp-dev,

I'm seeing a crash when I enable our application with multiple workers:

Nov 26 14:29:32 vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8, faulting address 0x7fa6cd0bd444
Nov 26 14:29:32 vnet[64035]: #0 0x00007f6a812743d8 0x7f6a812743d8
Nov 26 14:29:32 vnet[64035]: #1 0x00007f6a80bc56d0 0x7f6a80bc56d0
Nov 26 14:29:32 vnet[64035]: #2 0x00007f6979a12ce8 vlib_frame_vector_args + 0x10
Nov 26 14:29:32 vnet[64035]: #3 0x00007f6979a16a2c tcpo_enqueue_to_output_i + 0xf4
Nov 26 14:29:32 vnet[64035]: #4 0x00007f6979a16b23 tcpo_enqueue_to_output + 0x25
Nov 26 14:29:32 vnet[64035]: #5 0x00007f6979a33fba send_packets + 0x7f2
Nov 26 14:29:32 vnet[64035]: #6 0x00007f6979a346f8 connection_tx + 0x17e
Nov 26 14:29:32 vnet[64035]: #7 0x00007f6979a34f08 tcpo_dispatch_node_fn + 0x7fa
Nov 26 14:29:32 vnet[64035]: #8 0x00007f6a81248cb6 vlib_worker_loop + 0x6a6
Nov 26 14:29:32 vnet[64035]: #9 0x00007f6a8094f694 0x7f6a8094f694
Running on CentOS 7.4 with kernel 3.10.0-693.el7.x86_64.

VPP Version: v18.10-13~g00adcce~b60
Compiled by: root
Compile host: b0f32e97e93a
Compile date: Mon Nov 26 09:09:42 UTC 2018
Compile location: /w/workspace/vpp-merge-1810-centos7
Compiler: GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
Current PID: 9612

The box is a Cisco server with two-socket Intel Xeon E5-2697A v4 @ 2.60GHz and two Intel X520 NICs. A T-Rex traffic generator is hooked up on the other end to provide data at about 5 Gbps per NIC:

./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000

startup.conf:

unix {
  nodaemon
  interactive
  log /opt/tcpo/logs/vpp.log
  full-coredump
  cli-no-banner
  #startup-config /opt/tcpo/conf/local.conf
  cli-listen /run/vpp/cli.sock
}
api-trace {
  on
}
heapsize 3G
cpu {
  main-core 1
  corelist-workers 2-5
}
tcpo {
  runtime-config /opt/tcpo/conf/runtime.conf
  session-pool-size 1024000
}
dpdk {
  dev 0000:86:00.0 { num-rx-queues 1 }
  dev 0000:86:00.1 { num-rx-queues 1 }
  dev 0000:84:00.0 { num-rx-queues 1 }
  dev 0000:84:00.1 { num-rx-queues 1 }
  num-mbufs 1024000
  socket-mem 4096,4096
}
plugin_path /usr/lib/vpp_plugins
api-segment {
  gid vpp
}

Here's the function where the SIGSEGV is happening:

static void
enqueue_to_output_i (tcpo_worker_ctx_t * wrk, u32 bi, u8 flush)
{
  u32 *to_next, next_index;
  vlib_frame_t *f;

  TRACE_FUNC_VAR (bi);
  next_index = tcpo_output_node.index;

  /* Get frame to output node */
  f = wrk->tx_frame;
  if (!f)
    {
      f = vlib_get_frame_to_node (wrk->vm, next_index);
      ASSERT (clib_mem_is_heap_object (f));
      wrk->tx_frame = f;
    }

  ASSERT (clib_mem_is_heap_object (f));
  to_next = vlib_frame_vector_args (f);
  to_next[f->n_vectors] = bi;
  f->n_vectors += 1;

  if (flush || f->n_vectors == VLIB_FRAME_SIZE)
    {
      TRACE_FUNC_VAR2 (flush, f->n_vectors);
      vlib_put_frame_to_node (wrk->vm, next_index, f);
      wrk->tx_frame = 0;
    }
}

I've observed that after a few Gbps of traffic go through and we call *vlib_get_frame_to_node*, the pointer *f* that gets returned points to a chunk of memory that is invalid, as confirmed by the ASSERT (clib_mem_is_heap_object (f)) that I added right below the call. Not sure how to progress further on tracking down this issue; any help or advice would be much appreciated.

Thanks,
Hugo
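P.S. In case it helps narrow this down, here's the debug guard I'm planning to try next. It's only a sketch and assumes a hypothetical new member, tx_frame_thread_index, added to tcpo_worker_ctx_t; the idea is to catch the cached frame being touched from a worker other than the one that allocated it, which is my current suspicion.

/* Sketch of a per-thread guard (tx_frame_thread_index is a
 * hypothetical member I'd add to tcpo_worker_ctx_t) */
static void
enqueue_to_output_thread_guard (tcpo_worker_ctx_t * wrk, u32 next_index)
{
  /* Cache a fresh frame and remember which worker allocated it */
  if (!wrk->tx_frame)
    {
      wrk->tx_frame = vlib_get_frame_to_node (wrk->vm, next_index);
      wrk->tx_frame_thread_index = vlib_get_thread_index ();
    }

  /* If this fires, a frame cached by one worker is being reused from
   * another worker, which could explain the invalid pointer */
  ASSERT (wrk->tx_frame_thread_index == vlib_get_thread_index ());
}

I'd call this at the top of enqueue_to_output_i, before dereferencing *f*.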