Hi vpp-dev,

I'm seeing a crash when I run our application with multiple workers.
Nov 26 14:29:32  vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8, faulting address 0x7fa6cd0bd444
Nov 26 14:29:32  vnet[64035]: #0  0x00007f6a812743d8 0x7f6a812743d8
Nov 26 14:29:32  vnet[64035]: #1  0x00007f6a80bc56d0 0x7f6a80bc56d0
Nov 26 14:29:32  vnet[64035]: #2  0x00007f6979a12ce8 vlib_frame_vector_args + 0x10
Nov 26 14:29:32  vnet[64035]: #3  0x00007f6979a16a2c tcpo_enqueue_to_output_i + 0xf4
Nov 26 14:29:32  vnet[64035]: #4  0x00007f6979a16b23 tcpo_enqueue_to_output + 0x25
Nov 26 14:29:32  vnet[64035]: #5  0x00007f6979a33fba send_packets + 0x7f2
Nov 26 14:29:32  vnet[64035]: #6  0x00007f6979a346f8 connection_tx + 0x17e
Nov 26 14:29:32  vnet[64035]: #7  0x00007f6979a34f08 tcpo_dispatch_node_fn + 0x7fa
Nov 26 14:29:32  vnet[64035]: #8  0x00007f6a81248cb6 vlib_worker_loop + 0x6a6
Nov 26 14:29:32  vnet[64035]: #9  0x00007f6a8094f694 0x7f6a8094f694

Running on CentOS 7.4 with kernel 3.10.0-693.el7.x86_64
VPP
Version:                  v18.10-13~g00adcce~b60
Compiled by:              root
Compile host:             b0f32e97e93a
Compile date:             Mon Nov 26 09:09:42 UTC 2018
Compile location:         /w/workspace/vpp-merge-1810-centos7
Compiler:                 GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
Current PID:              9612

The hardware is a Cisco server with two Intel Xeon E5-2697A v4 sockets @ 2.60GHz and two Intel X520 NICs. A T-Rex traffic generator is hooked up on the other end to provide data at about 5 Gbps per NIC:
./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000

startup.conf:
unix {
  nodaemon
  interactive
  log /opt/tcpo/logs/vpp.log
  full-coredump
  cli-no-banner
  #startup-config /opt/tcpo/conf/local.conf
  cli-listen /run/vpp/cli.sock
}
api-trace {
  on
}
heapsize 3G
cpu {
  main-core 1
  corelist-workers 2-5
}
tcpo {
  runtime-config /opt/tcpo/conf/runtime.conf
  session-pool-size 1024000
}
dpdk {
  dev 0000:86:00.0 {
    num-rx-queues 1
  }
  dev 0000:86:00.1 {
    num-rx-queues 1
  }
  dev 0000:84:00.0 {
    num-rx-queues 1
  }
  dev 0000:84:00.1 {
    num-rx-queues 1
  }
  num-mbufs 1024000
  socket-mem 4096,4096
}
plugin_path /usr/lib/vpp_plugins
api-segment {
  gid vpp
}

Here's the function where the SIGSEGV is happening:
static void
enqueue_to_output_i (tcpo_worker_ctx_t *wrk, u32 bi, u8 flush)
{
    u32 *to_next, next_index;
    vlib_frame_t *f;

    TRACE_FUNC_VAR (bi);

    next_index = tcpo_output_node.index;

    /* Get frame to output node */
    f = wrk->tx_frame;
    if (!f) {
        f = vlib_get_frame_to_node (wrk->vm, next_index);
        ASSERT (clib_mem_is_heap_object (f));
        wrk->tx_frame = f;
    }
    ASSERT (clib_mem_is_heap_object (f));

    to_next = vlib_frame_vector_args (f);
    to_next[f->n_vectors] = bi;
    f->n_vectors += 1;

    if (flush || f->n_vectors == VLIB_FRAME_SIZE) {
        TRACE_FUNC_VAR2 (flush, f->n_vectors);
        vlib_put_frame_to_node (wrk->vm, next_index, f);
        wrk->tx_frame = 0;
    }
}

I've observed that after a few Gbps of traffic have gone through, the pointer *f* returned by *vlib_get_frame_to_node* points to a chunk of memory that is invalid, as confirmed by the ASSERT (clib_mem_is_heap_object (f)) I added right after the call in the code above.
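
To gather more data on a release image (where ASSERTs are compiled out), I'm thinking of logging the bad pointer and the thread it shows up on instead. A minimal sketch; clib_warning, clib_mem_is_heap_object, PREDICT_FALSE, and os_get_thread_index are existing vppinfra/vlib helpers, while the surrounding names are from our code:

/* Sketch: warn instead of only asserting, so the corruption is still
 * reported when asserts are disabled. */
f = vlib_get_frame_to_node (wrk->vm, next_index);
if (PREDICT_FALSE (!clib_mem_is_heap_object (f)))
    clib_warning ("bad frame %p for node %u on thread %u",
                  f, next_index, os_get_thread_index ());
wrk->tx_frame = f;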

I'm not sure how to make further progress on tracking down this issue; any help or advice would be much appreciated.
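
One experiment I'm planning, in case holding the frame across dispatch cycles is part of the problem: flush any partially filled frame at the end of every dispatch so wrk->tx_frame never outlives a single run of the node. This is a sketch only; tcpo_flush_output is a name I made up, and vlib_put_frame_to_node is the standard vlib call:

/* Hypothetical helper, called at the end of each dispatch: hand any
 * cached, partially filled frame to the output node immediately so no
 * frame pointer is held across dispatch cycles. */
static void
tcpo_flush_output (tcpo_worker_ctx_t *wrk)
{
    vlib_frame_t *f = wrk->tx_frame;
    if (f && f->n_vectors) {
        vlib_put_frame_to_node (wrk->vm, tcpo_output_node.index, f);
        wrk->tx_frame = 0;
    }
}

If anyone can confirm whether caching a vlib_frame_t pointer in per-worker state like this is legal with multiple workers, that alone would help.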

Thanks,
Hugo