Hi Dave,

On Wed, 3 July 2019 at 14:17, Dave Barach (dbarach) <dbar...@cisco.com> wrote:
> Dear Andreas,
>
> Single thread vs. multiple workers?

We have intentionally limited this to one CPU, so it can't be a concurrent process doing something.

> Debug image?

So far the problem has been observed only in release images under load. I've been unable to replicate the problem with artificial tests or on a debug image.

> vm->heap_aligned_base matches reality?

Not sure what that means. How do I check that?

> (Virtual address of allocated frame - vm->heap_aligned_base) /
> CLIB_CACHE_LINE_BYTES fits in 32 bits?

I'm doing a:

    vlib_buffer_alloc (vm, &bi0, 1);
    b0 = vlib_get_buffer (vm, bi0);

right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit too large? b0 is optimised out, so I can't tell its value.

> In vlib/main.c:vlib_frame_alloc_to_node(...) try replacing
> vlib_frame_index_no_check(vm, f) with vlib_frame_index(vm, f) in a debug
> image.

Will do.

> Again, best I can do to help w/ next-to-no information.

The problem is, I don't know what information will be useful and how to extract it. I have a core file and can dig into some internal structures. But which ones are helpful?

Anyway, I'm grateful for any pointers.

Regards
Andreas

> D.
>
> *From:* vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> *On Behalf Of* Andreas Schultz
> *Sent:* Wednesday, July 3, 2019 4:47 AM
> *To:* Dave Barach (dbarach) <dbar...@cisco.com>
> *Cc:* Hugo Garza <hu...@opanga.com>; vpp-dev@lists.fd.io
> *Subject:* Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node
>
> Hi,
>
> I've run into the same issue with different, but also external code.
>
> The calling sequence in my case looks very similar to the one from Hugo.
> I'm also getting an invalid pointer from vlib_get_frame_to_node.
>
> It is crashing here:
> https://github.com/travelping/vpp/blob/feature/master/upf%2Btdf/src/plugins/upf/upf_pfcp_server.c#L121
>
> @Hugo: have you found the root cause for your problem?
>
> Regards
> Andreas
>
> On Wed, 28 Nov 2018 at 12:53, Dave Barach via Lists.Fd.Io
> <dbarach=cisco....@lists.fd.io> wrote:
>
> None of the routine names in the backtrace exist in master/latest - it's
> your code - so it will be challenging for the community to help you.
>
> See if you can repro the problem with a TAG=vpp_debug image (aka "make
> build", not "make build-release"). If you're lucky, one of the numerous
> ASSERTs will catch the problem early.
>
> vlib_get_frame_to_node(...) is not new code, it's used all over the place,
> and it needs "help" to fail as shown below.
>
> D.
>
> *From:* vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> *On Behalf Of* Hugo Garza
> *Sent:* Tuesday, November 27, 2018 7:39 PM
> *To:* vpp-dev@lists.fd.io
> *Subject:* [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node
>
> Hi vpp-dev,
>
> I'm seeing a crash when I enable our application with multiple workers.
>
> Nov 26 14:29:32 vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8, faulting address 0x7fa6cd0bd444
> Nov 26 14:29:32 vnet[64035]: #0 0x00007f6a812743d8 0x7f6a812743d8
> Nov 26 14:29:32 vnet[64035]: #1 0x00007f6a80bc56d0 0x7f6a80bc56d0
> Nov 26 14:29:32 vnet[64035]: #2 0x00007f6979a12ce8 vlib_frame_vector_args + 0x10
> Nov 26 14:29:32 vnet[64035]: #3 0x00007f6979a16a2c tcpo_enqueue_to_output_i + 0xf4
> Nov 26 14:29:32 vnet[64035]: #4 0x00007f6979a16b23 tcpo_enqueue_to_output + 0x25
> Nov 26 14:29:32 vnet[64035]: #5 0x00007f6979a33fba send_packets + 0x7f2
> Nov 26 14:29:32 vnet[64035]: #6 0x00007f6979a346f8 connection_tx + 0x17e
> Nov 26 14:29:32 vnet[64035]: #7 0x00007f6979a34f08 tcpo_dispatch_node_fn + 0x7fa
> Nov 26 14:29:32 vnet[64035]: #8 0x00007f6a81248cb6 vlib_worker_loop + 0x6a6
> Nov 26 14:29:32 vnet[64035]: #9 0x00007f6a8094f694 0x7f6a8094f694
>
> Running on CentOS 7.4 with kernel 3.10.0-693.el7.x86_64
>
> VPP
> Version: v18.10-13~g00adcce~b60
> Compiled by: root
> Compile host: b0f32e97e93a
> Compile date: Mon Nov 26 09:09:42 UTC 2018
> Compile location: /w/workspace/vpp-merge-1810-centos7
> Compiler: GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
> Current PID: 9612
>
> On a Cisco server with 2-socket Intel Xeon E5-2697Av4 @ 2.60GHz and 2
> Intel X520 NICs. A T-Rex traffic generator is hooked up on the other end to
> provide data at about 5 Gbps per NIC.
>
> ./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000
>
> startup.conf
>
> unix {
>   nodaemon
>   interactive
>   log /opt/tcpo/logs/vpp.log
>   full-coredump
>   cli-no-banner
>   #startup-config /opt/tcpo/conf/local.conf
>   cli-listen /run/vpp/cli.sock
> }
> api-trace {
>   on
> }
> heapsize 3G
> cpu {
>   main-core 1
>   corelist-workers 2-5
> }
> tcpo {
>   runtime-config /opt/tcpo/conf/runtime.conf
>   session-pool-size 1024000
> }
> dpdk {
>   dev 0000:86:00.0 {
>     num-rx-queues 1
>   }
>   dev 0000:86:00.1 {
>     num-rx-queues 1
>   }
>   dev 0000:84:00.0 {
>     num-rx-queues 1
>   }
>   dev 0000:84:00.1 {
>     num-rx-queues 1
>   }
>   num-mbufs 1024000
>   socket-mem 4096,4096
> }
> plugin_path /usr/lib/vpp_plugins
> api-segment {
>   gid vpp
> }
>
> Here's the function where the SIGSEGV is happening:
>
> static void enqueue_to_output_i(tcpo_worker_ctx_t * wrk, u32 bi, u8 flush) {
>   u32 *to_next, next_index;
>   vlib_frame_t *f;
>
>   TRACE_FUNC_VAR(bi);
>
>   next_index = tcpo_output_node.index;
>
>   /* Get frame to output node */
>   f = wrk->tx_frame;
>   if (!f) {
>     f = vlib_get_frame_to_node(wrk->vm, next_index);
>     ASSERT (clib_mem_is_heap_object (f));
>     wrk->tx_frame = f;
>   }
>   ASSERT (clib_mem_is_heap_object (f));
>
>   to_next = vlib_frame_vector_args(f);
>   to_next[f->n_vectors] = bi;
>   f->n_vectors += 1;
>
>   if (flush || f->n_vectors == VLIB_FRAME_SIZE) {
>     TRACE_FUNC_VAR2(flush, f->n_vectors);
>     vlib_put_frame_to_node(wrk->vm, next_index, f);
>     wrk->tx_frame = 0;
>   }
> }
>
> I've observed that after a few Gbps of traffic go through, when we call
> *vlib_get_frame_to_node*, the returned pointer *f* points to a chunk of
> memory that is invalid, as confirmed by the assert statement that I added
> right below the call.
>
> I'm not sure how to progress further on tracking down this issue; any help
> or advice would be much appreciated.
>
> Thanks,
> Hugo
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#11444): https://lists.fd.io/g/vpp-dev/message/11444
> Mute This Topic: https://lists.fd.io/mt/28408842/675601
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [andreas.schu...@travelping.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
> --
> Andreas Schultz
> --
> Principal Engineer
> t: +49 391 819099-224
>
> ------------------------------- enabling your networks -----------------------------
>
> Travelping GmbH
> Roentgenstraße 13
> 39108 Magdeburg
> Germany
>
> t: +49 391 819099-0
> f: +49 391 819099-299
> e: i...@travelping.com
> w: https://www.travelping.com/
>
> Company registration: Amtsgericht Stendal
> Reg. No.: HRB 10578
> Managing Director: Holger Winkelmann
> VAT ID: DE236673780
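
For reference, Dave's question whether (virtual address of allocated frame - vm->heap_aligned_base) / CLIB_CACHE_LINE_BYTES fits in 32 bits can be worked through in standalone C. The question suggests that a frame is referred to by a 32-bit index counted in cache lines from vm->heap_aligned_base. The following is only a minimal sketch under that reading, not VPP code: the 64-byte cache line is an assumption, and every name starting with `sketch_` or `SKETCH_` is hypothetical. In practice the two addresses would be read from the core file with gdb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, standing in for CLIB_CACHE_LINE_BYTES. */
#define SKETCH_CACHE_LINE_BYTES 64u

/* Hypothetical helper (not a VPP function): true if a frame at address
   `frame` is cache-line aligned, lies at or above `heap_aligned_base`, and
   its distance from the base, counted in cache lines, fits in 32 bits. */
static bool
sketch_frame_index_fits_u32 (uint64_t heap_aligned_base, uint64_t frame)
{
  if (frame < heap_aligned_base)
    return false;
  if ((frame & (SKETCH_CACHE_LINE_BYTES - 1)) != 0)
    return false;
  return (frame - heap_aligned_base) / SKETCH_CACHE_LINE_BYTES <= UINT32_MAX;
}

/* Hypothetical helper: store the frame as a 32-bit index and turn it back
   into an address. The result equals `frame` only when the check above
   passes; otherwise the index is truncated and the address is wrong. */
static uint64_t
sketch_frame_round_trip (uint64_t heap_aligned_base, uint64_t frame)
{
  assert (frame >= heap_aligned_base);
  uint32_t index =
    (uint32_t) ((frame - heap_aligned_base) / SKETCH_CACHE_LINE_BYTES);
  return heap_aligned_base + (uint64_t) index * SKETCH_CACHE_LINE_BYTES;
}
```

With the heapsize 3G from Hugo's quoted startup.conf, a frame inside a heap that starts at the base would be at most 3 GiB / 64 = 50,331,648 cache lines away, far below the 32-bit limit. Under these assumptions the check would therefore only fail if the frame was allocated outside that heap or if the base does not match the heap actually in use, which may be what "matches reality" is asking.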
-=-=-=-=-=-=-=-=-=-=-=-
View/Reply Online (#13433): https://lists.fd.io/g/vpp-dev/message/13433
-=-=-=-=-=-=-=-=-=-=-=-