Greetings again,

Here is more context on the problem I'm seeing.
The problem occurs when a large number of IPv4 prefixes is added to the
FIB (by way of the netlink and router plugins).

If the prefix count is below some threshold (e.g. 50,000 prefixes), things
work fine.
At some higher prefix count (I haven't narrowed it down to a specific number,
but I don't think the exact number is relevant), vnet crashes with a failure
in ip4_mtrie.c.

I have been trying to run in debug mode, but am having a lot of difficulty
producing a debug build of everything.
The only way I can successfully build everything is to use the
script vagrant/build.sh (which does a make pkg-rpm that generates a bunch
of rpm files that I then install with yum).
Then I have to rebuild things using the instructions from
vppsb/router/README.md (creating 4 symlinks and running various make
iterations, some of which need a bunch of CFLAGS values just to compile).

I don't see any good/easy way to build debug images from this environment,
without a LOT of work/investigation on how all the various build components
work.

Is the problem easy enough to diagnose from a non-symbolic stack dump? If
not, can someone provide details on how to build and run vpp, including the
netlink/router plugins, with debug symbols so that gdb can be used to isolate
the problem further?

I think there is a bug in the FIB code in vnet that really needs to be
fixed. In the stack dump below, the abort is raised under
vec_resize_allocate_memory, called from ip4_fib_mtrie_route_add.

The box has an unreasonably large amount of memory (128GB, doing nothing
but VPP), and I get the same error even if I increase the initial heap size
by a factor of 2^12 (changing 32<<20 to 32ULL<<32, i.e. from 32 MiB to
128 GiB).
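
To make the heap-size change concrete, here is a minimal C sketch of the
arithmetic only. The function names are hypothetical and are not VPP
symbols; the sketch does not show where VPP actually sets its heap size.

```c
#include <assert.h>
#include <stdint.h>

/* Original expression: 32 << 20 bytes = 2^25 = 32 MiB. */
static uint64_t heap_size_before(void)
{
    return (uint64_t)32 << 20;
}

/* Modified expression: 32ULL << 32 bytes = 2^37 = 128 GiB.
 * The ULL suffix matters: shifting a 32-bit int by 32 bits is
 * undefined behavior in C, so a plain 32 << 32 would not be valid. */
static uint64_t heap_size_after(void)
{
    return 32ULL << 32;
}

/* Return log2(after / before), assuming the ratio is a power of two. */
static unsigned log2_ratio(uint64_t after, uint64_t before)
{
    assert(before > 0 && after >= before);
    uint64_t ratio = after / before;
    unsigned shift = 0;
    while (ratio > 1) {
        ratio >>= 1;
        shift++;
    }
    return shift;
}
```

Calling log2_ratio(heap_size_after(), heap_size_before()) gives the exponent
of the growth factor quoted above.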

Please help.

Brian

(In the following, the buffer space message is likely a consequence of the
thread handling netlink messages dying, rather than a cause.)
Here are the log messages:

> Dec  4 17:08:14 sj2tldnslab09 vnet[19785]: dpdk_pool_create:535: ioctl(VFIO_IOMMU_MAP_DMA) pool 'dpdk_mbuf_pool_socket0': Inappropriate ioctl for device (errno 25)
>
> Dec  4 17:08:14 sj2tldnslab09 vnet[19785]: dpdk_ipsec_process:1026: not enough DPDK crypto resources, default to OpenSSL
>
> Dec  4 17:08:16 sj2tldnslab09 vnet[19785]: rtnl_ns_recv:403: Received notification while in sync. Restart synchronization.
>
> Dec  4 17:08:16 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (31) []: Bad file descriptor
>
> Dec  4 17:08:58 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: received signal SIGABRT, PC 0x7f043c3c7277
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #0  0x00007f043e5c18c5 0x7f043e5c18c5
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #1  0x00007f043c9716d0 0x7f043c9716d0
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #2  0x00007f043c3c7277 gsignal + 0x37
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #3  0x00007f043c3c8968 abort + 0x148
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #4  0x00005569eb7900d3 0x5569eb7900d3
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #5  0x00007f043d0e8512 vec_resize_allocate_memory + 0x2f2
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #6  0x00007f043dd9809f 0x7f043dd9809f
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #7  0x00007f043dd985cd ip4_fib_mtrie_route_add + 0x17d
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #8  0x00007f043e129b08 fib_entry_src_action_install + 0xb8
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #9  0x00007f043e1274a0 fib_entry_create + 0x70
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #10 0x00007f043e11e890 fib_table_entry_path_add2 + 0x190
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #11 0x00007f03f86833fd add_del_route + 0x34c
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #12 0x00007f03f8683594 netns_notify_cb + 0x8c
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #13 0x00007f03f8466e71 netns_notify + 0x1f3
>
> Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #14 0x00007f03f84684ed ns_rcv_route + 0x825
>
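
On the "No buffer space available" lines in the log: netlink(7) documents
that the kernel drops messages when a netlink socket's receive buffer is
full, and that the receiver sees this as ENOBUFS and has to resynchronize.
Independently of the crash, a common mitigation for such drops is to request
a larger socket receive buffer. The following is a minimal sketch using
plain POSIX socket calls; grow_rcvbuf is a hypothetical helper, not part of
the netlink plugin, and whether the plugin already does something similar is
not verified here.

```c
#include <assert.h>
#include <sys/socket.h>

/* Ask the kernel for a larger receive buffer on socket fd.
 * Returns the buffer size the kernel reports afterwards, or -1 on error.
 * On Linux the reported value is double the requested one, because the
 * kernel reserves space for bookkeeping overhead (see socket(7)). */
static int grow_rcvbuf(int fd, int requested_bytes)
{
    assert(requested_bytes > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested_bytes, sizeof requested_bytes) < 0)
        return -1;

    int actual = 0;
    socklen_t len = sizeof actual;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        return -1;

    return actual;
}
```

The helper would be called with the descriptor of the rtnetlink socket. On
Linux the request is capped at net.core.rmem_max unless a privileged process
uses SO_RCVBUFFORCE instead.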

On Tue, Nov 27, 2018 at 6:17 PM Brian Dickson <brian.peter.dick...@gmail.com>
wrote:

> I have been working with the netlink and router plugins, which I was able
> to build from the 18.07 tree via the instructions in vppsb/router.
>
> (NB: trying to build from anything more recent, e.g. 18.10 or 19.01
> breaks, with no obvious easy resolution).
>
> When running with these plugins, connected with an open source router
> (bird version 1.6.4 or 2.02) and with a very small routing table, it works
> really really well.
>
> (I was able to run roughly line-rate 10g even with small packets, and when
> using a second host with vpp and the span->pg->pcap to /tmp, didn't lose
> any data.)
>
> However, when trying to load up the routing table, things went sideways,
> and it seems to be something netlink-related. (This was using BGP to feed in
> 3 copies of the full routing table, each copy of which is about 750K
> routes.)
>
> I was hoping someone could provide good instructions (good == tested and
> works) on building from a more recent release of VPP to see if it's an
> issue that has been fixed.
>
> If the issue persists and/or looks to be netlink-specific, would anyone be
> able to look into it? I'm happy to provide logs etc.
>
> System is bare metal centos7.5, tons of cores, memory, etc.
>
> The first few messages in syslog look like:
>
> Nov 27 17:57:30 sj2tldnslab09 bird: Kernel dropped some netlink messages, will resync on next scan.
>
> Nov 27 17:57:30 sj2tldnslab09 vnet[127960]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
>
> Nov 27 17:57:30 sj2tldnslab09 vnet[127960]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
>
>
> After a bunch of similar groups of messages, VPP appears to crash.
>
>
> If this is a known problem or if there's something that needs to be
> tweaked on the host, any assistance would be greatly appreciated.
>
>
> Brian
>
