Greetings again,

Here is more context on the problem I'm seeing.

The problem occurs when a large-ish number of IPv4 prefixes are added to the FIB (by way of the netlink and router plugins). If the prefix count is below some threshold (e.g. 50,000 prefixes), things work fine. At some higher prefix count (I haven't narrowed it down to a specific number, but I don't think the actual number is relevant), vnet crashes with a failure within ip4_mtrie.c.

I have been trying to run in debug mode, but am having a lot of difficulty building everything with debug enabled. Basically, the only way I can successfully build everything is to use the script vagrant/build.sh (which does a make pkg-rpm that generates a bunch of rpm files that I then install with yum). Then I have to rebuild things using the instructions from vppsb/router/README.md (creating 4 symlinks and running various make iterations, and THEN having to run some of those with a bunch of CFLAGS values just to get it to compile). I don't see any good/easy way to build debug images from this environment without a LOT of work/investigation into how all the various build components work.

Is the problem easy enough to diagnose from a non-symbolic stack dump, or can someone provide details on how to build and run vpp with everything needed to use gdb, including the plugins for netlink/router, so the problem can be further isolated?

I think there's basically some kind of bug related to the FIB code in vnet that really needs to be fixed. The box has an unreasonably large amount of memory (128GB, doing nothing but VPP), and I get the same error even if I increase the initial heap size by a factor of 2^12 (changing 32<<20 to 32ULL<<32).

Please help.

Brian

(In the following, the buffer space message is likely a consequence of the thread handling netlink messages dying, rather than a cause.)

Here are the log messages:

> Dec 4 17:08:14 sj2tldnslab09 vnet[19785]: dpdk_pool_create:535: ioctl(VFIO_IOMMU_MAP_DMA) pool 'dpdk_mbuf_pool_socket0': Inappropriate ioctl for device (errno 25)
> Dec 4 17:08:14 sj2tldnslab09 vnet[19785]: dpdk_ipsec_process:1026: not enough DPDK crypto resources, default to OpenSSL
> Dec 4 17:08:16 sj2tldnslab09 vnet[19785]: rtnl_ns_recv:403: Received notification while in sync. Restart synchronization.
> Dec 4 17:08:16 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (31) []: Bad file descriptor
> Dec 4 17:08:58 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: received signal SIGABRT, PC 0x7f043c3c7277
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #0 0x00007f043e5c18c5 0x7f043e5c18c5
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #1 0x00007f043c9716d0 0x7f043c9716d0
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #2 0x00007f043c3c7277 gsignal + 0x37
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #3 0x00007f043c3c8968 abort + 0x148
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #4 0x00005569eb7900d3 0x5569eb7900d3
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #5 0x00007f043d0e8512 vec_resize_allocate_memory + 0x2f2
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #6 0x00007f043dd9809f 0x7f043dd9809f
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #7 0x00007f043dd985cd ip4_fib_mtrie_route_add + 0x17d
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #8 0x00007f043e129b08 fib_entry_src_action_install + 0xb8
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #9 0x00007f043e1274a0 fib_entry_create + 0x70
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #10 0x00007f043e11e890 fib_table_entry_path_add2 + 0x190
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #11 0x00007f03f86833fd add_del_route + 0x34c
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #12 0x00007f03f8683594 netns_notify_cb + 0x8c
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #13 0x00007f03f8466e71 netns_notify + 0x1f3
> Dec 4 17:09:07 sj2tldnslab09 vnet[19785]: #14 0x00007f03f84684ed ns_rcv_route + 0x825

On Tue, Nov 27, 2018 at 6:17 PM Brian Dickson <brian.peter.dick...@gmail.com> wrote:

> I have been working with the netlink and router plugins, which I was able to build from the 18.07 tree via the instructions in vppsb/router.
>
> (NB: trying to build from anything more recent, e.g. 18.10 or 19.01, breaks, with no obvious easy resolution.)
>
> When running with these plugins, connected with an open source router (bird version 1.6.4 or 2.02) and with a very small routing table, it works really really well.
>
> (I was able to run roughly line-rate 10g even with small packets, and when using a second host with vpp and the span->pg->pcap to /tmp, didn't lose any data.)
>
> However, when trying to load up the routing table, things went sideways, and it seems to be something netlink-related. (This was using BGP to feed in 3 copies of the full routing table, each copy of which is about 750K routes.)
>
> I was hoping someone could provide good instructions (good == tested and works) on building from a more recent release of VPP to see if it's an issue that has been fixed.
>
> If the issue persists and/or looks to be netlink-specific, would anyone be able to look into it? I'm happy to provide logs etc.
>
> System is bare metal centos7.5, tons of cores, memory, etc.
>
> The first few messages in syslog look like:
>
> Nov 27 17:57:30 sj2tldnslab09 bird: Kernel dropped some netlink messages, will resync on next scan.
> Nov 27 17:57:30 sj2tldnslab09 vnet[127960]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
> Nov 27 17:57:30 sj2tldnslab09 vnet[127960]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
>
> After a bunch of similar groups of messages, VPP appears to crash.
>
> If this is a known problem or if there's something that needs to be tweaked on the host, any assistance would be greatly appreciated.
>
> Brian
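
For reference, the heap-size change mentioned in the newer message (32<<20 to 32ULL<<32) can be checked with a few lines of arithmetic. This is only a sketch of the numbers; the constant and function names below are made up for illustration and are not VPP identifiers. Python is used because its integers do not overflow. In the C expression, the ULL suffix is presumably there because shifting a plain 32-bit int by 32 bits is undefined.

```python
# Heap sizes in bytes, taken from the two shift expressions in the message.
# The names are illustrative only; they do not come from the VPP source.
OLD_HEAP_BYTES = 32 << 20   # 32 MiB
NEW_HEAP_BYTES = 32 << 32   # 128 GiB


def growth_factor(old: int, new: int) -> int:
    """Return new / old, requiring the ratio to be a whole number."""
    if old <= 0 or new % old != 0:
        raise ValueError("new size is not a whole multiple of old size")
    return new // old


def log2_exact(n: int) -> int:
    """Return k such that n == 2**k, or raise if n is not a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("not a power of two")
    return n.bit_length() - 1


if __name__ == "__main__":
    factor = growth_factor(OLD_HEAP_BYTES, NEW_HEAP_BYTES)
    print(f"old heap: {OLD_HEAP_BYTES // 2**20} MiB")
    print(f"new heap: {NEW_HEAP_BYTES // 2**30} GiB")
    print(f"factor:   {factor} = 2^{log2_exact(factor)}")
```

The factor comes out to 2^12, matching the message, and the enlarged heap is 128 GiB.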
View/Reply Online (#11501): https://lists.fd.io/g/vpp-dev/message/11501