Hi Inder,

When this condition occurs and the system is hung, it's pretty difficult to gather diagnostic data. If your system runs for 12 days before buffers are fully exhausted, I would presume that the number of free buffers is deteriorating gradually. So it's probably best to observe what's happening on the system periodically while VPP is still functioning correctly and try to gather data before it breaks rather than afterwards.
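A rough sketch of what that periodic collection could look like follows. It is only an illustration under assumptions: it assumes `vppctl` is on the PATH and can reach the running VPP instance, the command list is just a starting point built from commands named in this thread, and the function names and the snapshot directory are hypothetical. Each vppctl call has a timeout so that a hung CLI shows up in the snapshot instead of blocking the collector.

```python
import difflib
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

# Starting point only: commands mentioned in this thread. Adjust as needed.
COMMANDS = ["show buffers", "show errors", "show runtime"]


def run_vppctl(command, timeout=30):
    """Run one vppctl command; return its output, or a marker if it fails or hangs."""
    try:
        result = subprocess.run(
            ["vppctl", *command.split()],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout + result.stderr
    except (subprocess.TimeoutExpired, OSError) as exc:
        return f"<<{command!r} failed: {exc}>>\n"


def take_snapshot(out_dir, commands=COMMANDS, runner=run_vppctl):
    """Write the output of every command into a new timestamped directory."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snap_dir = Path(out_dir) / stamp
    snap_dir.mkdir(parents=True, exist_ok=True)
    for command in commands:
        file_name = command.replace(" ", "_") + ".txt"
        (snap_dir / file_name).write_text(runner(command))
    return snap_dir


def diff_snapshots(old_dir, new_dir):
    """Return a unified diff of the files that both snapshot directories contain."""
    chunks = []
    for old_file in sorted(Path(old_dir).glob("*.txt")):
        new_file = Path(new_dir) / old_file.name
        if new_file.exists():
            chunks.extend(difflib.unified_diff(
                old_file.read_text().splitlines(),
                new_file.read_text().splitlines(),
                fromfile=str(old_file), tofile=str(new_file), lineterm="",
            ))
    return "\n".join(chunks)


def collect_forever(out_dir, interval_s=3600):
    """Take one snapshot per interval until interrupted."""
    while True:
        print(take_snapshot(out_dir))
        time.sleep(interval_s)
```

Calling `collect_forever("/var/tmp/vpp-snapshots")` (a hypothetical path) would leave one directory per hour, and `diff_snapshots()` on two of those directories shows which counters moved between runs.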
One tool that might help diagnose the problem is the bufmon plugin. You would need to enable this plugin in your startup.conf file before starting VPP and enable buffer traces as described at https://github.com/FDio/vpp/blob/master/src/plugins/bufmon/bufmon_doc.rst. If you run 'vppctl show buffer traces verbose' and then run it again a few hours later, it may be obvious which node is leaking buffers. If you were running on Ubuntu or Debian, you would need to install the vpp-plugin-devtools package to get the bufmon plugin. I'm not sure what the build/packaging situation is on SUSE, so I'll just say that you may have to find and/or install the bufmon plugin yourself in order to try this.

You could also run commands like 'vppctl show errors' and 'vppctl show runtime', run them again a while later, and compare the results between the runs to see if any pattern emerges.

If you run the diagnostic commands I mentioned above, run them again 12 hours later, and reply to this thread with the output from both runs, I can try to make an educated guess about further debugging steps. If you can share more about your configuration, like the complete contents of startup.conf and the commands or sequence of APIs you use to configure VPP (redact any info you don't want to share, like public IP addresses or encryption keys), that would also be helpful.

-Matt

On Tue, Apr 28, 2026 at 12:30 PM Inder via lists.fd.io <inderpalpatheja= [email protected]> wrote:

> Hi Team,
>
> We are seeing VPP buffer allocation failures after ~15 days of continuous uptime. VPP's CLI (vppctl) is completely unresponsive, and VRRP advertisements are failing.
>
> ## Environment
>
> - **VPP version:** 25.10
> - **OS:** SUSE-based Linux
> - **Uptime at failure:** 15 days (VPP started Apr 16 06:54:45 UTC)
>
> ## Symptoms
>
> 1. **Buffer allocation failures** in VRRP advertisement send path:
>    ```
>    vrrp_adv_send:310: Buffer allocation failed for [0] sw_if_index 231 VR ID 100 IPv4
>    vrrp_adv_send:310: Buffer allocation failed for [1] sw_if_index 235 VR ID 101 IPv4
>    ```
>    First seen: Apr 28 14:09:31 UTC (after ~12 days of uptime)
>    Ongoing: continuous, every 1-2 seconds
>
> 2. **vppctl completely unresponsive:** all CLI commands (including `show version`) hang indefinitely. Unable to collect `show buffers`, `show memory`, or any runtime state.
>
> 3. **High system load:** 23-26 (4 VPP threads: main + 3 workers)
>
> ## VPP Configuration
>
> - **Buffers:** 128,000 per NUMA (`buffers-per-numa 128000`)
> - **CPU:** main-core 1, workers on cores 2, 17, 18 (3 workers)
> - **DPDK interfaces:** 3 physical interfaces (3 RX queues each)
> - **Memory:** main-heap-page-size 1G
> - **Hugepages:** 1024 x 2MB = 2GB total, 779 free at time of failure
> - **Enabled plugins:** dpdk, ping, vrrp, af_packet, linux_nl, linux_cp, crypto_native
> - **LCP:** enabled with lcp-sync
> - **Features in use:** VLAN sub-interfaces, VRF tables, bridge domains, VRRP, IPsec (crypto_native), LCP TAP sync
>
> ## Process State at Failure
>
> - **VmSize:** 153 GB (virtual, includes mmap/hugepage regions)
> - **VmRSS:** 433 MB
> - **Private_Hugetlb:** 2,420 MB
> - **Threads:** 22
> - **CPU consumed:** 1 month+ accumulated over 15 days
>
> ## Analysis
>
> The buffer pool (128,000 buffers) appears to be exhausted after ~12 days of continuous operation. VRRP advertisements are small packets; if VPP cannot allocate even a single buffer, the pool is fully depleted. The unresponsive CLI suggests the main thread is stuck or spinning, possibly in a buffer allocation retry loop.
>
> Hugepage memory is available (779 of 1024 pages free), so this is not a memory exhaustion issue; it appears to be a buffer leak where buffers are allocated but never returned to the free pool.
>
> ## Impact
>
> - VRRP advertisements cannot be sent → peer may trigger failover
> - VPP CLI unresponsive → no runtime diagnostics possible
> - Potential data plane impact (unable to verify due to hung CLI)
>
> So, please help to answer below questions
>
> 1. Are there known buffer leak issues in VPP 25.10 with the enabled plugin combination (LCP + VRRP + IPsec + DPDK)?
> 2. Is there a way to diagnose buffer leaks when vppctl is unresponsive (e.g., via shared memory / stats segment)?
> 3. Any recommended buffer pool tuning or workarounds for long-running deployments?
>
> regards
> Inder
