Hi Inder,

Once this condition occurs and the system is hung, it's pretty difficult to
gather diagnostic data. If your system runs for 12 days before buffers are
fully exhausted, I would presume that the number of free buffers is
declining gradually. So it's probably best to check the system periodically
while VPP is still functioning correctly and gather data before it breaks
rather than afterwards.

One tool that might help diagnose the problem is the bufmon plugin. You
would need to enable this plugin in your startup.conf file before starting
VPP and enable buffer traces as described at
https://github.com/FDio/vpp/blob/master/src/plugins/bufmon/bufmon_doc.rst.
If you run 'vppctl show buffer traces verbose' and then run it again a few
hours later, it may be obvious which node is leaking buffers. On Ubuntu or
Debian, you would need to install the vpp-plugin-devtools package to get
the bufmon plugin. I'm not sure what the build/packaging situation is on
SUSE, so you may have to find and/or install the bufmon plugin separately
in order to try this.
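
To make the before/after comparison less manual, something like the
following could save the command output on a schedule. This is only a
rough sketch: the function names, the output directory, and the
three-hour interval are placeholders of mine, and the 30-second timeout
is there so the script does not block if vppctl stops responding.

```python
"""Save the output of vppctl commands to timestamped files."""
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

# Commands captured on each pass; extend this tuple as needed.
COMMANDS = ("show buffer traces verbose",)


def run_vppctl(command, timeout=30):
    """Run one vppctl command; return its stdout, or "" if it times out."""
    try:
        result = subprocess.run(
            ["vppctl", *command.split()],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return ""


def take_snapshot(out_dir, commands=COMMANDS, runner=run_vppctl, now=None):
    """Write one file per command under out_dir; return the paths written."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for command in commands:
        path = out_dir / f"{stamp}-{command.replace(' ', '_')}.txt"
        path.write_text(runner(command))
        written.append(path)
    return written


def collect_forever(out_dir, interval_s=3 * 60 * 60):
    """Take a snapshot every interval_s seconds until interrupted."""
    while True:
        take_snapshot(out_dir)
        time.sleep(interval_s)
```

Calling collect_forever() with a directory of your choice (for example
/var/tmp/vpp-snapshots, a made-up path) leaves one file per command per
pass, so two files from different times can be compared side by side.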

You could also run commands like 'vppctl show errors' and 'vppctl show
runtime', run them again a while later, and compare the results between
runs to see if any pattern emerges.
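
If the 'show errors' output is long, a small script can pull out the
counters that changed between two runs. This is a minimal sketch, assuming
each counter line starts with an integer count followed by the node and
reason text. The function names are made up for illustration, and 'show
runtime' uses a different layout that this does not parse.

```python
import re

# Assumption: a counter line is "<integer count> <descriptive text>".
# Header lines and anything else that does not start with a number are skipped.
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")


def parse_counters(text):
    """Map the descriptive text of each counter line to its count."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            # Collapse runs of whitespace so column padding does not matter.
            key = " ".join(match.group(2).split())
            counters[key] = counters.get(key, 0) + int(match.group(1))
    return counters


def counter_deltas(before, after):
    """Return (counter, increase) pairs that changed, largest increase first."""
    old = parse_counters(before)
    new = parse_counters(after)
    deltas = {key: new[key] - old.get(key, 0) for key in new}
    changed = [(key, delta) for key, delta in deltas.items() if delta != 0]
    return sorted(changed, key=lambda item: item[1], reverse=True)
```

Feeding it the saved output of two runs gives a short list of the counters
that moved, which is usually easier to scan than two full dumps.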

If you run the diagnostic commands I mentioned above, run them again 12
hours later, and reply to this thread with the output from both runs, I
can try to make an educated guess about further debugging steps. It would
also help if you could share more about your configuration, such as the
complete contents of startup.conf and the commands or sequence of APIs you
use to configure VPP (redact any info you don't want to share, like public
IP addresses or encryption keys).

-Matt


On Tue, Apr 28, 2026 at 12:30 PM Inder via lists.fd.io <inderpalpatheja=
[email protected]> wrote:

> Hi Team,
>
> We are seeing VPP buffer allocation failures after ~15 days of continuous
> uptime. VPP's CLI (vppctl) is completely unresponsive, and VRRP
> advertisements are failing.
>
> ## Environment
>
> - **VPP version:** 25.10
> - **OS:** SUSE-based Linux
> - **Uptime at failure:** 15 days (VPP started Apr 16 06:54:45 UTC)
>
> ## Symptoms
>
> 1. **Buffer allocation failures** in VRRP advertisement send path:
>    ```
>    vrrp_adv_send:310: Buffer allocation failed for [0] sw_if_index 231 VR ID 100 IPv4
>    vrrp_adv_send:310: Buffer allocation failed for [1] sw_if_index 235 VR ID 101 IPv4
>    ```
>    First seen: Apr 28 14:09:31 UTC (after ~12 days of uptime)
>    Ongoing: continuous, every 1-2 seconds
>
> 2. **vppctl completely unresponsive** — all CLI commands (including `show
> version`) hang indefinitely. Unable to collect `show buffers`, `show
> memory`, or any runtime state.
>
> 3. **High system load:** 23-26 (4 VPP threads: main + 3 workers)
>
> ## VPP Configuration
>
> - **Buffers:** 128,000 per NUMA (`buffers-per-numa 128000`)
> - **CPU:** main-core 1, workers on cores 2, 17, 18 (3 workers)
> - **DPDK interfaces:** 3 physical interfaces (3 RX queues each)
> - **Memory:** main-heap-page-size 1G
> - **Hugepages:** 1024 x 2MB = 2GB total, 779 free at time of failure
> - **Enabled plugins:** dpdk, ping, vrrp, af_packet, linux_nl, linux_cp,
> crypto_native
> - **LCP:** enabled with lcp-sync
> - **Features in use:** VLAN sub-interfaces, VRF tables, bridge domains,
> VRRP, IPsec (crypto_native), LCP TAP sync
>
> ## Process State at Failure
>
> - **VmSize:** 153 GB (virtual, includes mmap/hugepage regions)
> - **VmRSS:** 433 MB
> - **Private_Hugetlb:** 2,420 MB
> - **Threads:** 22
> - **CPU consumed:** 1 month+ accumulated over 15 days
>
> ## Analysis
>
> The buffer pool (128,000 buffers) appears to be exhausted after ~12 days
> of continuous operation. VRRP advertisements are small packets — if VPP
> cannot allocate even a single buffer, the pool is fully depleted. The
> unresponsive CLI suggests the main thread is stuck or spinning, possibly in
> a buffer allocation retry loop.
>
> Hugepage memory is available (779 of 1024 pages free), so this is not a
> memory exhaustion issue — it appears to be a buffer leak where buffers are
> allocated but never returned to the free pool.
>
> ## Impact
>
> - VRRP advertisements cannot be sent → peer may trigger failover
> - VPP CLI unresponsive → no runtime diagnostics possible
> - Potential data plane impact (unable to verify due to hung CLI)
>
> So, please help to answer below questions
>
> 1. Are there known buffer leak issues in VPP 25.10 with the enabled plugin
> combination (LCP + VRRP + IPsec + DPDK)?
> 2. Is there a way to diagnose buffer leaks when vppctl is unresponsive
> (e.g., via shared memory / stats segment)?
> 3. Any recommended buffer pool tuning or workarounds for long-running
> deployments?
>
> regards
> Inder
>
> 
>
>