On Wed, 15 Apr 2026 01:08:18 +0530 Vipin Varghese <[email protected]> wrote:
> This series introduces a topology library that groups DPDK lcores based on
> CPU cache hierarchy and NUMA topology. The goal is to provide a stable and
> explicit API that allows applications to select lcores with better locality
> and cache sharing characteristics.
>
> The series includes:
> - EAL support for topology discovery using hwloc and domain-based lcore
>   grouping (L1/L2/L3/L4/NUMA)
> - Topology-aware test cases validating API behavior and edge conditions
> - Programmer’s guide describing the topology library and APIs
>
> The API is marked experimental and does not change existing lcore behavior
> unless explicitly used by the application.
>
> Changes in v5:
> - Addressed review comments from v4
> - Fixed ARM cross-compilation issues
> - Cleaned up domain iteration and error handling
> - Updated tests to cover domain edge cases
> - Documentation refinements and API usage clarification
>
> Changes in v4:
> - Corrected domain selection semantics
> - Updated example usage
> - Fixed naming and typo issues
>
> Changes in v3:
> - Fixed macro naming (USE_NO_TOPOLOGY)
> - Minor cleanups based on early feedback
>
> Tested on:
> - AMD EPYC (Milan, Genoa, Siena, Turin, Turin-Dense, Sorano)
> - Intel Xeon (SPR-SP, GNR-SP)
> - ARM Ampere
> - NVIDIA Grace Superchip
>
> Dependencies:
> - hwloc-dev (tested with 2.10.0)
>
> Patch breakdown:
> 1/3 eal/topology: add topology grouping for lcores
> 2/3 app: add topology-aware test cases
> 3/3 doc: add topology library documentation
>
> Future Work:
> - integrate into examples
> -- helloworld: ready
> -- pkt-distributor: in progress
> -- l2fwd: ready
> -- l3fwd: to start
> -- eventdev_pipeline: PoC ready
> - integrate topology test
> -- crypto: yet to start
> -- compression: yet to start
> -- dma: PoC ready
> - add new features for
> -- PQoS: yet to start
> -- Data Injection: PoC with BRDCM Thor-2 ready
>
> Tested OS: Linux only; need help with BSD and Windows
>
> Tested with and without the hwloc-dev library on:
> - Ampere, aarch64, Neoverse-N1, NUMA-2, 256 CPU threads
> - Grace Superchip, aarch64, Neoverse-V2, NUMA-2, 144 CPU threads
> - Intel GNR-SP, 6767P, NUMA-2, 256 threads
> - AMD EPYC Siena, 8534P, NUMA-1, 128 threads
> - AMD EPYC Sorano, 8635P, NUMA-1, 168 threads
>
> Signed-off-by: Vipin Varghese <[email protected]>
>
> Vipin Varghese (3):
>   eal/topology: add Topology grouping for lcores
>   app: add topology aware test case
>   doc: add new section topology
>
>  app/test/meson.build                   |   1 +
>  app/test/test_ring_perf.c              | 416 +++++++++++++-
>  app/test/test_stack_perf.c             | 409 ++++++++++++++
>  app/test/test_topology.c               | 676 ++++++++++++++++++++++
>  config/meson.build                     |  18 +
>  doc/api/doxy-api-index.md              |   1 +
>  doc/guides/prog_guide/index.rst        |   3 +-
>  doc/guides/prog_guide/topology_lib.rst | 155 +++++
>  lib/eal/common/eal_private.h           |  74 +++
>  lib/eal/common/eal_topology.c          | 747 +++++++++++++++++++++++++
>  lib/eal/common/meson.build             |   1 +
>  lib/eal/freebsd/eal.c                  |  10 +-
>  lib/eal/include/meson.build            |   1 +
>  lib/eal/include/rte_topology.h         | 255 +++++++++
>  lib/eal/linux/eal.c                    |   7 +
>  lib/eal/meson.build                    |   4 +
>  16 files changed, 2773 insertions(+), 5 deletions(-)
>  create mode 100644 app/test/test_topology.c
>  create mode 100644 doc/guides/prog_guide/topology_lib.rst
>  create mode 100644 lib/eal/common/eal_topology.c
>  create mode 100644 lib/eal/include/rte_topology.h

An AI review reported a number of possible correctness problems:

# DPDK Patch Review - bundle-1843-topology-v5.mbox

## Overview

This patch series introduces topology awareness to DPDK, allowing applications to select logical cores based on CPU cache and I/O topology. The series adds a new `rte_topology` library with APIs to query and enumerate logical cores within topology domains (L1/L2/L3/L4/NUMA).
---

## Correctness Bugs (HIGHEST PRIORITY - report at >=50% confidence)

### Patch 1/3: lib/eal/common/eal_topology.c

#### **Error: Resource leak on `eal_topology_map_layer()` failure**

In `rte_eal_topology_init()`, if any call to `eal_topology_map_layer()` fails (returns < 0), the code calls `rte_eal_topology_release()`, which frees previously allocated layers. However, the `hwloc_topology_t` handle `topo_cnfg.topology` is **NOT** destroyed before returning. This leaks the hwloc topology handle.

**Location:** `lib/eal/common/eal_topology.c:685-712`

```c
for (int i = 0; i < 5; i++) {
	*layers[i].count = hwloc_get_nbobjs_by_depth(topo_cnfg.topology,
			layers[i].depth);
	if (eal_topology_map_layer(topo_cnfg.topology, layers[i].depth,
			layers[i].count, layers[i].ptr,
			layers[i].total_cores, layers[i].name) < 0) {
		rte_eal_topology_release(); /* frees layer memory */
		return -1; /* BUG: topo_cnfg.topology NOT destroyed here */
	}
}

hwloc_topology_destroy(topo_cnfg.topology); /* only reached on success */
topo_cnfg.topology = NULL;
```

**Fix:** Destroy the topology before returning on error:

```c
if (eal_topology_map_layer(...) < 0) {
	hwloc_topology_destroy(topo_cnfg.topology);
	topo_cnfg.topology = NULL;
	rte_eal_topology_release();
	return -1;
}
```

---

#### **Error: Potential use-after-free in `eal_topology_map_layer()` on partial allocation failure**

In `eal_topology_map_layer()`, the code allocates `dm->cores` for each domain. If a later allocation in the same loop iteration fails (e.g., for domain `j+1`), the function returns -1 immediately **without freeing `dm->cores` already allocated for earlier domains**. The caller (`rte_eal_topology_init()`) then calls `rte_eal_topology_release()`, which expects `layer_ptr[j]` to be non-NULL, but `layer_ptr[j]->cores` may be uninitialized or garbage if the allocation failed before that point. This can cause a use-after-free or double-free when `rte_eal_topology_release()` calls `rte_free(d->map[i]->cores)`.
**Location:** `lib/eal/common/eal_topology.c:537-557`

```c
for (uint16_t j = 0; j < *layer_cnt; j++) {
	hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, depth, j);
	int cpu_count = hwloc_bitmap_weight(obj->cpuset);
	if (cpu_count == -1)
		continue;

	struct core_domain_mapping *dm = rte_zmalloc(NULL,
			sizeof(struct core_domain_mapping), 0);
	if (!dm)
		return -1; /* BUG: leaks layer_ptr array allocated by caller */

	(*layer_ptr)[j] = dm;
	CPU_ZERO(&dm->core_set);
	dm->core_count = 0;
	dm->cores = rte_malloc(NULL, sizeof(uint16_t) * cpu_count, 0);
	if (!dm->cores)
		return -1; /* BUG: leaks dm (just allocated) and previous entries */
}
```

**Fix:** On allocation failure, free all previously allocated entries before returning:

```c
if (!dm) {
	/* Free all previously allocated entries */
	for (uint16_t k = 0; k < j; k++) {
		if ((*layer_ptr)[k]) {
			rte_free((*layer_ptr)[k]->cores);
			rte_free((*layer_ptr)[k]);
		}
	}
	rte_free(*layer_ptr);
	return -1;
}

/* Same cleanup for the dm->cores allocation failure */
if (!dm->cores) {
	rte_free(dm);
	for (uint16_t k = 0; k < j; k++) {
		if ((*layer_ptr)[k]) {
			rte_free((*layer_ptr)[k]->cores);
			rte_free((*layer_ptr)[k]);
		}
	}
	rte_free(*layer_ptr);
	return -1;
}
```

---

#### **Error: NULL pointer dereference in `eal_topology_map_layer()` when `hwloc_get_obj_by_depth()` returns NULL**

In the second loop of `eal_topology_map_layer()` (lines 560-620), the code calls `hwloc_get_obj_by_depth(topology, depth, j)` and checks `if (!obj || hwloc_bitmap_iszero(obj->cpuset))` to skip NULL objects. However, in the **inner loop** (lines 579-618), the code calls `hwloc_get_obj_by_depth(topology, depth, k)` and assigns the result to `obj_core`, but does **NOT** check whether `obj_core` is NULL before calling `hwloc_bitmap_weight(obj_core->cpuset)` on line 581. If `hwloc_get_obj_by_depth()` returns NULL for domain `k`, this will dereference a NULL pointer.
**Location:** `lib/eal/common/eal_topology.c:579-582`

```c
for (uint16_t k = 0; k < *layer_cnt; k++) {
	hwloc_obj_t obj_core = hwloc_get_obj_by_depth(topology, depth, k);
	int cpu_count_core = hwloc_bitmap_weight(obj_core->cpuset); /* NULL deref if obj_core == NULL */
	if (cpu_count_core == -1)
		continue;
```

**Fix:** Check `obj_core` before dereferencing:

```c
hwloc_obj_t obj_core = hwloc_get_obj_by_depth(topology, depth, k);
if (!obj_core)
	continue;

int cpu_count_core = hwloc_bitmap_weight(obj_core->cpuset);
if (cpu_count_core == -1)
	continue;
```

---

#### **Error: Incorrect second argument to `rte_topo_get_nth_lcore_from_domain()` in `get_same_l1_domains()` (test_ring_perf.c and test_stack_perf.c)**

In both `app/test/test_ring_perf.c:290` and `app/test/test_stack_perf.c:258`, the function `get_same_l1_domains()` calls:

```c
id2 = rte_topo_get_nth_lcore_from_domain(domain, 0, 0, RTE_TOPO_DOMAIN_L1);
```

The second argument (`lcore_pos`) is `0`, the same as for `id1`. This assigns **the same lcore** to both `id1` and `id2`, causing the subsequent check `if (id1 == id2) return 3;` to always trigger. This is a logic error: the intent is clearly to get two **different** lcores from the same domain.

**Location:** `app/test/test_ring_perf.c:287-290` and `app/test/test_stack_perf.c:255-258`

**Fix:** Use position `1` for the second lcore:

```c
id1 = rte_topo_get_nth_lcore_from_domain(domain, 0, 0, RTE_TOPO_DOMAIN_L1);
id2 = rte_topo_get_nth_lcore_from_domain(domain, 1, 0, RTE_TOPO_DOMAIN_L1);
```

---

#### **Error: Iteration condition in `test_main_lcore_in_domain()` uses wrong domain type for lookup**

In `app/test/test_topology.c:211`, the loop iterates over `domain_count` for `domain_types[d]`, but the call to `rte_topo_is_main_lcore_in_domain()` uses `RTE_TOPO_DOMAIN_NUMA` instead of `domain_types[d]`. This means the test only checks the NUMA domain regardless of which domain type `d` selects (L1/L2/L3/L4).
**Location:** `app/test/test_topology.c:206-216`

```c
for (unsigned int d = 0; d < RTE_DIM(domain_types); d++) {
	bool main_lcore_found = false;
	unsigned int domain_count = rte_topo_get_domain_count(domain_types[d]);

	for (unsigned int dmn_idx = 0; dmn_idx < domain_count; dmn_idx++) {
		main_lcore_found = rte_topo_is_main_lcore_in_domain(
				RTE_TOPO_DOMAIN_NUMA, /* BUG: should be domain_types[d] */
				dmn_idx);
		if (main_lcore_found)
			break;
	}
```

**Fix:**

```c
main_lcore_found = rte_topo_is_main_lcore_in_domain(domain_types[d], dmn_idx);
```

---

#### **Error: Infinite loop risk in `rte_topo_get_nth_lcore_from_domain()` when `ptr->core_count` is 0**

In `lib/eal/common/eal_topology.c:296-318`, the function enters a `while (1)` loop that increments `new_lcore_pos`. If `ptr->core_count` is 0, the loop would wrap `new_lcore_pos` back to 0 indefinitely, never breaking. While the function returns `RTE_MAX_LCORE` if `ptr->core_count == 0` before the loop, the logic flow is unclear and the loop body has no clear termination condition of its own for that case.

**Location:** `lib/eal/common/eal_topology.c:283-318`

**Fix:** Add a sanity check inside the loop to prevent infinite iteration:

```c
unsigned int iterations = 0;

while (1) {
	if (iterations++ > ptr->core_count * 2) /* safety limit */
		return RTE_MAX_LCORE;
	/* ... rest of loop ... */
}
```

However, the code already returns `RTE_MAX_LCORE` if `ptr->core_count == 0` on line 287, so this is more of a defensive-programming note. The function should be refactored for clarity.

---

#### **Error: Missing NULL check after `get_domain_lcore_mapping()` in `rte_topo_get_next_lcore()`**

In `rte_topo_get_next_lcore()`, the code calls `get_domain_lcore_mapping(flag, lcore_domain)` and checks whether `ptr` is NULL on line 350. If `ptr` is NULL, the function returns `RTE_MAX_LCORE`.
This is correct, but the subsequent logic on line 381 calls `rte_topo_is_main_lcore_in_domain(flag, lcore_domain)`, which internally may call `get_domain_lcore_mapping()` again. If that call also returns NULL (which it will if the domain is invalid), `rte_topo_is_main_lcore_in_domain()` returns `false`, which is safe. However, the logic is fragile and should explicitly handle the NULL case rather than rely on transitive safety.

**Location:** `lib/eal/common/eal_topology.c:381`

**Recommendation:** The code is technically safe but could be clearer. No change required, but consider restructuring for maintainability.

---

### Patch 2/3: app/test Topology Tests

#### **Error: Macro `RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN` declares a variable in its expansion (shadowing risk)**

In `lib/eal/include/rte_topology.h:243-248`, the macro `RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN` declares a local variable `main_lcore` inside the macro expansion:

```c
#define RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN(lcore, domain_indx, flag) \
	lcore = rte_topo_get_nth_lcore_from_domain(domain, 0, 0, flag); \
	uint16_t main_lcore = rte_get_main_lcore(); \
	for (lcore = (lcore != main_lcore) ? \
			lcore : rte_topo_get_next_lcore(lcore, 1, 0, flag); \
			lcore < RTE_MAX_LCORE; \
			lcore = rte_topo_get_next_lcore(lcore, 1, 0, flag))
```

This can cause a compiler error or shadowing if the caller already has a variable named `main_lcore` in scope. Additionally, the macro uses `domain` (line 244) while the parameter is `domain_indx`; this typo will cause a compilation error.

**Location:** `lib/eal/include/rte_topology.h:243-248`

**Fix:** Use a uniquely-named (reserved-prefix) variable for the internal declaration, or document that the macro must not be used where `main_lcore` is already declared. (A `do { } while (0)` wrapper is not an option here, since the trailing `for` must remain open to take the caller's loop body.)
Also fix the typo:

```c
#define RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN(lcore, domain_indx, flag) \
	lcore = rte_topo_get_nth_lcore_from_domain(domain_indx, 0, 0, flag); \
	uint16_t __topo_main_lcore = rte_get_main_lcore(); \
	for (lcore = (lcore != __topo_main_lcore) ? \
			lcore : rte_topo_get_next_lcore(lcore, 1, 0, flag); \
			lcore < RTE_MAX_LCORE; \
			lcore = rte_topo_get_next_lcore(lcore, 1, 0, flag))
```

---

### Summary of Correctness Bugs

1. **hwloc topology leak** on `eal_topology_map_layer()` failure
2. **Resource leak** in `eal_topology_map_layer()` on partial allocation failure
3. **NULL pointer dereference** in the `eal_topology_map_layer()` inner loop
4. **Logic error** in `get_same_l1_domains()` (same lcore assigned to `id1` and `id2`)
5. **Wrong domain type** in `test_main_lcore_in_domain()` (uses `RTE_TOPO_DOMAIN_NUMA` instead of `domain_types[d]`)
6. **Macro typo** in `RTE_TOPO_FOREACH_WORKER_LCORE_IN_DOMAIN` (uses `domain` instead of `domain_indx`)
7. **Potential infinite loop** in `rte_topo_get_nth_lcore_from_domain()` if `ptr->core_count == 0` (mitigated by early return)

