https://bugs.dpdk.org/show_bug.cgi?id=260
           Bug ID: 260
          Summary: DPDK lock-free hash deletion
          Product: DPDK
          Version: 18.11
         Hardware: All
               OS: All
           Status: CONFIRMED
         Severity: normal
         Priority: Normal
        Component: other
         Assignee: dev@dpdk.org
         Reporter: zhongdahulin...@163.com
 Target Milestone: ---

A lock-free hash table is created with the flags
RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF |
RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD so that it supports multiple
writers. The RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD flag lets each lcore
allocate key slots from a per-lcore local cache:

struct rte_hash *
rte_hash_create(const struct rte_hash_parameters *params)
{
        ...
        if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) {
                use_local_cache = 1;
                writer_takes_lock = 1;
        }
        ...
        /* Store all keys and leave the first entry as a dummy entry
         * for lookup_bulk */
        if (use_local_cache)
                /*
                 * Increase number of slots by total number of indices
                 * that can be stored in the lcore caches
                 * except for the first cache
                 */
                num_key_slots = params->entries + (RTE_MAX_LCORE - 1) *
                                        (LCORE_CACHE_SIZE - 1) + 1;
        else
                num_key_slots = params->entries + 1;
        ...
}

The benefit is that each writer allocates key slots from its own local
cache, which reduces cache misses and improves insert performance.

First, a note on how rte_hash stores keys and data:

| dummy | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | ...
    0          1             2             3             4             5             6             7             8
        | <----------------------------------------------- bucket -----------------------------------------------> |

The key_store member of struct rte_hash is this array; each entry holds
the key contents and the data pointer. Index 0 is unused, and valid
entries start at index 1. The array is divided into buckets of 8
entries each. There is a reason for this: rte_hash is a cuckoo hash,
which resolves collisions with buckets rather than with chaining. Take
an insert as an example: the key is first hashed to its primary bucket,
which is scanned for a free slot; if the primary bucket is full, the
secondary bucket is tried.

The structure definitions:

/* Structure that stores key-value pair */
struct rte_hash_key {
        union {
                uintptr_t idata;
                void *pdata;
        };
        /* Variable key size */
        char key[0];
};

/** Bucket structure */
struct rte_hash_bucket {
        uint16_t sig_current[RTE_HASH_BUCKET_ENTRIES];
        uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
        uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
        void *next;
} __rte_cache_aligned;

The free key indices are kept in an rte_ring:

struct rte_hash *
rte_hash_create(const struct rte_hash_parameters *params)
{
        ...
        /* Populate free slots ring. Entry zero is reserved for key misses. */
        for (i = 1; i < num_key_slots; i++)
                rte_ring_sp_enqueue(r, (void *)((uintptr_t) i));
        ...
}

With the lock-free hash table, deleting an entry defers reclamation of
the key and value memory, so multiple readers can access the table
without taking a lock, which greatly reduces the overhead in multi-core
applications. Calling the DPDK API rte_hash_del_key_xxx() only unlinks
the entry from the table and returns the key's storage position, i.e.
the key index described above. Once every reader/writer that referenced
the entry has dropped its reference, the key and value storage is
reclaimed via that position. The key is reclaimed through this
interface:

int __rte_experimental
rte_hash_free_key_with_position(const struct rte_hash *h,
                                const int32_t position)
{
        RETURN_IF_TRUE(((h == NULL) || (position == EMPTY_SLOT)), -EINVAL);

        unsigned int lcore_id, n_slots;
        struct lcore_cache *cached_free_slots;
        const int32_t total_entries = h->num_buckets * RTE_HASH_BUCKET_ENTRIES;

        /* Out of bounds */
        if (position >= total_entries)
                return -EINVAL;

        if (h->use_local_cache) {
                lcore_id = rte_lcore_id();
                cached_free_slots = &h->local_free_slots[lcore_id];
                /* Cache full, need to free it. */
                if (cached_free_slots->len == LCORE_CACHE_SIZE) {
                        /* Need to enqueue the free slots in global ring. */
                        n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
                                                cached_free_slots->objs,
                                                LCORE_CACHE_SIZE, NULL);
                        cached_free_slots->len -= n_slots;
                }
                /* Put index of new free slot in cache. */
                cached_free_slots->objs[cached_free_slots->len] =
                                        (void *)((uintptr_t)position);
                cached_free_slots->len++;
        } else {
                rte_ring_sp_enqueue(h->free_slots,
                                (void *)((uintptr_t)position));
        }

        return 0;
}
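To make the intended call sequence concrete, here is a minimal usage
sketch (not part of the original report). delete_and_reclaim() and
wait_for_readers_to_quiesce() are hypothetical names; the latter stands
in for whatever quiescent-state mechanism (e.g. an RCU-style grace
period) the application uses, since DPDK 18.11 leaves that to the
caller:

#include <rte_hash.h>

/* Hypothetical: application-specific grace-period mechanism. */
extern void wait_for_readers_to_quiesce(void);

static int
delete_and_reclaim(struct rte_hash *h, const void *key)
{
        /* Unlink the entry; concurrent readers may still reference it. */
        int32_t position = rte_hash_del_key(h, key);

        if (position < 0)
                return position; /* key not found or invalid parameters */

        /* Wait until no reader can still hold a reference to the entry. */
        wait_for_readers_to_quiesce();

        /* Only now recycle the key slot (the function analyzed below). */
        return rte_hash_free_key_with_position(h, position);
}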
A close look at the rte_hash implementation reveals two problems in
rte_hash_free_key_with_position():

1. The "Out of bounds" check

This check is correct when the table's use_local_cache flag is not set,
because the number of key indices then exactly matches the number of
table entries. When use_local_cache is set, however, it is wrong. Look
back at the key-slot calculation in rte_hash_create():

        /* Store all keys and leave the first entry as a dummy entry
         * for lookup_bulk */
        if (use_local_cache)
                /*
                 * Increase number of slots by total number of indices
                 * that can be stored in the lcore caches
                 * except for the first cache
                 */
                num_key_slots = params->entries + (RTE_MAX_LCORE - 1) *
                                        (LCORE_CACHE_SIZE - 1) + 1;
        else
                num_key_slots = params->entries + 1;

In that case, besides the entries key slots, additional key slots are
allocated to back each lcore's cache of LCORE_CACHE_SIZE entries, so
the number of key slots exceeds the table's total number of entries. A
valid position can therefore be >= total_entries, which makes the "Out
of bounds" check in rte_hash_free_key_with_position() reject it
incorrectly.

2. The position parameter

position is the return value of rte_hash_del_key_xxx(), which is
(key_idx - 1). As noted above, key index 0 is unused and valid indices
start at 1, so enqueueing position directly into the free_slots ring is
incorrect:

        rte_ring_sp_enqueue(h->free_slots, (void *)((uintptr_t)position));

This can leave multiple entries of value 0 in the ring, corrupting it.
Take a ring of size 4 as an example.

Initial ring state:

        | 1 | 2 | 3 | 4 |

Dequeue all the key indices (by adding entries), then delete the
entries; the returned positions are 0, 1, 2, 3, and they are enqueued
back. After one dequeue/enqueue round:

        | 0 | 1 | 2 | 3 |

Dequeue all the key indices again and delete the entries; the returned
positions are 0, 1, 2 (key_idx 0 is invalid and never used), and they
are enqueued back. After two dequeue/enqueue rounds:

        | 0 | 0 | 1 | 2 |

Compare this with the non-lock-free reclamation path: remove_entry()
enqueues key_idx itself, so it does not have this problem:

static inline void
remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt,
                unsigned i)
{
        unsigned lcore_id, n_slots;
        struct lcore_cache *cached_free_slots;

        if (h->use_local_cache) {
                lcore_id = rte_lcore_id();
                cached_free_slots = &h->local_free_slots[lcore_id];
                /* Cache full, need to free it. */
                if (cached_free_slots->len == LCORE_CACHE_SIZE) {
                        /* Need to enqueue the free slots in global ring. */
                        n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
                                                cached_free_slots->objs,
                                                LCORE_CACHE_SIZE, NULL);
                        cached_free_slots->len -= n_slots;
                }
                /* Put index of new free slot in cache. */
                cached_free_slots->objs[cached_free_slots->len] =
                                (void *)((uintptr_t)bkt->key_idx[i]);
                cached_free_slots->len++;
        } else {
                rte_ring_sp_enqueue(h->free_slots,
                                (void *)((uintptr_t)bkt->key_idx[i]));
        }
}

Suggested fix (see the sketch below):

1. Correct the "Out of bounds" check.
2. Add 1 to position before enqueueing it back into the free_slots
   ring.
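The following is a minimal sketch of the two suggested changes, under
the assumption that struct rte_hash exposes the entries,
use_local_cache and free_slots members as in DPDK 18.11; it is not the
upstream patch. The bounds check mirrors the num_key_slots computation
from rte_hash_create(), and the value recycled is the real key index
(position + 1), matching what rte_hash_create() enqueued originally.
Unchanged parts of the function are elided with "..." as in the
snippets above:

int __rte_experimental
rte_hash_free_key_with_position(const struct rte_hash *h,
                                const int32_t position)
{
        /* position is (key_idx - 1); recover the real key index. */
        const uint32_t key_idx = position + 1;
        uint32_t total_key_slots;

        RETURN_IF_TRUE(((h == NULL) || (position == EMPTY_SLOT)), -EINVAL);

        /* Mirror the num_key_slots computation in rte_hash_create(). */
        if (h->use_local_cache)
                total_key_slots = h->entries + (RTE_MAX_LCORE - 1) *
                                        (LCORE_CACHE_SIZE - 1) + 1;
        else
                total_key_slots = h->entries + 1;

        /* Out of bounds: valid key indices are 1 .. total_key_slots - 1. */
        if (key_idx >= total_key_slots)
                return -EINVAL;

        if (h->use_local_cache) {
                ...
                /* Recycle key_idx, not position, into the local cache. */
                cached_free_slots->objs[cached_free_slots->len] =
                                        (void *)((uintptr_t)key_idx);
                cached_free_slots->len++;
        } else {
                rte_ring_sp_enqueue(h->free_slots,
                                (void *)((uintptr_t)key_idx));
        }

        return 0;
}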