Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function
On 07/31/2018 10:57 AM, Wiles, Keith wrote:

On Jul 31, 2018, at 1:09 AM, Fu, Qiaobin wrote:

Hi Yipeng,

Thanks for the feedback!

On Jul 30, 2018, at 4:24 PM, Wang, Yipeng1 wrote:

Hi, Qiaobin,

Thanks for the patch. If I understand correctly, your use case is to use the hash table as a "cache" whose new entries should evict stale ones automatically. Could you be more specific about your use scenarios so that I can have more context?

Actually, we didn't use the hash table as a "cache", and there is no automation to evict stale entries. Instead, the function rte_hash_bucket_iterate() allows callers to iterate over the hash table incrementally, i.e., one bucket at a time, so that the caller doesn't have to iterate over the whole hash table at once. This avoids creating hiccups and achieves better cache performance. One possible scenario is when DDoS attacks happen: one may want to spend more CPU time processing the packets, so iterating over the hash table incrementally is desirable.

I do not see this as a cache, but only a way to iterate over the hash table.

We are actually working on an extendable version of rte_hash which will link extra buckets to the current table if the table is full. As long as the table is sized appropriately, there will be no conflict issues, so you don't need to replace old entries. Will this solve your issue? Also, if the "cache" mode is what you want, we have the membership library "rte_member". Is it applicable to your case?

I agree that adding extra buckets to the current table when it is full can alleviate this scenario. Indeed, we also considered this solution before coming up with our proposed solution. However, it's still highly desirable to have a bucket iterator function. Considering the scenario where the machine's memory is limited and cannot always allocate new memory to the hash table (e.g., DDoS attacks, switch measurement tasks, etc.), a solution is to allow callers to evict some less important entries (the criteria for the eviction are based on the caller's needs). Indeed, we don't have a "cache" mode; our implementation is a way to achieve better cache performance. So rte_member won't help here.

W.r.t. the patch set you proposed, my concern is that the new API will expose too many internals to the user, e.g. the bucket/entries structure should be implementation dependent. Exposing internal structures would prevent improving the implementation down the road unless we keep changing the API, which is not recommended.

The functions we add here give callers a way to iterate over the specific bucket that potentially contains the entry. These functions can be made general enough to allow callers to heuristically evict some entries and add new ones to the hash table. Otherwise, there is no way to evict some less important entries.

We have many other iterators in DPDK and this one is no different. If we can hide the internals, then it would be good. The callback function should not expose any internals, only the value in the hash table, which I believe it does, right?

The use case that led to this iterator is the following: a hash table of flows is overloaded due to a DoS attack. The easy option is to ignore new flows, but this is not optimal when there's information pointing out that a given new flow is more important than one flow in the bucket in which the new flow must be inserted. So the high-level interpretation for this iterator is to find out which are the candidates such that one must be dropped to add a new flow.
That is why the patch adds rte_hash_get_primary_bucket() and rte_hash_get_secondary_bucket(). We don't need more candidates than these because the memory lookups would be overwhelming. In fact, we have found that the eight candidates that rte_hash_get_primary_bucket() gives are already good enough.

Once we solved the problem above, we noticed that, with small adjustments, the code would also make it easy to scan a hash table for stale entries one bucket at a time instead of one entry at a time (the regular iterator). The idea is that once one reads the bucket location, the information about all of its entries comes into the processor cache together, so we can take advantage of the information that is already there.

While the second feature is good to have, the first one is the feature we must have in our software.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function
On 07/31/2018 09:40 PM, Wang, Yipeng1 wrote:

How about an API that is more universal? For example, an API such as "rte_iterate_conflict_entries". After an insertion failure, this function will iterate all entries that may conflict with the newly inserted key and you could decide which entry to evict?

Fine. We'll rewrite the patch to do so. Thank you for the feedback.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function
On 08/03/2018 11:24 AM, Stephen Hemminger wrote:

Often, for time-based cleanup, it is better to have a second linked list that is ordered by time value. Then the cleanup code can start at the oldest entry and stop when it reaches the last item that could expire. That does mean having some form of lock and doing a delete/insert on every usage, i.e.:

spinlock(&timer_lock);
TAILQ_REMOVE(&timer_list, entry, timer_list);
entry->expiration = new_time;
TAILQ_INSERT_TAIL(&timer_list, entry, timer_list);
spinunlock(&timer_lock);

We'll try it. Thanks for the suggestion, Stephen.

[ ]'s
Michel Machado
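For context, a minimal sketch of the arrangement Stephen describes; the flow_entry type and its field names are assumptions for illustration, not part of any patch:

#include <stdint.h>
#include <sys/queue.h>
#include <rte_spinlock.h>

/* Hypothetical entry that lives both in the hash table and in a
 * time-ordered list; the names are illustrative only. */
struct flow_entry {
        uint64_t expiration;                    /* absolute expiration time */
        TAILQ_ENTRY(flow_entry) timer_list;     /* linkage in the timer list */
};

static TAILQ_HEAD(, flow_entry) timer_list = TAILQ_HEAD_INITIALIZER(timer_list);
static rte_spinlock_t timer_lock = RTE_SPINLOCK_INITIALIZER;

/* On every use of an entry, move it to the tail with a new expiration,
 * so the list stays ordered from oldest to newest. */
static void
flow_touch(struct flow_entry *entry, uint64_t new_expiration)
{
        rte_spinlock_lock(&timer_lock);
        TAILQ_REMOVE(&timer_list, entry, timer_list);
        entry->expiration = new_expiration;
        TAILQ_INSERT_TAIL(&timer_list, entry, timer_list);
        rte_spinlock_unlock(&timer_lock);
}

/* Cleanup starts at the oldest entry and stops at the first entry
 * that has not expired yet. */
static void
flow_expire(uint64_t now)
{
        struct flow_entry *entry;

        rte_spinlock_lock(&timer_lock);
        while ((entry = TAILQ_FIRST(&timer_list)) != NULL &&
                        entry->expiration <= now) {
                TAILQ_REMOVE(&timer_list, entry, timer_list);
                /* Also remove the entry from the hash table and free it. */
        }
        rte_spinlock_unlock(&timer_lock);
}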
Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries
On 08/16/2018 10:33 PM, Honnappa Nagarahalli wrote:

+/* Get the primary bucket index given the precomputed hash value. */
+static inline uint32_t
+rte_hash_get_primary_bucket(const struct rte_hash *h, hash_sig_t sig)
+{
+        return sig & h->bucket_bitmask;
+}
+
+/* Get the secondary bucket index given the precomputed hash value. */
+static inline uint32_t
+rte_hash_get_secondary_bucket(const struct rte_hash *h, hash_sig_t sig)
+{
+        return rte_hash_secondary_hash(sig) & h->bucket_bitmask;
+}
+

IMO, to keep the code consistent, we do not need to have the above 2 functions.

Ok.

+int32_t __rte_experimental
+rte_hash_iterate_conflict_entries(struct rte_conflict_iterator_state *state,
+        const void **key, const void **data)
+{
+        struct rte_hash_iterator_conflict_entries_state *__state;
+
+        RETURN_IF_TRUE(((state == NULL) || (key == NULL) ||
+                (data == NULL)), -EINVAL);
+
+        __state = (struct rte_hash_iterator_conflict_entries_state *)state;
+
+        while (__state->vnext < RTE_HASH_BUCKET_ENTRIES * 2) {
+                uint32_t bidx = (__state->vnext < RTE_HASH_BUCKET_ENTRIES) ?
+                        __state->primary_bidx : __state->secondary_bidx;
+                uint32_t next = __state->vnext & (RTE_HASH_BUCKET_ENTRIES - 1);
+                uint32_t position = __state->h->buckets[bidx].key_idx[next];
+                struct rte_hash_key *next_key;
+
+                /*
+                 * The test below is unlikely because this iterator is meant
+                 * to be used after a failed insert.
+                 */
+                if (unlikely(position == EMPTY_SLOT))
+                        goto next;
+
+                /* Get the entry in key table. */
+                next_key = (struct rte_hash_key *) (
+                        (char *)__state->h->key_store +
+                        position * __state->h->key_entry_size);
+                /* Return key and data. */
+                *key = next_key->key;
+                *data = next_key->pdata;
+
+next:
+                /* Increment iterator. */
+                __state->vnext++;
+
+                if (likely(position != EMPTY_SLOT))
+                        return position - 1;
+        }
+
+        return -ENOENT;
+}

I think we can make this API similar to 'rte_hash_iterate'. I suggest the following API signature:

int32_t
rte_hash_iterate_conflict_entries(const struct rte_hash *h, const void **key,
        void **data, hash_sig_t sig, uint32_t *next)

The goal of our interface is to support changing the underlying hash table algorithm without requiring changes in applications. As Yipeng Wang exemplified in the discussion of the first version of this patch, "in future, rte_hash may use three hash functions, or as I mentioned each bucket may have an additional linked list or even a second level hash table, or if the hopscotch hash replaces cuckoo hash as the new algorithm." These new algorithms may require more state than sig and next can efficiently provide in order to browse the conflicting entries.

I also suggest to change the API name to 'rte_hash_iterate_bucket_entries' - 'bucket' is a well understood term in the context of hash algorithms.

It's a matter of semantics here. rte_hash_iterate_conflict_entries() may cross more than one bucket. In fact, the first version of this patch tried to do exactly that, but it exposes the underlying algorithm. In addition, future algorithms may stretch what is being browsed even further.

Do we also need to have a 'rte_hash_iterate_conflict_entries_with_hash' API?

I may not have understood the question. We are already working with the hash (i.e. sig). Did you mean something else?

diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index f71ca9fbf..7ecb6a7eb 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -61,6 +61,11 @@ struct rte_hash_parameters {
 /** @internal A hash table structure. */
 struct rte_hash;
+/** @internal A hash table conflict iterator state structure. */
+struct rte_conflict_iterator_state {
+        uint8_t space[64];
+};
+

The size depends on the current size of the state, which is subject to change with the algorithm used. We chose a size that should be robust for any future underlying algorithm. Do you have a suggestion on how to go about it?

We chose a simple struct to enable applications to allocate a state as a local variable and avoid a memory allocation.

[ ]'s
Michel Machado
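To make the motivation concrete, a rough sketch of how an application would use the v2 interface after a failed insert. The exact prototype of the init function is an assumption based on this thread (it takes the table, the precomputed signature, and the state); only the iterate prototype above is taken from the patch.

#include <rte_hash.h>

static void
add_or_evict(const struct rte_hash *h, const void *new_key, void *new_data)
{
        /* The state lives on the caller's stack: no rte_malloc()/rte_free(). */
        struct rte_conflict_iterator_state state;
        const void *key, *data;
        int32_t pos;

        if (rte_hash_add_key_data(h, new_key, new_data) >= 0)
                return;

        /* The table is full for this key: walk the conflicting entries and
         * decide whether one of them should make room for new_key. */
        rte_hash_iterator_conflict_entries_init(h, rte_hash_hash(h, new_key),
                &state);
        while ((pos = rte_hash_iterate_conflict_entries(&state, &key,
                        &data)) >= 0) {
                /* Compare (key, data) at position pos against the new flow;
                 * if it is less important, delete it and retry the insert. */
        }
}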
Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries
On 08/17/2018 03:41 PM, Honnappa Nagarahalli wrote:

Do we also need to have a 'rte_hash_iterate_conflict_entries_with_hash' API?

I may not have understood the question. We are already working with the hash (i.e. sig). Did you mean something else?

Let me elaborate. For the API 'rte_hash_lookup', there are multiple variations such as 'rte_hash_lookup_with_hash', 'rte_hash_lookup_data', 'rte_hash_lookup_with_hash_data', etc. We do not need to create similar variations for the 'rte_hash_iterate_conflict_entries' API right now, but the naming of the API should be such that these variations can be created in the future.

So you mean that we should actually name rte_hash_iterator_conflict_entries_init() as rte_hash_iterator_conflict_entries_init_with_hash()? I'd be fine with this.

diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index f71ca9fbf..7ecb6a7eb 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -61,6 +61,11 @@ struct rte_hash_parameters {
 /** @internal A hash table structure. */
 struct rte_hash;
+/** @internal A hash table conflict iterator state structure. */
+struct rte_conflict_iterator_state {
+        uint8_t space[64];
+};
+

Needs aligning to cache line.

Ok.

The size depends on the current size of the state, which is subject to change with the algorithm used. We chose a size that should be robust for any future underlying algorithm. Do you have a suggestion on how to go about it? We chose a simple struct to enable applications to allocate a state as a local variable and avoid a memory allocation.

This looks fine after your explanation. The structure name can be changed to 'rte_iterator_state' so that it can be used in other iterator APIs too.

I like this suggestion. What about the name "rte_hash_iterator_state" to make it specific to the hash table?

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries
On 08/17/2018 03:41 PM, Honnappa Nagarahalli wrote:

Can you elaborate more on using 'struct rte_conflict_iterator_state' as the argument for the API? If the API signature is changed to

rte_hash_iterate_conflict_entries(const struct rte_hash *h, void **key,
        void **data, const hash_sig_t sig,
        struct rte_conflict_iterator_state *state)

it will be in line with the existing APIs. The contents of 'state' must be initialized to 0 for the first call. This will also avoid creating the 'rte_hash_iterator_conflict_entries_init' API.

Testing 'state' every time rte_hash_iterate_conflict_entries() is called, to find out if it's the first call of the iterator, would add a small but unnecessary overhead to rte_hash_iterate_conflict_entries() and put constraints on struct rte_conflict_iterator_state. Moreover, rte_hash_iterator_conflict_entries_init() enables one to easily add variations of the init function to initialize the state (e.g. using a key instead of a sig) and still use exactly the same iterator.

[ ]'s
Michel Machado
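To illustrate that argument with declarations: the v2 init starts from a signature, a future variant could start from a key, and both would feed exactly the same iterator. The exact v2 prototypes are not quoted here, so the signatures below are assumptions consistent with the thread, and the key-based variant is purely hypothetical.

/* Current init: the caller already has the precomputed signature. */
int32_t rte_hash_iterator_conflict_entries_init(const struct rte_hash *h,
        hash_sig_t sig, struct rte_conflict_iterator_state *state);

/* Hypothetical future variant: derive the state from a key instead. */
int32_t rte_hash_iterator_conflict_entries_init_with_key(
        const struct rte_hash *h, const void *key,
        struct rte_conflict_iterator_state *state);

/* One iterator serves every init variant. */
int32_t rte_hash_iterate_conflict_entries(
        struct rte_conflict_iterator_state *state,
        const void **key, const void **data);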
Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries
On 08/21/2018 01:10 AM, Honnappa Nagarahalli wrote:

On 08/17/2018 03:41 PM, Honnappa Nagarahalli wrote:

Can you elaborate more on using 'struct rte_conflict_iterator_state' as the argument for the API? If the API signature is changed to

rte_hash_iterate_conflict_entries(const struct rte_hash *h, void **key,
        void **data, const hash_sig_t sig,
        struct rte_conflict_iterator_state *state)

it will be in line with the existing APIs. The contents of 'state' must be initialized to 0 for the first call. This will also avoid creating the 'rte_hash_iterator_conflict_entries_init' API.

Testing 'state' every time rte_hash_iterate_conflict_entries() is called, to find out if it's the first call of the iterator, would add a small but unnecessary overhead to rte_hash_iterate_conflict_entries() and put constraints on struct rte_conflict_iterator_state. Moreover, rte_hash_iterator_conflict_entries_init() enables one to easily add variations of the init function to initialize the state (e.g. using a key instead of a sig) and still use exactly the same iterator.

IMO, this overhead will be trivial. Looking at the function 'rte_hash_iterate_conflict_entries', the check for '(__state->vnext < RTE_HASH_BUCKET_ENTRIES * 2)' already exists. If the primary/secondary bucket indices are calculated in the 'rte_hash_iterate_conflict_entries' API as well (the 'rte_hash_iterate' API does such calculations), storing them in the state can be avoided. I am wondering if it makes sense to benchmark with these changes and then take a decision?

We came up with the init function and struct rte_conflict_iterator_state in v2 to make the new iterator as future proof to a change of the underlying algorithm as possible. But going through your feedback, it seems to me that your top concern is to not deviate much from the current interface of rte_hash_iterate(). We are fine with pushing v3 using the interface you've suggested, which avoids the init function and struct rte_conflict_iterator_state:

int32_t rte_hash_iterate_conflict_entries_with_hash(const struct rte_hash *h,
        const void **key, void **data, hash_sig_t sig, uint32_t *next);

[ ]'s
Michel Machado
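For comparison with the v2 sketch earlier in the thread, a caller under the signature proposed above needs no init call; the only setup is zeroing next. The loop below is a sketch, not taken from any patch; h and sig are the table and the precomputed signature of the key whose insert failed.

#include <rte_hash.h>

static void
browse_conflicts(const struct rte_hash *h, hash_sig_t sig)
{
        const void *key;
        void *data;
        uint32_t next = 0;      /* must be 0 on the first call */
        int32_t pos;

        while ((pos = rte_hash_iterate_conflict_entries_with_hash(h, &key,
                        &data, sig, &next)) >= 0) {
                /* Inspect (key, data) at position pos and pick an eviction
                 * victim. */
        }
}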
Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries
On 08/23/2018 08:33 PM, Wang, Yipeng1 wrote:

I think with Honnappa's suggested "uint32_t *next", we may need a few tricks to make it work with the extra linked list. The performance may not be optimal, though, compared to your original approach. Is this important to your use case?

It is. We are developing a DDoS protection system, and have chosen DPDK because it was the fastest framework in the evaluations we considered. We need to find the conflicting entries when a critical flow table of our system is overloaded due to an ongoing attack, so the more efficiently we can evaluate the merits of an incoming flow against the conflicting flows already in the table, the higher the chances we find the flows that should be in the flow table.

We've compromised with Honnappa under the understanding that once the underlying algorithm changes, there would be a review of the interface, since even rte_hash_iterate() may be affected. I still think that the v2 we proposed is the best approach here because it isolates the interface from the underlying algorithm.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries
On 08/26/2018 11:12 PM, Honnappa Nagarahalli wrote:

On 08/23/2018 08:33 PM, Wang, Yipeng1 wrote:

I think with Honnappa's suggested "uint32_t *next", we may need a few tricks to make it work with the extra linked list. The performance may not be optimal, though, compared to your original approach. Is this important to your use case?

It is. We are developing a DDoS protection system, and have chosen DPDK because it was the fastest framework in the evaluations we considered. We need to find the conflicting entries when a critical flow table of our system is overloaded due to an ongoing attack, so the more efficiently we can evaluate the merits of an incoming flow against the conflicting flows already in the table, the higher the chances we find the flows that should be in the flow table.

We've compromised with Honnappa under the understanding that once the underlying algorithm changes, there would be a review of the interface, since even rte_hash_iterate() may be affected. I still think that the v2 we proposed is the best approach here because it isolates the interface from the underlying algorithm.

My only concern was keeping the interfaces consistent across APIs. I am fine with changing 'uint32_t *next' as long as we change the 'rte_hash_iterate' API as well.

We'll patch rte_hash_iterate() as well in v3.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
Hi Yipeng,

On 09/04/2018 02:55 PM, Wang, Yipeng1 wrote:

Do we need both the state and istate structs? struct rte_hash_iterator_state seems to not be doing much. How about we only have one "state" struct and just not expose the internals in the public API, similar to the rte_hash struct or the rte_member_setsum struct? Then the _init function would use rte_malloc to allocate the state, and a _free function would free it.

The purpose of having the state struct is to enable applications to allocate iterator states on their execution stack, or to embed iterator states in larger structs, to avoid an extra malloc()/free(). Do you foresee that the upcoming new underlying algorithm of hash tables will need to dynamically allocate iterator states?

[ ]'s
Michel Machado
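For example, the state can simply be embedded in whatever per-lcore scratch structure the application already keeps; the struct below is illustrative only, not part of any patch.

#include <rte_hash.h>
#include <rte_lcore.h>

/* Illustrative only: the iterator state is an ordinary member, so using
 * the iterator costs no rte_malloc()/rte_free() at all. */
struct lcore_scratch {
        struct rte_hash_iterator_state flow_it;
        /* ... other per-lcore working data ... */
};

static struct lcore_scratch scratch[RTE_MAX_LCORE];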
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
te *state, const void **key, void **data)

How about "rte_hash_dup_next()"? Also, please break the parameter list instead of having an empty first line.

We are preserving the naming convention used in DPDK (e.g. rte_hash_iterate()). We'll break the parameter list to avoid the empty first line.

diff --git a/lib/librte_hash/rte_hash_version.map b/lib/librte_hash/rte_hash_version.map
index e216ac8e2..301d4638c 100644
--- a/lib/librte_hash/rte_hash_version.map
+++ b/lib/librte_hash/rte_hash_version.map
@@ -24,6 +24,7 @@ DPDK_2.1 {
        rte_hash_add_key_data;
        rte_hash_add_key_with_hash_data;
+       rte_hash_iterator_init;
        rte_hash_iterate;
        rte_hash_lookup_bulk_data;
        rte_hash_lookup_data;
@@ -53,3 +54,10 @@ DPDK_18.08 {
        rte_hash_count;
 } DPDK_16.07;
+
+EXPERIMENTAL {
+       global:
+
+       rte_hash_iterator_conflict_entries_init_with_hash;
+       rte_hash_iterate_conflict_entries;
+};

A blank line may be missing here.

Do you mean a blank line before "};"? Thank you for the careful review.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
Hi Yipeng,

On 09/04/2018 03:51 PM, Wang, Yipeng1 wrote:

Hmm, I guess my comment is about code readability. If we don't need the extra state, that would be great.

Notice that applications only see the public, opaque state, and the private versions are scoped to the C file where they are needed.

I think "rte_hash" is defined as an internal data structure but exposes the type in the public header. Would this work?

Exposing the private fields would bind the interface to the current implementation of the hash table. In the way we are proposing, one should be able to replace the underlying algorithm without touching the header files that applications use. But, yes, your solution would enable applications to allocate iterator states as local variables as well.

I propose the malloc inside the function mostly because I think it is cleaner to the user. But your argument is valid. Depending on the use case, I think it is OK.

I don't know how other applications will use this iterator, but we use it when our application is already overloaded, so avoiding an expensive operation like malloc() is a win.

Another comment: you put total_entries in the state; is it for the performance of rte_hash_iterate?

We are saving one integer multiplication per call of rte_hash_iterate(). It's not a big deal, but since there's room in the state variable, we thought this would be a good idea because the product grows with the size of the table. We didn't actually measure the effect of this decision.

If you use it to iterate conflict entries, especially if you reuse the same "state" struct and init it again and again for different keys, would this slow down the performance for your specific use case?

Notice that the field total_entries only exists for rte_hash_iterate(). But even if total_entries were in the state of rte_hash_iterate_conflict_entries(), it would still save on the multiplication as long as rte_hash_iterate_conflict_entries() is called at least twice: calling rte_hash_iterate_conflict_entries() once evens out, and calling it more times adds further savings. As a side note, in our application, whenever an iterator of conflicting entries is initialized, we call rte_hash_iterate_conflict_entries() at least once.

Also, iterate_conflict_entries may need reader lock protection.

We are going to add the reader lock protection. Thanks.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
Hi Yipeng,

On 09/04/2018 04:57 PM, Wang, Yipeng1 wrote:

-----Original Message-----
From: Michel Machado [mailto:mic...@digirati.com.br]

Exposing the private fields would bind the interface to the current implementation of the hash table. In the way we are proposing, one should be able to replace the underlying algorithm without touching the header files that applications use. But, yes, your solution would enable applications to allocate iterator states as local variables as well.

[Wang, Yipeng] I didn't mean to expose the private fields, but only the type. For example, rte_hash does not expose its private fields to users; one can change the fields without changing the API.

The fact that struct rte_hash does not expose its private fields but only its type to applications means that a compiler cannot find out the byte length of struct rte_hash using only the header rte_hash.h. Thus, an application cannot allocate memory on its own (e.g. as a local variable) for a struct rte_hash. An application can, however, have a pointer to a struct rte_hash, since the byte length of a pointer only depends on the architecture of the machine. This is the motivation behind having struct rte_hash_iterator_state in rte_hash.h only hold an array of bytes.

There are good reasons to implement struct rte_hash as it is. For example, struct rte_hash can change its byte length between versions of DPDK even if applications are dynamically linked to DPDK and not recompiled. Moreover, a hash table is unlikely to be as short-lived as an iterator.

[ ]'s
Michel Machado
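A small illustration of the difference, using only what the public header (with this patch applied) declares:

#include <rte_hash.h>

struct rte_hash *hash_ptr;              /* fine: a pointer needs no size */
/* struct rte_hash hash_obj; */         /* would not compile: rte_hash.h only
                                         * forward-declares the type, so its
                                         * byte length is unknown here */
struct rte_hash_iterator_state it_state; /* fine: the header fixes the size
                                          * to an opaque 64-byte array */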
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
>key_entry_size);
        /* Return key and data */
        *key = next_key->key;
        *data = next_key->pdata;
-       __hash_rw_reader_unlock(h);
+       __hash_rw_reader_unlock(__state->h);
        /* Increment iterator */
-       (*next)++;
+       __state->next++;

This comment is not related to this change, it is better to place this inside the lock.

Even though __state->next does not depend on the lock?

        return position - 1;
 }
+
+/* istate stands for internal state. */
+struct rte_hash_iterator_conflict_entries_istate {
+       const struct rte_hash *h;

This can be moved outside of this structure.

Discussed earlier.

+       uint32_t vnext;
+       uint32_t primary_bidx;
+       uint32_t secondary_bidx;
+};
+
+int32_t __rte_experimental
+rte_hash_iterator_conflict_entries_init_with_hash(const struct rte_hash *h,
+       hash_sig_t sig, struct rte_hash_iterator_state *state)
+{
+       struct rte_hash_iterator_conflict_entries_istate *__state;
+
+       RETURN_IF_TRUE(((h == NULL) || (state == NULL)), -EINVAL);
+
+       __state = (struct rte_hash_iterator_conflict_entries_istate *)state;
+       __state->h = h;
+       __state->vnext = 0;
+
+       /* Get the primary bucket index given the precomputed hash value. */
+       __state->primary_bidx = sig & h->bucket_bitmask;
+       /* Get the secondary bucket index given the precomputed hash value. */
+       __state->secondary_bidx =
+               rte_hash_secondary_hash(sig) & h->bucket_bitmask;
+
+       return 0;
+}

IMO, as mentioned above, it is possible to avoid creating this API.

Discussed earlier.

+
+int32_t __rte_experimental
+rte_hash_iterate_conflict_entries(
+       struct rte_hash_iterator_state *state, const void **key, void **data)

Signature of this API can be changed as follows:

rte_hash_iterate_conflict_entries(struct rte_hash *h, const void **key,
        void **data, struct rte_hash_iterator_state *state)

Discussed earlier.

+{
+       struct rte_hash_iterator_conflict_entries_istate *__state;
+
+       RETURN_IF_TRUE(((state == NULL) || (key == NULL) ||
+               (data == NULL)), -EINVAL);
+
+       __state = (struct rte_hash_iterator_conflict_entries_istate *)state;
+
+       while (__state->vnext < RTE_HASH_BUCKET_ENTRIES * 2) {
+               uint32_t bidx = __state->vnext < RTE_HASH_BUCKET_ENTRIES ?
+                       __state->primary_bidx : __state->secondary_bidx;
+               uint32_t next = __state->vnext & (RTE_HASH_BUCKET_ENTRIES - 1);

take the reader lock before reading bucket entry

Thanks for pointing this out. We are going to do so. The lock came in as we go through the versions of this patch.

+               uint32_t position = __state->h->buckets[bidx].key_idx[next];
+               struct rte_hash_key *next_key;
+
+               /* Increment iterator. */
+               __state->vnext++;
+
+               /*
+                * The test below is unlikely because this iterator is meant
+                * to be used after a failed insert.
+                */
+               if (unlikely(position == EMPTY_SLOT))
+                       continue;
+
+               /* Get the entry in key table. */
+               next_key = (struct rte_hash_key *) (
+                       (char *)__state->h->key_store +
+                       position * __state->h->key_entry_size);
+               /* Return key and data. */
+               *key = next_key->key;
+               *data = next_key->pdata;

give the reader lock

We'll do so.

+
+               return position - 1;
+       }
+
+       return -ENOENT;
+}

diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index 9e7d9315f..fdb01023e 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
 #include
 #include
+#include
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -64,6 +66,16 @@ struct rte_hash_parameters {
 /** @internal A hash table structure. */
 struct rte_hash;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @internal A hash table iterator state structure.
+ */
+struct rte_hash_iterator_state {
+       uint8_t space[64];

I would call this 'state'. 64 can be replaced by 'RTE_CACHE_LINE_SIZE'.

Okay.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
RIES;
-       idx = *next % RTE_HASH_BUCKET_ENTRIES;
+       bucket_idx = __state->next / RTE_HASH_BUCKET_ENTRIES;
+       idx = __state->next % RTE_HASH_BUCKET_ENTRIES;
        }
-       __hash_rw_reader_lock(h);
+       __hash_rw_reader_lock(__state->h);
        /* Get position of entry in key table */
-       position = h->buckets[bucket_idx].key_idx[idx];
-       next_key = (struct rte_hash_key *) ((char *)h->key_store +
-               position * h->key_entry_size);
+       position = __state->h->buckets[bucket_idx].key_idx[idx];
+       next_key = (struct rte_hash_key *) ((char *)__state->h->key_store +
+               position * __state->h->key_entry_size);
        /* Return key and data */
        *key = next_key->key;
        *data = next_key->pdata;
-       __hash_rw_reader_unlock(h);
+       __hash_rw_reader_unlock(__state->h);
        /* Increment iterator */
-       (*next)++;
+       __state->next++;

This comment is not related to this change, it is better to place this inside the lock.

Even though __state->next does not depend on the lock?

It depends on whether this API needs to be multi-thread safe. Interestingly, the documentation does not say it is multi-thread safe. If it has to be multi-thread safe, then the state also needs to be protected. For example: what happens if the user uses a global variable for the state?

If an application needs to share an iterator state between threads, it has to have a synchronization mechanism for that, as it would for any other shared variable. The lock above allows applications to share a hash table between threads; it has no semantics beyond that.

diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index 9e7d9315f..fdb01023e 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
 #include
 #include
+#include
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -64,6 +66,16 @@ struct rte_hash_parameters {
 /** @internal A hash table structure. */
 struct rte_hash;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @internal A hash table iterator state structure.
+ */
+struct rte_hash_iterator_state {
+       uint8_t space[64];

I would call this 'state'. 64 can be replaced by 'RTE_CACHE_LINE_SIZE'.

Okay.

I think we should not replace 64 with RTE_CACHE_LINE_SIZE because the ABI would change based on the architecture for which it's compiled.

[ ]'s
Michel Machado
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
On 09/05/2018 04:27 PM, Wang, Yipeng1 wrote:

Hmm, I see, it falls back to my original thought to have the malloc inside the init function. Thanks for the explanation. :) So I guess with your implementation, in future, if we change the internal state to be larger, the ABI will be broken.

If that happens, yes, the ABI would need to change again. But this concern is overblown for two reasons.

First, this event is unlikely to happen because struct rte_hash_iterator_state already allocates 64 bytes while struct rte_hash_iterator_istate and struct rte_hash_iterator_conflict_entries_istate consume 16 and 20 bytes, respectively. Thus, the complexity of the underlying hash algorithm would need to grow substantially to force the necessary state of these iterators to grow more than 4x and 3x, respectively. This is unlikely to happen, and, if it does, it would likely break the ABI somewhere else and have a high impact on applications anyway.

Second, even if the unlikely event happens, all one would need to do is to increase the size of struct rte_hash_iterator_state, mark the new API as a new version, and applications would be ready for the new ABI just by recompiling.

BTW, this patch set also changes the API, so proper notice is needed. People more familiar with API/ABI change policies may be able to help here.

We'd be happy to get feedback on this aspect.

Just to confirm, is there any way, like I said, for your application to have some long-lived states and reuse them throughout the application so that you don't have to have short-lived ones on the stack?

Two things would need to happen for this to be possible. The init functions would need to accept previously allocated iterator states, that is, the init function would act as a reset of the state when acting on a previously allocated state. And applications would now need to carry these pre-allocated states around to avoid a malloc. In other words, we'd increase the complexity of the API.

To emphasize that the cost of a malloc is not negligible, rte_malloc() needs to take a spinlock (see heap_alloc_on_socket()), do its thing to allocate memory, and, if the first attempt fails, try to allocate the memory on other sockets (see the end of malloc_heap_alloc()). For an iterator that goes through the whole hash table, this cost may be okay, but for an iterator that goes through a couple of entries, this cost is a lot to add.

This memory allocation concern is not new. Function rte_pktmbuf_read(), for example, lets applications pass buffers, which are often allocated on the execution stack, to avoid the malloc cost.

[ ]'s
Michel Machado
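For reference, a sketch of the rte_pktmbuf_read() pattern mentioned above: the caller supplies the buffer, typically on its stack, and no allocation happens inside the call. The helper name below is illustrative.

#include <stdint.h>
#include <rte_mbuf.h>

/* Peek at the first 14 bytes (an Ethernet header) of a packet. The
 * scratch buffer comes from the caller; rte_pktmbuf_read() returns a
 * pointer straight into the mbuf when those bytes are contiguous and
 * only copies them into the buffer otherwise. */
static const void *
peek_l2_header(const struct rte_mbuf *m, uint8_t buf[14])
{
        return rte_pktmbuf_read(m, 0, 14, buf);
}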
Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries
On 09/12/2018 04:37 PM, Honnappa Nagarahalli wrote:

+int32_t
+rte_hash_iterator_init(const struct rte_hash *h,
+       struct rte_hash_iterator_state *state)
+{
+       struct rte_hash_iterator_istate *__state;

'__state' can be replaced by 's'.

+
+       RETURN_IF_TRUE(((h == NULL) || (state == NULL)), -EINVAL);
+
+       __state = (struct rte_hash_iterator_istate *)state;
+       __state->h = h;
+       __state->next = 0;
+       __state->total_entries = h->num_buckets * RTE_HASH_BUCKET_ENTRIES;
+
+       return 0;
+}

IMO, creating this API can be avoided if the initialization is handled in the 'rte_hash_iterate' function. The cost of doing this is very trivial (one extra 'if' statement) in the 'rte_hash_iterate' function. It will help keep the number of APIs to a minimum.

Applications would have to initialize struct rte_hash_iterator_state *state before calling rte_hash_iterate() anyway. Why not initialize the fields of a state only once?

My concern is about creating another API for every iterator API. You have a valid point on saving cycles, as this API applies to the data plane. Have you done any performance benchmarking with and without this API? Maybe we can guide our decision based on that.

It's not just about creating one init function for each iterator, because an iterator may have a couple of init functions. For example, someone may eventually find it useful to add another init function for the conflicting-entry iterator that we are advocating in this patch. A possibility would be for this new init function to use the key of the new entry instead of its signature to initialize the state, similar to what is already done in the rte_hash_lookup*() functions. In spite of possibly having multiple init functions, there will be a single iterator function.

About the performance benchmarking, the current API only requires applications to initialize a single 32-bit integer, but with the adoption of a struct for the state, the initialization grows to 64 bytes.

As my tests showed, I do not see any impact from this.

Ok, we are going to eliminate the init functions in v4.

diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index 9e7d9315f..fdb01023e 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
 #include
 #include
+#include
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -64,6 +66,16 @@ struct rte_hash_parameters {
 /** @internal A hash table structure. */
 struct rte_hash;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @internal A hash table iterator state structure.
+ */
+struct rte_hash_iterator_state {
+       uint8_t space[64];

I would call this 'state'. 64 can be replaced by 'RTE_CACHE_LINE_SIZE'.

Okay.

I think we should not replace 64 with RTE_CACHE_LINE_SIZE because the ABI would change based on the architecture for which it's compiled.

Ok. Maybe have a #define for 64?

Ok.

[ ]'s
Michel Machado
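One possible shape of that agreed-upon #define; the macro name is an assumption, not taken from a later revision of the patch:

#define RTE_HASH_ITERATOR_STATE_SIZE 64 /* hypothetical name for the fixed size */

struct rte_hash_iterator_state {
        uint8_t space[RTE_HASH_ITERATOR_STATE_SIZE];
};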
Re: [dpdk-dev] [PATCH 3/3] examples/bpf: remove duplicate mbuf definition
Since rte_mbuf_core.h requires generic/rte_atomic.h, shouldn't rte_mbuf_core.h include generic/rte_atomic.h instead of forcing users of rte_mbuf_core.h to be aware of this dependency?

[ ]'s
Michel Machado

On 8/16/19 8:53 AM, Konstantin Ananyev wrote:

Get rid of duplicate definitions for rte_mbuf and related macros. Include rte_mbuf_core.h instead.

Signed-off-by: Konstantin Ananyev

---
 examples/bpf/mbuf.h | 605
 examples/bpf/t2.c   |   5 +-
 examples/bpf/t3.c   |   3 +-
 3 files changed, 5 insertions(+), 608 deletions(-)
 delete mode 100644 examples/bpf/mbuf.h
Re: [dpdk-dev] [PATCH 1/3] eal: move CACHE and IOVA related definitions
Acked-by: Michel Machado

[ ]'s
Michel Machado

On 8/16/19 8:53 AM, Konstantin Ananyev wrote:

Right now RTE_CACHE_ and IOVA definitions are located inside rte_memory.h. That might cause unwanted inclusions of arch/os specific header files. See [1] for a particular problem example. Probably the simplest way to deal with such problems is to move these definitions into rte_common.h. Note that this move doesn't introduce any change in functionality.

[1] https://bugs.dpdk.org/show_bug.cgi?id=321

Suggested-by: Vipin Varghese
Signed-off-by: Konstantin Ananyev

---
 lib/librte_eal/common/include/rte_common.h | 44 ++
 lib/librte_eal/common/include/rte_memory.h | 38 ---
 2 files changed, 44 insertions(+), 38 deletions(-)
Re: [dpdk-dev] [PATCH 2/3] mbuf: move mbuf definition into a separate file
Acked-by: Michel Machado

[ ]'s
Michel Machado

On 8/16/19 8:53 AM, Konstantin Ananyev wrote:

Right now inclusion of the rte_mbuf.h header can cause inclusion of some arch/os specific headers. That prevents it from being included directly by some non-DPDK (but related) entities: KNI, BPF programs, etc. To overcome that problem, usually a separate definition of the rte_mbuf structure is created within these entities. That approach has a lot of drawbacks: code duplication, error prone, etc. This patch moves the rte_mbuf structure definition (and some related macros) into a separate file that can be included by both rte_mbuf.h and other non-DPDK entities. Note that it doesn't introduce any change for current DPDK code.

Signed-off-by: Konstantin Ananyev

---
 lib/librte_mbuf/Makefile        |   5 +-
 lib/librte_mbuf/meson.build     |   3 +-
 lib/librte_mbuf/rte_mbuf.h      | 738 +
 lib/librte_mbuf/rte_mbuf_core.h | 793
 4 files changed, 800 insertions(+), 739 deletions(-)
 create mode 100644 lib/librte_mbuf/rte_mbuf_core.h
Re: [dpdk-dev] [PATCH v2 1/3] eal: move CACHE and IOVA related definitions
Acked-by: Michel Machado

[ ]'s
Michel Machado

On 9/27/19 9:50 AM, Konstantin Ananyev wrote:

Right now RTE_CACHE_ and IOVA definitions are located inside rte_memory.h. That might cause unwanted inclusions of arch/os specific header files. See [1] for a particular problem example. Probably the simplest way to deal with such problems is to move these definitions into rte_common.h. Note that this move doesn't introduce any change in functionality.

[1] https://bugs.dpdk.org/show_bug.cgi?id=321

Suggested-by: Vipin Varghese
Signed-off-by: Konstantin Ananyev

---
 lib/librte_eal/common/include/rte_common.h | 44 ++
 lib/librte_eal/common/include/rte_memory.h | 38 ---
 2 files changed, 44 insertions(+), 38 deletions(-)
Re: [dpdk-dev] [PATCH v2 2/3] mbuf: move mbuf definition into a separate file
Acked-by: Michel Machado

[ ]'s
Michel Machado

On 9/27/19 9:50 AM, Konstantin Ananyev wrote:

Right now inclusion of the rte_mbuf.h header can cause inclusion of some arch/os specific headers. That prevents it from being included directly by some non-DPDK (but related) entities: KNI, BPF programs, etc. To overcome that problem, usually a separate definition of the rte_mbuf structure is created within these entities. That approach has a lot of drawbacks: code duplication, error prone, etc. This patch moves the rte_mbuf structure definition (and some related macros) into a separate file that can be included by both rte_mbuf.h and other non-DPDK entities. Note that it doesn't introduce any change for current DPDK code.

Signed-off-by: Konstantin Ananyev

---
 lib/librte_mbuf/Makefile        |   5 +-
 lib/librte_mbuf/meson.build     |   3 +-
 lib/librte_mbuf/rte_mbuf.h      | 738 +
 lib/librte_mbuf/rte_mbuf_core.h | 793
 4 files changed, 800 insertions(+), 739 deletions(-)
 create mode 100644 lib/librte_mbuf/rte_mbuf_core.h
Re: [dpdk-dev] [PATCH v2 3/3] examples/bpf: remove duplicate mbuf definition
Acked-by: Michel Machado [ ]'s Michel Machado On 9/27/19 9:50 AM, Konstantin Ananyev wrote: Get rid of duplicate definitions for rte_mbuf and related macros. Include rte_mbuf_core.h instead. Signed-off-by: Konstantin Ananyev --- examples/bpf/mbuf.h | 605 examples/bpf/t2.c | 4 +- examples/bpf/t3.c | 2 +- 3 files changed, 3 insertions(+), 608 deletions(-) delete mode 100644 examples/bpf/mbuf.h diff --git a/examples/bpf/mbuf.h b/examples/bpf/mbuf.h deleted file mode 100644 index e41c3d26e..0 --- a/examples/bpf/mbuf.h +++ /dev/null @@ -1,605 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2014 Intel Corporation. - * Copyright 2014 6WIND S.A. - */ - -/* - * Snipper from dpdk.org rte_mbuf.h. - * used to provide BPF programs information about rte_mbuf layout. - */ - -#ifndef _MBUF_H_ -#define _MBUF_H_ - -#include -#include - -#ifdef __cplusplus -extern "C" { -#endif - -/* - * Packet Offload Features Flags. It also carry packet type information. - * Critical resources. Both rx/tx shared these bits. Be cautious on any change - * - * - RX flags start at bit position zero, and get added to the left of previous - * flags. - * - The most-significant 3 bits are reserved for generic mbuf flags - * - TX flags therefore start at bit position 60 (i.e. 63-3), and new flags get - * added to the right of the previously defined flags i.e. they should count - * downwards, not upwards. - * - * Keep these flags synchronized with rte_get_rx_ol_flag_name() and - * rte_get_tx_ol_flag_name(). - */ - -/** - * RX packet is a 802.1q VLAN packet. This flag was set by PMDs when - * the packet is recognized as a VLAN, but the behavior between PMDs - * was not the same. This flag is kept for some time to avoid breaking - * applications and should be replaced by PKT_RX_VLAN_STRIPPED. - */ -#define PKT_RX_VLAN_PKT (1ULL << 0) - -#define PKT_RX_RSS_HASH (1ULL << 1) -/**< RX packet with RSS hash result. */ -#define PKT_RX_FDIR (1ULL << 2) -/**< RX packet with FDIR match indicate. */ - -/** - * Deprecated. - * Checking this flag alone is deprecated: check the 2 bits of - * PKT_RX_L4_CKSUM_MASK. - * This flag was set when the L4 checksum of a packet was detected as - * wrong by the hardware. - */ -#define PKT_RX_L4_CKSUM_BAD (1ULL << 3) - -/** - * Deprecated. - * Checking this flag alone is deprecated: check the 2 bits of - * PKT_RX_IP_CKSUM_MASK. - * This flag was set when the IP checksum of a packet was detected as - * wrong by the hardware. - */ -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) - -#define PKT_RX_EIP_CKSUM_BAD (1ULL << 5) -/**< External IP header checksum error. */ - -/** - * A vlan has been stripped by the hardware and its tci is saved in - * mbuf->vlan_tci. This can only happen if vlan stripping is enabled - * in the RX configuration of the PMD. - */ -#define PKT_RX_VLAN_STRIPPED (1ULL << 6) - -/** - * Mask of bits used to determine the status of RX IP checksum. - * - PKT_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum - * - PKT_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong - * - PKT_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid - * - PKT_RX_IP_CKSUM_NONE: the IP checksum is not correct in the packet - * data, but the integrity of the IP header is verified. - */ -#define PKT_RX_IP_CKSUM_MASK ((1ULL << 4) | (1ULL << 7)) - -#define PKT_RX_IP_CKSUM_UNKNOWN 0 -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) -#define PKT_RX_IP_CKSUM_GOOD(1ULL << 7) -#define PKT_RX_IP_CKSUM_NONE((1ULL << 4) | (1ULL << 7)) - -/** - * Mask of bits used to determine the status of RX L4 checksum. 
- * - PKT_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum - * - PKT_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong - * - PKT_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid - * - PKT_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet - * data, but the integrity of the L4 data is verified. - */ -#define PKT_RX_L4_CKSUM_MASK ((1ULL << 3) | (1ULL << 8)) - -#define PKT_RX_L4_CKSUM_UNKNOWN 0 -#define PKT_RX_L4_CKSUM_BAD (1ULL << 3) -#define PKT_RX_L4_CKSUM_GOOD(1ULL << 8) -#define PKT_RX_L4_CKSUM_NONE((1ULL << 3) | (1ULL << 8)) - -#define PKT_RX_IEEE1588_PTP (1ULL << 9) -/**< RX IEEE1588 L2 Ethernet PT Packet. */ -#define PKT_RX_IEEE1588_TMST (1ULL << 10) -/**< RX IEEE1588 L2/L4 timestamped packet.*/ -#define PKT_RX_FDIR_ID (1ULL << 13) -/**< FD id reported if FDIR match. */ -#define PKT_RX_FDIR_FLX (1ULL << 14) -/**< Flexible bytes reported if FDIR match. */ - -/** - * The 2 vlans have been stripped by the hardware and their tci are - * saved in mbuf->vlan_tci (inner) and mbuf->vlan_tci_outer (outer). - * This can only happen if vlan stripping is enable
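For illustration of the two-bit status encoding documented above (which supersedes the deprecated single-bit flags), here is a minimal sketch of how an application might classify the L4 checksum status of a received mbuf. The helper name and the choice to accept PKT_RX_L4_CKSUM_NONE are assumptions of this sketch, not DPDK policy.

#include <rte_mbuf.h>

/* Return 1 when the L4 checksum can be trusted, 0 otherwise. */
static inline int
l4_cksum_ok(const struct rte_mbuf *m)
{
	switch (m->ol_flags & PKT_RX_L4_CKSUM_MASK) {
	case PKT_RX_L4_CKSUM_GOOD:
		return 1;
	case PKT_RX_L4_CKSUM_NONE:
		/* Checksum not correct in the data, but L4 integrity verified. */
		return 1;
	case PKT_RX_L4_CKSUM_BAD:
	default: /* PKT_RX_L4_CKSUM_UNKNOWN */
		return 0;
	}
}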
Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data
On 10/14/20 7:57 PM, Honnappa Nagarahalli wrote: On 13/10/2020 18:46, Michel Machado wrote: On 10/13/20 11:41 AM, Medvedkin, Vladimir wrote: Hi Michel, Could you please describe a condition when LPM gets inconsistent? As I can see if there is no free tbl8 it will return -ENOSPC. Consider this simple example, we need to add the following two prefixes with different next hops: 10.99.0.0/16, 18.99.99.128/25. If the LPM table is out of tbl8s, the second prefix is not added and Gatekeeper will make decisions in violation of the policy. The data structure of the LPM table is consistent, but its content inconsistent with the policy. max_rules and number_tbl8s in 'struct rte_lpm' contain the config information. These 2 fields do not change based on the routes added and do not indicate the amount of space left. So, you cannot use this information to decide if there is enough space to add more routes. We are aware that those fields hold the config information not a status of the LPM table. Before updating a LPM table that holds network prefixes derived from threat intelligence, we compute the minimum values for max_rules and number_tbl8s. Here is an example of how we do it: https://github.com/AltraMayor/gatekeeper/blob/95d1d6e8201861a0d0c698bfd06ad606674f1e07/lua/examples/policy.lua#L135-L166 Once these minimum values are available, we get the parameters of the LPM table to be updated and check if we can update it, or have to recreate it. Aha, thanks. So do I understand correctly that you need to add a set of routes atomically (either the entire set is installed or nothing)? Yes. If so, then I would suggest having 2 lpm and switching them atomically after a successful addition. As for now, even if you have enough tbl8's, routes are installed non atomically, i.e. there will be a time gap between adding two routes, so in this time interval the table will be inconsistent with the policy. Also, if new lpm algorithms are added to the DPDK, they won't have such a thing as tbl8. Our code already deals with synchronization. If the application code already deals with synchronization, is it possible to revert back (i.e. delete the routes that got added so far) when the addition of the route-set fails? The way the code is structured, this would require a significant rewrite because the code assumes that it will succeed since the capacity of the LPM tables was already checked. On 13/10/2020 15:58, Michel Machado wrote: Hi Kevin, We do need fields max_rules and number_tbl8s of struct rte_lpm, so the removal would force us to have another patch to our local copy of DPDK. We'd rather avoid this new local patch because we wish to eventually be in sync with the stock DPDK. Those fields are needed in Gatekeeper because we found a condition in an ongoing deployment in which the entries of some LPM tables may suddenly change a lot to reflect policy changes. To avoid getting into a state in which the LPM table is inconsistent because it cannot fit all the new entries, we compute the needed parameters to support the new entries, and compare with the current parameters. If the current table doesn't fit everything, we have to replace it with a new LPM table. If there were a way to obtain the struct rte_lpm_config of a given LPM table, it would cleanly address our need. We have the same need in IPv6 and have a local patch to work around it (see https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). 
I do not see why such an API is not possible; we could add one API that returns max_rules and number_tbl8s (essentially, the config that was passed to the rte_lpm_create API). But is there a possibility to store that info in the application, since that data was passed to rte_lpm from the application? A suggestion for what this API could look like:

  void rte_lpm_get_config(const struct rte_lpm *lpm, struct rte_lpm_config *config);
  void rte_lpm6_get_config(const struct rte_lpm6 *lpm, struct rte_lpm6_config *config);

If the final choice is for not supporting a way to retrieve the config information on the API, we'll look for a place to keep a copy of the parameters in our code.
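For clarity, a sketch of how the capacity check discussed above could use the proposed getter. rte_lpm_get_config() is only the suggestion made in this message and does not exist in DPDK; the helper name and the "required" parameters are illustrative.

#include <stdbool.h>
#include <rte_lpm.h>

/* Proposed API (not in DPDK): copy the creation-time config out of the table. */
void rte_lpm_get_config(const struct rte_lpm *lpm, struct rte_lpm_config *config);

/* Would the existing table accommodate a policy that needs the given limits? */
static bool
lpm_can_hold(const struct rte_lpm *lpm, uint32_t req_max_rules,
		uint32_t req_number_tbl8s)
{
	struct rte_lpm_config cur;

	rte_lpm_get_config(lpm, &cur);
	/* If this returns false, the caller recreates the table with larger limits. */
	return cur.max_rules >= req_max_rules &&
			cur.number_tbl8s >= req_number_tbl8s;
}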
Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data
If the final choice is for not supporting a way to retrieve the config information on the API, we'll look for a place to keep a copy of the parameters in our code. IMO, this is not a performance critical path and it is not a difficult solution to store these values in the application. My suggestion is to skip adding the API and store the values in the application. Vladimir, what's your opinion? Agree. Global vars or part of a global configuration could be used here. Thank you. I think we are fine to go ahead with merging this patch. Michel, I know it's a bit of churn so not zero effort, but is that solution workable for you? The change in DPDK would not be backported, so gatekeeper code would only need an update on updating to use DPDK 20.11. It's workable. Thank you for organizing this discussion, Kevin. P.S. As you are using DPDK 19.08, we should talk about DPDK LTS but let's leave that for another thread :-) Sure.
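A minimal sketch of the agreed alternative: keep a copy of the creation parameters next to the table handle in the application. The wrapper struct and function names are application-side inventions, not DPDK API.

#include <rte_lpm.h>

struct app_lpm {
	struct rte_lpm *lpm;
	struct rte_lpm_config config;	/* copy of what was passed at creation */
};

static int
app_lpm_create(struct app_lpm *a, const char *name, int socket_id,
		const struct rte_lpm_config *config)
{
	a->lpm = rte_lpm_create(name, socket_id, config);
	if (a->lpm == NULL)
		return -1;
	a->config = *config;	/* max_rules and number_tbl8s stay queryable */
	return 0;
}

With such a wrapper, the capacity check described earlier reduces to comparing the stored config against the newly computed requirements, with no access to struct rte_lpm internals.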
Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data
On 10/13/20 11:41 AM, Medvedkin, Vladimir wrote: Hi Michel, Could you please describe a condition when LPM gets inconsistent? As I can see if there is no free tbl8 it will return -ENOSPC. Consider this simple example, we need to add the following two prefixes with different next hops: 10.99.0.0/16, 18.99.99.128/25. If the LPM table is out of tbl8s, the second prefix is not added and Gatekeeper will make decisions in violation of the policy. The data structure of the LPM table is consistent, but its content inconsistent with the policy. We minimize the need of replacing a LPM table by allocating LPM tables with the double of what we need (see example here https://github.com/AltraMayor/gatekeeper/blob/95d1d6e8201861a0d0c698bfd06ad606674f1e07/lua/examples/policy.lua#L172-L183), but the code must be ready for unexpected needs that may arise in production. On 13/10/2020 15:58, Michel Machado wrote: Hi Kevin, We do need fields max_rules and number_tbl8s of struct rte_lpm, so the removal would force us to have another patch to our local copy of DPDK. We'd rather avoid this new local patch because we wish to eventually be in sync with the stock DPDK. Those fields are needed in Gatekeeper because we found a condition in an ongoing deployment in which the entries of some LPM tables may suddenly change a lot to reflect policy changes. To avoid getting into a state in which the LPM table is inconsistent because it cannot fit all the new entries, we compute the needed parameters to support the new entries, and compare with the current parameters. If the current table doesn't fit everything, we have to replace it with a new LPM table. If there were a way to obtain the struct rte_lpm_config of a given LPM table, it would cleanly address our need. We have the same need in IPv6 and have a local patch to work around it (see https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). Thus, an IPv4 and IPv6 solution would be best. PS: I've added Qiaobin Fu, another Gatekeeper maintainer, to this disscussion. [ ]'s Michel Machado On 10/13/20 9:53 AM, Kevin Traynor wrote: Hi Gatekeeper maintainers (I think), fyi - there is a proposal to remove some members of a struct in DPDK LPM API that Gatekeeper is using [1]. It would be only from DPDK 20.11 but as it's an LTS I guess it would probably hit Debian in a few months. The full thread is here: http://inbox.dpdk.org/dev/20200907081518.46350-1-ruifeng.w...@arm.com/ Maybe you can take a look and tell us if they are needed in Gatekeeper or you can workaround it? thanks, Kevin. [1] https://github.com/AltraMayor/gatekeeper/blob/master/gt/lua_lpm.c#L235-L248 On 09/10/2020 07:54, Ruifeng Wang wrote: -Original Message- From: Kevin Traynor Sent: Wednesday, September 30, 2020 4:46 PM To: Ruifeng Wang ; Medvedkin, Vladimir ; Bruce Richardson Cc: dev@dpdk.org; Honnappa Nagarahalli ; nd Subject: Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data On 16/09/2020 04:17, Ruifeng Wang wrote: -Original Message- From: Medvedkin, Vladimir Sent: Wednesday, September 16, 2020 12:28 AM To: Bruce Richardson ; Ruifeng Wang Cc: dev@dpdk.org; Honnappa Nagarahalli ; nd Subject: Re: [PATCH 2/2] lpm: hide internal data Hi Ruifeng, On 15/09/2020 17:02, Bruce Richardson wrote: On Mon, Sep 07, 2020 at 04:15:17PM +0800, Ruifeng Wang wrote: Fields except tbl24 and tbl8 in rte_lpm structure have no need to be exposed to the user. Hide the unneeded exposure of structure fields for better ABI maintainability. 
Suggested-by: David Marchand Signed-off-by: Ruifeng Wang Reviewed-by: Phil Yang --- lib/librte_lpm/rte_lpm.c | 152 +++--- - lib/librte_lpm/rte_lpm.h | 7 -- 2 files changed, 91 insertions(+), 68 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index 03da2d37e..112d96f37 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -132,17 +132,10 @@ struct rte_lpm_rule_info { /** @internal LPM structure. */ struct rte_lpm { - /* LPM metadata. */ - char name[RTE_LPM_NAMESIZE]; /**< Name of the lpm. */ - uint32_t max_rules; /**< Max. balanced rules per lpm. */ - uint32_t number_tbl8s; /**< Number of tbl8s. */ - struct rte_lpm_rule_info rule_info[RTE_LPM_MAX_DEPTH]; /**< Rule info table. */ - /* LPM Tables. */ struct rte_lpm_tbl_entry tbl24[RTE_LPM_TBL24_NUM_ENTRIES] __rte_cache_aligned; /**< LPM tbl24 table. */ struct rte_lpm_tbl_entry *tbl8; /**< LPM tbl8 table. */ - struct rte_lpm_rule *rules_tbl; /**< LPM rules. */ }; Since this changes the ABI, does it not need advance notice? [Basically the return value point from rte_lpm_create() will be different, and that return value could be used by rte_lpm_lookup() which as a static inline function will be in the binary and u
Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data
Hi Kevin, We do need fields max_rules and number_tbl8s of struct rte_lpm, so the removal would force us to have another patch to our local copy of DPDK. We'd rather avoid this new local patch because we wish to eventually be in sync with the stock DPDK. Those fields are needed in Gatekeeper because we found a condition in an ongoing deployment in which the entries of some LPM tables may suddenly change a lot to reflect policy changes. To avoid getting into a state in which the LPM table is inconsistent because it cannot fit all the new entries, we compute the needed parameters to support the new entries, and compare with the current parameters. If the current table doesn't fit everything, we have to replace it with a new LPM table. If there were a way to obtain the struct rte_lpm_config of a given LPM table, it would cleanly address our need. We have the same need in IPv6 and have a local patch to work around it (see https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). Thus, an IPv4 and IPv6 solution would be best. PS: I've added Qiaobin Fu, another Gatekeeper maintainer, to this disscussion. [ ]'s Michel Machado On 10/13/20 9:53 AM, Kevin Traynor wrote: Hi Gatekeeper maintainers (I think), fyi - there is a proposal to remove some members of a struct in DPDK LPM API that Gatekeeper is using [1]. It would be only from DPDK 20.11 but as it's an LTS I guess it would probably hit Debian in a few months. The full thread is here: http://inbox.dpdk.org/dev/20200907081518.46350-1-ruifeng.w...@arm.com/ Maybe you can take a look and tell us if they are needed in Gatekeeper or you can workaround it? thanks, Kevin. [1] https://github.com/AltraMayor/gatekeeper/blob/master/gt/lua_lpm.c#L235-L248 On 09/10/2020 07:54, Ruifeng Wang wrote: -Original Message- From: Kevin Traynor Sent: Wednesday, September 30, 2020 4:46 PM To: Ruifeng Wang ; Medvedkin, Vladimir ; Bruce Richardson Cc: dev@dpdk.org; Honnappa Nagarahalli ; nd Subject: Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data On 16/09/2020 04:17, Ruifeng Wang wrote: -Original Message- From: Medvedkin, Vladimir Sent: Wednesday, September 16, 2020 12:28 AM To: Bruce Richardson ; Ruifeng Wang Cc: dev@dpdk.org; Honnappa Nagarahalli ; nd Subject: Re: [PATCH 2/2] lpm: hide internal data Hi Ruifeng, On 15/09/2020 17:02, Bruce Richardson wrote: On Mon, Sep 07, 2020 at 04:15:17PM +0800, Ruifeng Wang wrote: Fields except tbl24 and tbl8 in rte_lpm structure have no need to be exposed to the user. Hide the unneeded exposure of structure fields for better ABI maintainability. Suggested-by: David Marchand Signed-off-by: Ruifeng Wang Reviewed-by: Phil Yang --- lib/librte_lpm/rte_lpm.c | 152 +++--- - lib/librte_lpm/rte_lpm.h | 7 -- 2 files changed, 91 insertions(+), 68 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index 03da2d37e..112d96f37 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -132,17 +132,10 @@ struct rte_lpm_rule_info { /** @internal LPM structure. */ struct rte_lpm { - /* LPM metadata. */ - char name[RTE_LPM_NAMESIZE];/**< Name of the lpm. */ - uint32_t max_rules; /**< Max. balanced rules per lpm. */ - uint32_t number_tbl8s; /**< Number of tbl8s. */ - struct rte_lpm_rule_info rule_info[RTE_LPM_MAX_DEPTH]; /**< Rule info table. */ - /* LPM Tables. */ struct rte_lpm_tbl_entry tbl24[RTE_LPM_TBL24_NUM_ENTRIES] __rte_cache_aligned; /**< LPM tbl24 table. */ struct rte_lpm_tbl_entry *tbl8; /**< LPM tbl8 table. */ - struct rte_lpm_rule *rules_tbl; /**< LPM rules. 
*/ }; Since this changes the ABI, does it not need advance notice? [Basically the return value point from rte_lpm_create() will be different, and that return value could be used by rte_lpm_lookup() which as a static inline function will be in the binary and using the old structure offsets.] Agree with Bruce, this patch breaks ABI, so it can't be accepted without prior notice. So if the change wants to happen in 20.11, a deprecation notice should have been added in 20.08. I should have added a deprecation notice. This change will have to wait for next ABI update window. Do you plan to extend? or is this just speculative? It is speculative. A quick scan and there seems to be several projects using some of these members that you are proposing to hide. e.g. BESS, NFF-Go, DPVS, gatekeeper. I didn't look at the details to see if they are really needed. Not sure how much notice they'd need or if they update DPDK much, but I think it's worth having a closer look as to how they use lpm and what the impact to them is. Checked the projects listed above. BESS, NFF-Go and DPVS don't access the members to be hided. They
Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data
On 10/13/20 3:06 PM, Medvedkin, Vladimir wrote: On 13/10/2020 18:46, Michel Machado wrote: On 10/13/20 11:41 AM, Medvedkin, Vladimir wrote: Hi Michel, Could you please describe a condition when LPM gets inconsistent? As I can see if there is no free tbl8 it will return -ENOSPC. Consider this simple example, we need to add the following two prefixes with different next hops: 10.99.0.0/16, 18.99.99.128/25. If the LPM table is out of tbl8s, the second prefix is not added and Gatekeeper will make decisions in violation of the policy. The data structure of the LPM table is consistent, but its content inconsistent with the policy. Aha, thanks. So do I understand correctly that you need to add a set of routes atomically (either the entire set is installed or nothing)? Yes. If so, then I would suggest having 2 lpm and switching them atomically after a successful addition. As for now, even if you have enough tbl8's, routes are installed non atomically, i.e. there will be a time gap between adding two routes, so in this time interval the table will be inconsistent with the policy. Also, if new lpm algorithms are added to the DPDK, they won't have such a thing as tbl8. Our code already deals with synchronization. We minimize the need of replacing a LPM table by allocating LPM tables with the double of what we need (see example here https://github.com/AltraMayor/gatekeeper/blob/95d1d6e8201861a0d0c698bfd06ad606674f1e07/lua/examples/policy.lua#L172-L183), but the code must be ready for unexpected needs that may arise in production. Usually, the table is initialized with a large enough number of entries, enough to add a possible number of routes. One tbl8 group takes up 1Kb of memory which is nothing comparing to the size of tbl24 which is 64Mb. When the prefixes come from BGP, initializing a large enough table is fine. But when prefixes come from threat intelligence, the number of prefixes can vary wildly and the number of prefixes above 24 bits are way more common. P.S. consider using rte_fib library, it has a number of advantages over LPM. You can replace the loop in __lookup_fib_bulk() with a bulk lookup call and this will probably increase the speed. I'm not aware of the rte_fib library. The only documentation that I found on Google was https://doc.dpdk.org/api/rte__fib_8h.html and it just says "FIB (Forwarding information base) implementation for IPv4 Longest Prefix Match". On 13/10/2020 15:58, Michel Machado wrote: Hi Kevin, We do need fields max_rules and number_tbl8s of struct rte_lpm, so the removal would force us to have another patch to our local copy of DPDK. We'd rather avoid this new local patch because we wish to eventually be in sync with the stock DPDK. Those fields are needed in Gatekeeper because we found a condition in an ongoing deployment in which the entries of some LPM tables may suddenly change a lot to reflect policy changes. To avoid getting into a state in which the LPM table is inconsistent because it cannot fit all the new entries, we compute the needed parameters to support the new entries, and compare with the current parameters. If the current table doesn't fit everything, we have to replace it with a new LPM table. If there were a way to obtain the struct rte_lpm_config of a given LPM table, it would cleanly address our need. We have the same need in IPv6 and have a local patch to work around it (see https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). Thus, an IPv4 and IPv6 solution would be best. 
PS: I've added Qiaobin Fu, another Gatekeeper maintainer, to this disscussion. [ ]'s Michel Machado On 10/13/20 9:53 AM, Kevin Traynor wrote: Hi Gatekeeper maintainers (I think), fyi - there is a proposal to remove some members of a struct in DPDK LPM API that Gatekeeper is using [1]. It would be only from DPDK 20.11 but as it's an LTS I guess it would probably hit Debian in a few months. The full thread is here: http://inbox.dpdk.org/dev/20200907081518.46350-1-ruifeng.w...@arm.com/ Maybe you can take a look and tell us if they are needed in Gatekeeper or you can workaround it? thanks, Kevin. [1] https://github.com/AltraMayor/gatekeeper/blob/master/gt/lua_lpm.c#L235-L248 On 09/10/2020 07:54, Ruifeng Wang wrote: -Original Message- From: Kevin Traynor Sent: Wednesday, September 30, 2020 4:46 PM To: Ruifeng Wang ; Medvedkin, Vladimir ; Bruce Richardson Cc: dev@dpdk.org; Honnappa Nagarahalli ; nd Subject: Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data On 16/09/2020 04:17, Ruifeng Wang wrote: -Original Message- From: Medvedkin, Vladimir Sent: Wednesday, September 16, 2020 12:28 AM To: Bruce Richardson ; Ruifeng Wang Cc: dev@dpdk.org; Honnappa Nagarahalli ; nd Subject: Re: [PATCH 2/2] lpm: hide internal data Hi Ruifeng, On 15/09/2020 17:02, Bruce Ric
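For reference, a minimal sketch of the "build a second table and switch atomically" approach suggested earlier in the thread. It assumes a single writer, readers that load the active pointer atomically, and reclamation of the old table deferred elsewhere (e.g. via RCU); the names and the flat input arrays are illustrative.

#include <stdint.h>
#include <rte_lpm.h>

static struct rte_lpm *active_lpm;	/* dereferenced by the datapath */

static int
install_policy(const char *name, int socket_id,
		const struct rte_lpm_config *config,
		const uint32_t *ips, const uint8_t *depths,
		const uint32_t *next_hops, unsigned int n)
{
	struct rte_lpm *new_lpm, *old_lpm;
	unsigned int i;

	/* Name must differ from the live table's name. */
	new_lpm = rte_lpm_create(name, socket_id, config);
	if (new_lpm == NULL)
		return -1;

	for (i = 0; i < n; i++) {
		if (rte_lpm_add(new_lpm, ips[i], depths[i], next_hops[i]) < 0) {
			rte_lpm_free(new_lpm);	/* nothing was published yet */
			return -1;
		}
	}

	/* Publish the fully built table; either all routes or none are visible. */
	old_lpm = __atomic_exchange_n(&active_lpm, new_lpm, __ATOMIC_RELEASE);

	/* Free old_lpm only after readers are known to have moved on. */
	(void)old_lpm;
	return 0;
}

This keeps lookups hitting a table that is always consistent with some complete policy, at the cost of temporarily holding two tables in memory.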