Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function

2018-08-01 Thread Michel Machado

On 07/31/2018 10:57 AM, Wiles, Keith wrote:



On Jul 31, 2018, at 1:09 AM, Fu, Qiaobin  wrote:

Hi Yipeng,

Thanks for the feedback!


On Jul 30, 2018, at 4:24 PM, Wang, Yipeng1  wrote:

Hi, Qiaobin,

Thanks for the patch. If I understand correctly your use case is to use hash table as a 
"cache" that new entries should evict stale ones automatically. Could you be 
more specific on your use scenarios so that I can have more context?



Actually, we don’t use the hash table as a “cache”, and there is no automatic 
eviction of stale entries. Instead, the function rte_hash_bucket_iterate() allows the 
callers to iterate over the hash table incrementally, i.e., one bucket at a 
time, so that the caller doesn’t have to walk the whole hash table in one go. 
This avoids creating hiccups and achieves better cache performance. One possible 
scenario is during a DDoS attack, when one may want to devote more CPU time to 
processing packets, so iterating over the hash table incrementally is desirable.
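
As a minimal sketch of this budgeted-scan idea (illustration only, not part of the 
patch): the existing per-entry rte_hash_iterate() is used below, with an assumed 
SCAN_BUDGET value and a hypothetical expire_flow() hook; the patch would allow the 
same pattern at bucket granularity.

#include <rte_hash.h>

#define SCAN_BUDGET 64	/* entries examined per poll; the value is an assumption */

/* Hypothetical application hook that decides whether a flow is stale. */
static void expire_flow(const void *key, void *data);

/* Resume position carried across calls; starting from 0 begins a new pass. */
static uint32_t scan_pos;

static void
scan_some_flows(const struct rte_hash *h)
{
	const void *key;
	void *data;
	unsigned int budget = SCAN_BUDGET;

	while (budget-- > 0 &&
	       rte_hash_iterate(h, &key, &data, &scan_pos) >= 0)
		expire_flow(key, data);
	/* rte_hash_iterate() returns -ENOENT once the whole table has been
	 * visited; reset scan_pos to 0 to start another pass later. */
}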


I do not see this as a cache, but only a way to iterate over the hash table.




We are actually working on an extendable version of rte_hash which will link 
extra buckets to current table if the table is full. As long as the table is 
sized appropriately, there will be no conflict issues thus you don’t need to 
replace old entries. Will this solve your issue? Also, if the “cache” mode is 
what you want, we have the membership library “rte_member”. Is it applicable to 
your case?


I agree that adding extra buckets to the current table when it is full can 
alleviate this scenario. Indeed, we also considered this solution before coming 
up with our proposed one. However, it’s still highly desirable to have a bucket 
iterator function. Consider a scenario where the machine’s memory is limited and 
new memory cannot always be allocated for the hash table (e.g., DDoS attacks, 
switch measurement tasks, etc.); a solution is to allow the callers to evict 
some less important entries (the criteria for the eviction being based on the 
caller’s needs).

Indeed, we don’t have the “cache” mode; our implementation is a way to achieve 
better cache performance. So rte_member won’t help here.



w.r.t. the patch set you proposed, my concern is that the new API will expose 
too much of the internals to the user, e.g. the bucket/entries structure should be 
implementation dependent. Exposing internal structures would prevent improving 
the implementation down the road unless we keep changing API which is not 
recommended.



The functions we add here give a way for the callers to iterate over the 
specific bucket, which potentially contains the entry. These functions can be 
made general enough to allow callers to heuristically evict some entries, and 
add the new ones to the hash table. Otherwise, there is no way to evict some 
less important entries.


We have many other iterators in DPDK and this one is no different. If we can 
hide the internals, then it would be good. The callback function should not 
expose any internals, only the value in the hash table, which I believe it 
does, right?


   The use case that led to this iterator is the following: a hash 
table of flows is overloaded due to a DoS attack. The easy option is to 
ignore new flows, but this is not optimal when there's information 
pointing out that a given new flow is more important than one of the 
flows in the bucket in which the new flow must be inserted. So the 
high-level interpretation of this iterator is to find the candidates 
from which one must be dropped to add a new flow. That is why the patch 
adds rte_hash_get_primary_bucket() and rte_hash_get_secondary_bucket(). 
We don't need more than these candidates because the memory lookups 
would be overwhelming. In fact, we have found that the eight candidates 
that rte_hash_get_primary_bucket() gives are already good enough.


   Once we solved the problem above, we noticed that with small 
adjustments to the code, it would be easy to scan a hash table for 
stale entries one bucket at a time instead of one entry at a time (the 
regular iterator). The idea is that once one reads the bucket location, 
the information about all the entries comes together into the 
processor cache, so we can take advantage of the information that is 
already there.


   While the second feature is good to have, the first one is the 
feature we must have in our software.


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function

2018-08-03 Thread Michel Machado

On 07/31/2018 09:40 PM, Wang, Yipeng1 wrote:

How about an API that is more universal? For example, an API such as 
"rte_iterate_conflict_entries".  After an insertion failure, this function will 
iterate all entries that may conflict with the newly inserted key and you could decide 
which entry to evict?


   Fine. We'll rewrite the patch to do so.

   Thank you for the feedback.

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function

2018-08-08 Thread Michel Machado

On 08/03/2018 11:24 AM, Stephen Hemminger wrote:

Often for time based cleanup it is better to have a second linked list that is 
ordered by time value. Then the cleanup code can start at the oldest and stop 
when it reaches the last item that could expire.

That does mean having some form of lock and doing delete/insert on every usage.

i.e.
spinlock(&timer_lock);
TAILQ_REMOVE(&timer_list, entry, timer_list);
entry->expiration = new_time;
TAILQ_INSERT_TAIL(&timer_list, entry, timer_list);
spinunlock(&timer_lock);


   We'll try it. Thanks for the suggestion, Stephen.
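
   For reference, a fleshed-out sketch of that suggestion, assuming our own 
flow_entry type and TTL policy (the names flow_entry, timer_head, flow_touch and 
flow_expire are illustrative, not DPDK APIs):

#include <sys/queue.h>
#include <rte_spinlock.h>
#include <rte_cycles.h>

struct flow_entry {
	uint64_t expiration;			/* absolute deadline in TSC cycles */
	TAILQ_ENTRY(flow_entry) timer_list;	/* linkage in the time-ordered list */
	/* ... flow state ... */
};

static TAILQ_HEAD(, flow_entry) timer_head = TAILQ_HEAD_INITIALIZER(timer_head);
static rte_spinlock_t timer_lock = RTE_SPINLOCK_INITIALIZER;

/* On every use, move the entry to the tail so the list stays ordered by time. */
static void
flow_touch(struct flow_entry *entry, uint64_t ttl_cycles)
{
	rte_spinlock_lock(&timer_lock);
	TAILQ_REMOVE(&timer_head, entry, timer_list);
	entry->expiration = rte_get_tsc_cycles() + ttl_cycles;
	TAILQ_INSERT_TAIL(&timer_head, entry, timer_list);
	rte_spinlock_unlock(&timer_lock);
}

/* Cleanup starts at the oldest entry and stops at the first one not yet expired. */
static void
flow_expire(void (*evict)(struct flow_entry *))
{
	uint64_t now = rte_get_tsc_cycles();
	struct flow_entry *entry;

	rte_spinlock_lock(&timer_lock);
	while ((entry = TAILQ_FIRST(&timer_head)) != NULL &&
			entry->expiration <= now) {
		TAILQ_REMOVE(&timer_head, entry, timer_list);
		evict(entry);	/* called with timer_lock held in this sketch */
	}
	rte_spinlock_unlock(&timer_lock);
}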

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries

2018-08-21 Thread Michel Machado

On 08/16/2018 10:33 PM, Honnappa Nagarahalli wrote:

+/* Get the primary bucket index given the precomputed hash value. */
+static inline uint32_t rte_hash_get_primary_bucket(const struct
+rte_hash *h, hash_sig_t sig) {
+   return sig & h->bucket_bitmask;
+}
+
+/* Get the secondary bucket index given the precomputed hash value. */
+static inline uint32_t rte_hash_get_secondary_bucket(const struct
+rte_hash *h, hash_sig_t sig) {
+   return rte_hash_secondary_hash(sig) & h->bucket_bitmask; }
+
IMO, to keep the code consistent, we do not need to have the above 2 functions.


   Ok.


+int32_t __rte_experimental
+rte_hash_iterate_conflict_entries(struct rte_conflict_iterator_state *state,
+   const void **key, const void **data)
+{
+   struct rte_hash_iterator_conflict_entries_state *__state;
+
+   RETURN_IF_TRUE(((state == NULL) || (key == NULL) ||
+   (data == NULL)), -EINVAL);
+
+   __state = (struct rte_hash_iterator_conflict_entries_state *)state;
+
+   while (__state->vnext < RTE_HASH_BUCKET_ENTRIES * 2) {
+   uint32_t bidx = (__state->vnext < RTE_HASH_BUCKET_ENTRIES) ?
+   __state->primary_bidx : __state->secondary_bidx;
+   uint32_t next = __state->vnext & (RTE_HASH_BUCKET_ENTRIES - 1);
+   uint32_t position = __state->h->buckets[bidx].key_idx[next];
+   struct rte_hash_key *next_key;
+   /*
+* The test below is unlikely because this iterator is meant
+* to be used after a failed insert.
+* */
+   if (unlikely(position == EMPTY_SLOT))
+   goto next;
+
+   /* Get the entry in key table. */
+   next_key = (struct rte_hash_key *) (
+   (char *)__state->h->key_store +
+   position * __state->h->key_entry_size);
+   /* Return key and data. */
+   *key = next_key->key;
+   *data = next_key->pdata;
+
+next:
+   /* Increment iterator. */
+   __state->vnext++;
+
+   if (likely(position != EMPTY_SLOT))
+   return position - 1;
+   }
+
+   return -ENOENT;
+}


I think, we can make this API similar to 'rte_hash_iterate'. I suggest the 
following API signature:

int32_t
rte_hash_iterate_conflict_entries (const struct rte_hash *h, const void **key, 
void **data, hash_sig_t sig, uint32_t *next)


   The goal of our interface is to support changing the underlying hash 
table algorithm without requiring changes in applications. As Yipeng 
Wang exemplified in the discussion of the first version of this patch, 
"in future, rte_hash may use three hash functions, or as I mentioned 
each bucket may have an additional linked list or even a second level 
hash table, or if the hopscotch hash replaces cuckoo hash as the new 
algorithm." These new algorithms may require more state than sig and 
next can efficiently provide in order to browse the conflicting entries.



I also suggest to change the API name to ' rte_hash_iterate_bucket_entries' - 
'bucket' is a well understood term in the context of hash algorithms.


   It's a matter of semantics here. rte_hash_iterate_conflict_entries() 
may cross more than one bucket. In fact, the first version of this patch 
tried to do exactly that, but it exposes the underlying algorithm. In 
addition, future algorithms may stretch what is being browsed even further.



Do we also need to have 'rte_hash_iterate_conflict_entries_with_hash' API?


   I may have not understood the question. We are already working with 
the hash (i.e. sig). Did you mean something else?



diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h index 
f71ca9fbf..7ecb6a7eb 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -61,6 +61,11 @@ struct rte_hash_parameters {
  /** @internal A hash table structure. */  struct rte_hash;
  
+/** @internal A hash table conflict iterator state structure. */
+struct rte_conflict_iterator_state {
+   uint8_t space[64];
+};
+

The size depends on the current size of the state, which is subject to change 
with the algorithm used.


   We chose a size that should be robust for any future underlying 
algorithm. Do you have a suggestion on how to go about it? We chose to 
have a simple struct to enable applications to allocate a state as a 
local variable and avoid a memory allocation.


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries

2018-08-21 Thread Michel Machado

On 08/17/2018 03:41 PM, Honnappa Nagarahalli wrote:

Do we also need to have 'rte_hash_iterate_conflict_entries_with_hash' API?


 I may have not understood the question. We are already working with the 
hash (i.e. sig). Did you mean something else?

Let me elaborate. For the API 'rte_hash_lookup', there are multiple variations 
such as 'rte_hash_lookup_with_hash', 'rte_hash_lookup_data', 
'rte_hash_lookup_with_hash_data' etc. We do not need to create similar 
variations for 'rte_hash_iterate_conflict_entries' API right now. But the 
naming of the API should be such that these variations can be created in the 
future.


   So you mean that we should actually name 
rte_hash_iterator_conflict_entries_init() as 
rte_hash_iterator_conflict_entries_init_with_hash()? I'd be fine with this.



diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index f71ca9fbf..7ecb6a7eb 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -61,6 +61,11 @@ struct rte_hash_parameters {
   /** @internal A hash table structure. */  struct rte_hash;
   
+/** @internal A hash table conflict iterator state structure. */

+struct rte_conflict_iterator_state {
+   uint8_t space[64];
+};
+

Needs aligning to cache line.


   Ok.


The size depends on the current size of the state, which is subject to change 
with the algorithm used.


 We chose a size that should be robust for any future underlying algorithm. 
Do you have a suggestion on how to go about it? We chose to have a simple 
struct to enable applications to allocate a state as a local variable and avoid 
a memory allocation.

This looks fine after your explanation. The structure name can be changed to 
'rte_iterator_state' so that it can be used in other iterator APIs too.


   I like this suggestion. What about the name 
"rte_hash_iterator_state" to make it specific to the hash table?


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries

2018-08-21 Thread Michel Machado

On 08/17/2018 03:41 PM, Honnappa Nagarahalli wrote:

Can you elaborate more on using ' struct rte_conflict_iterator_state' as the 
argument for the API?

If the API signature is changed to: rte_hash_iterate_conflict_entries (const 
struct rte_hash *h, void **key, void **data, const hash_sig_t sig, struct 
rte_conflict_iterator_state *state) - it will be inline with the existing APIs. 
Contents of 'state' must be initialized to 0 for the first call. This will also 
avoid creating 'rte_hash_iterator_conflict_entries_init' API.


   Testing `state' every time rte_hash_iterate_conflict_entries() is 
called to find out if it's the first call of the iterator would add 
some small, but unnecessary, overhead to 
rte_hash_iterate_conflict_entries() and constraints on struct 
rte_conflict_iterator_state. Moreover, 
rte_hash_iterator_conflict_entries_init() enables one to easily add 
variations of the init function to initialize the state (e.g. using a 
key instead of a sig) and still use exactly the same iterator.
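
   To make the point concrete, a hedged sketch of two init variants feeding the 
same iterator; the key-based variant is hypothetical and simply reuses the 
existing rte_hash_hash() helper (the exact naming was still under discussion at 
this point):

/* v2 proposal: initialize the state from a precomputed signature. */
int rte_hash_iterator_conflict_entries_init(const struct rte_hash *h,
	hash_sig_t sig, struct rte_conflict_iterator_state *state);

/* Hypothetical variant: initialize from the key itself, and then reuse the
 * very same rte_hash_iterate_conflict_entries() afterwards. */
static inline int
rte_hash_iterator_conflict_entries_init_with_key(const struct rte_hash *h,
	const void *key, struct rte_conflict_iterator_state *state)
{
	return rte_hash_iterator_conflict_entries_init(h,
			rte_hash_hash(h, key), state);
}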


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries

2018-08-21 Thread Michel Machado

On 08/21/2018 01:10 AM, Honnappa Nagarahalli wrote:

On 08/17/2018 03:41 PM, Honnappa Nagarahalli wrote:

Can you elaborate more on using ' struct rte_conflict_iterator_state' as the 
argument for the API?

If the API signature is changed to: rte_hash_iterate_conflict_entries (const 
struct rte_hash *h, void **key, void **data, const hash_sig_t sig, struct 
rte_conflict_iterator_state *state) - it will be inline with the existing APIs. 
Contents of 'state' must be initialized to 0 for the first call. This will also 
avoid creating 'rte_hash_iterator_conflict_entries_init' API.


 Testing `state' every time rte_hash_iterate_conflict_entries() is called 
to find out if it's the first call of the iterator would add some small, but 
unnecessary, overhead to rte_hash_iterate_conflict_entries() and constraints 
on struct rte_conflict_iterator_state. Moreover, 
rte_hash_iterator_conflict_entries_init() enables one to easily add variations 
of the init function to initialize the state (e.g. using a key instead of a 
sig) and still use exactly the same iterator.

IMO, this overhead will be trivial. Looking at the function 
'rte_hash_iterate_conflict_entries' the check for '(__state->vnext < 
RTE_HASH_BUCKET_ENTRIES * 2)' already exists. If the primary/secondary bucket indices 
are calculated as well in 'rte_hash_iterate_conflict_entries' API ('rte_hash_iterate' 
API does such calculations), storing them in the state can be avoided. I am wondering 
if it makes sense to benchmark with these changes and then take a decision?


   We came up with the init function and struct 
rte_conflict_iterator_state in v2 to make the new iterator as 
future-proof as possible against a change of the underlying algorithm. 
But going through your feedback, it seems to me that your top concern 
is not to deviate much from the current interface of rte_hash_iterate(). 
We are fine with pushing v3 using the interface you've suggested to 
avoid the init function and struct rte_conflict_iterator_state:


int32_t
rte_hash_iterate_conflict_entries_with_hash(const struct rte_hash *h, 
const void **key, void **data, hash_sig_t sig, uint32_t *next);


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries

2018-08-25 Thread Michel Machado



On 08/23/2018 08:33 PM, Wang, Yipeng1 wrote:

I think with Honnappa suggested "uint32_t* next",
we may need a little bit tricks to make it work with the extra linked list.
The performance may not be optimal though comparing to your original approach.
Is this important to your use case?


   It is. We are developing a DDoS protection system, and have chosen 
DPDK because it was the fastest framework in the evaluations we 
considered. We need to find the conflicting entries when a critical flow 
table of our system is overloaded due to an ongoing attack, so the more 
efficiently we can evaluate the merits of an incoming flow against the 
conflicting flows already in the table, the higher the chance we find 
the flows that should be in the flow table.


   We've compromised with Honnappa under the understanding that once 
the underlying algorithm changes, there would be a review of the 
interface since even rte_hash_iterate() may be affected. I still think 
that the v2 we proposed is the best approach here because it isolates 
the interface from the underlying algorithm.


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v2] hash table: add an iterator over conflicting entries

2018-08-28 Thread Michel Machado

On 08/26/2018 11:12 PM, Honnappa Nagarahalli wrote:

On 08/23/2018 08:33 PM, Wang, Yipeng1 wrote:

I think with Honnappa suggested "uint32_t* next", we may need a little
bit tricks to make it work with the extra linked list.
The performance may not be optimal though comparing to your original approach.
Is this important to your use case?


 It is. We are developing a DDoS protection system, and have chosen DPDK 
because it was the fastest framework in the evaluations we considered. We need 
to find the conflicting entries when a critical flow table of our system is 
overloaded due to an ongoing attack, so the more efficiently we can evaluate the 
merits of an incoming flow against the conflicting flows already in the table, 
the higher the chance we find the flows that should be in the flow table.

 We've compromised with Honnappa under the understanding that once the 
underlying algorithm changes, there would be a review of the interface since 
even rte_hash_iterate() may be affected. I still think that the v2 we proposed 
is the best approach here because it isolates the interface from the underlying 
algorithm.

My only concern was to do with keeping the interfaces across APIs consistent. I 
am fine with changing 'uint32_t *next' as long as we change 'rte_hash_iterate' 
API as well.


   We'll patch rte_hash_iterate() as well in v3.

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-06 Thread Michel Machado

Hi Yipeng,

On 09/04/2018 02:55 PM, Wang, Yipeng1 wrote:

Do we need both of the state and istate struct? struct rte_hash_iterator_state  
seems not doing much.
How about we only have one "state" struct and just not expose the internals to 
the public API, similar to the
rte_hash struct or rte_member_setsum struct.
And in _init function use rte_malloc to allocate the state and add a _free 
function to free it.


   The purpose of having the state struct is to enable applications to 
allocate iterator states on their execution stack or to embed iterator 
states in larger structs to avoid an extra malloc()/free().
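
   For illustration only, the state embedded in a per-lcore context so that no 
allocation happens on the data path (struct lcore_ctx and handle_insert_failure() 
are ours, not DPDK APIs):

struct lcore_ctx {
	struct rte_hash_iterator_state it_state;	/* reused for every burst */
	/* ... other per-lcore fields ... */
};

static int
handle_insert_failure(struct lcore_ctx *ctx, const struct rte_hash *h,
		      hash_sig_t sig)
{
	/* Re-initializing the embedded state costs a few stores, not a malloc(). */
	return rte_hash_iterator_conflict_entries_init_with_hash(h, sig,
			&ctx->it_state);
}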


   Do you foresee that the upcoming new underlying algorithm of hash 
tables will need to dynamically allocate iterator states?


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-06 Thread Michel Machado
te *state, const void **key, void **data)


How about "rte_hash_dup_next()"?
Also, please break the parameter list instead of having an empty first
line.


   We are preserving the name convention used in DPDK (e.g. 
rte_hash_iterate()).


   We'll break the parameter list to avoid the empty first line.


diff --git a/lib/librte_hash/rte_hash_version.map 
b/lib/librte_hash/rte_hash_version.map
index e216ac8e2..301d4638c 100644
--- a/lib/librte_hash/rte_hash_version.map
+++ b/lib/librte_hash/rte_hash_version.map
@@ -24,6 +24,7 @@ DPDK_2.1 {
  
  	rte_hash_add_key_data;

rte_hash_add_key_with_hash_data;
+   rte_hash_iterator_init;
rte_hash_iterate;
rte_hash_lookup_bulk_data;
rte_hash_lookup_data;
@@ -53,3 +54,10 @@ DPDK_18.08 {
rte_hash_count;
  
  } DPDK_16.07;

+
+EXPERIMENTAL {
+   global:
+
+   rte_hash_iterator_conflict_entries_init_with_hash;
+   rte_hash_iterate_conflict_entries;
+};


A blank line may be missing here.


   Do you mean a blank line before "};"?

   Thank you for the careful review.

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-06 Thread Michel Machado

Hi Yipeng,

On 09/04/2018 03:51 PM, Wang, Yipeng1 wrote:

Hmm, I guess my comment is for code readability. If we don’t need the extra 
state that would be great.


   Notice that applications only see the public, opaque state. And the 
private versions are scoped to the C file where they are needed.



I think "rte_hash" is defined as an internal data structure but expose the type 
to the public header. Would this work?


   Exposing the private fields would bind the interface to the 
current implementation of the hash table. In the way we are proposing, 
one should be able to replace the underlying algorithm without touching 
the header files that applications use. But, yes, your solution would 
enable applications to allocate iterator states as local variables as well.



I propose to malloc inside function mostly because I think it is cleaner to the 
user. But your argument is
valid. Depending on use case I think it is OK.


   I don't know how other applications will use this iterator, but we 
use it when our application is already overloaded. So avoiding an 
expensive operation like malloc() is a win.



Another comment is you put the total_entry in the state, is it for performance 
of the rte_hash_iterate?


   We are saving one integer multiplication per call of 
rte_hash_iterate(). It's not a big deal, but since there's room in the 
state variable, we thought this would be a good idea because it grows 
with the size of the table. We didn't actually measure the effect of 
this decision.



If you use it to iterate conflict entries, especially If you reuse same "state" 
struct and init it again and again for different keys,
would this slow down the performance for your specific use case?


   Notice that the field total_entry only exists for 
rte_hash_iterate(). But even if total_entry were in the state of 
rte_hash_iterate_conflict_entries(), it would still save on the 
multiplication as long as rte_hash_iterate_conflict_entries() is called 
at least twice. Calling rte_hash_iterate_conflict_entries() once evens 
out, and calling rte_hash_iterate_conflict_entries() more times adds 
further savings. As a side note, in our application, whenever an 
iterator of conflicting entries is initialized, we call 
rte_hash_iterate_conflict_entries() at least once.



Also iterate_conflic_entry may need reader lock protection.


   We are going to add the reader lock protection. Thanks.

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-06 Thread Michel Machado

Hi Yipeng,

On 09/04/2018 04:57 PM, Wang, Yipeng1 wrote:

-Original Message-
From: Michel Machado [mailto:mic...@digirati.com.br]



Exposing the private fields would bind the interface with the
current implementation of the hash table. In the way we are proposing,
one should be able to replace the underlying algorithm and not touching
the header files that applications use. But, yes, your solution would
enable applications to allocate iterator states as local variables as well.



[Wang, Yipeng] I didn't mean to expose the private fields. But only the
Type. For example, rte_hash does not expose its private fields to users.
One can change the fields without changing API.


   The fact that struct rte_hash does not expose its private fields but 
only its type to applications means that a compiler cannot find out the 
byte length of struct rte_hash using only the header rte_hash.h. Thus, 
an application cannot allocate memory on its own (e.g. as a local 
variable) for a struct rte_hash. An application can, however, have a 
pointer to a struct rte_hash since the byte length of a pointer only 
depends on the architecture of the machine. This is the motivation 
behind having struct rte_hash_iterator_state in rte_hash.h only holding 
an array of bytes.
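
   In code form, the contrast looks roughly like this (sketch only; struct 
rte_hash_iterator_state is the struct proposed in this patch):

#include <rte_hash.h>

static void
allocation_contrast(const struct rte_hash_parameters *params)
{
	/* Opaque type: its size is unknown from the header, so applications
	 * can only hold a pointer and the library allocates the object. */
	struct rte_hash *h = rte_hash_create(params);

	/* Fixed-size blob: sizeof() is known from the header, so the state
	 * can live directly in the caller's stack frame. */
	struct rte_hash_iterator_state state;

	(void)h;
	(void)state;
}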


   There are good reasons to implement struct rte_hash as it is. For 
example, struct rte_hash can change its byte length between versions of 
DPDK even if applications are dynamically linked to DPDK and not 
recompiled. Moreover, a hash table is unlikely to be as short-lived as 
an iterator.


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-06 Thread Michel Machado
>key_entry_size);
/* Return key and data */
*key = next_key->key;
*data = next_key->pdata;
  
-	__hash_rw_reader_unlock(h);

+   __hash_rw_reader_unlock(__state->h);
  
  	/* Increment iterator */

-   (*next)++;
+   __state->next++;
This comment is not related to this change, it is better to place this inside 
the lock.


   Even though __state->next does not depend on the lock?


return position - 1;
  }
+
+/* istate stands for internal state. */
+struct rte_hash_iterator_conflict_entries_istate {
+   const struct rte_hash *h;
This can be moved outside of this structure.


   Discussed earlier.


+   uint32_t  vnext;
+   uint32_t  primary_bidx;
+   uint32_t  secondary_bidx;
+};
+
+int32_t __rte_experimental
+rte_hash_iterator_conflict_entries_init_with_hash(const struct rte_hash *h,
+   hash_sig_t sig, struct rte_hash_iterator_state *state) {
+   struct rte_hash_iterator_conflict_entries_istate *__state;
+
+   RETURN_IF_TRUE(((h == NULL) || (state == NULL)), -EINVAL);
+
+   __state = (struct rte_hash_iterator_conflict_entries_istate *)state;
+   __state->h = h;
+   __state->vnext = 0;
+
+   /* Get the primary bucket index given the precomputed hash value. */
+   __state->primary_bidx = sig & h->bucket_bitmask;
+   /* Get the secondary bucket index given the precomputed hash value. */
+   __state->secondary_bidx =
+   rte_hash_secondary_hash(sig) & h->bucket_bitmask;
+
+   return 0;
+}
IMO, as mentioned above, it is possible to avoid creating this API.


   Discussed earlier.


+
+int32_t __rte_experimental
+rte_hash_iterate_conflict_entries(
+   struct rte_hash_iterator_state *state, const void **key, void **data)
Signature of this API can be changed as follows:
rte_hash_iterate_conflict_entries(
struct rte_hash *h, const void **key, void **data, struct 
rte_hash_iterator_state *state)


   Discussed earlier.


+{
+   struct rte_hash_iterator_conflict_entries_istate *__state;
+
+   RETURN_IF_TRUE(((state == NULL) || (key == NULL) ||
+   (data == NULL)), -EINVAL);
+
+   __state = (struct rte_hash_iterator_conflict_entries_istate *)state;
+
+   while (__state->vnext < RTE_HASH_BUCKET_ENTRIES * 2) {
+   uint32_t bidx = __state->vnext < RTE_HASH_BUCKET_ENTRIES ?
+   __state->primary_bidx : __state->secondary_bidx;
+   uint32_t next = __state->vnext & (RTE_HASH_BUCKET_ENTRIES - 1);

take the reader lock before reading bucket entry


   Thanks for pointing this out. We are going to do so. The lock came 
in as we went through the versions of this patch.



+   uint32_t position = __state->h->buckets[bidx].key_idx[next];
+   struct rte_hash_key *next_key;
+
+   /* Increment iterator. */
+   __state->vnext++;
+
+   /*
+* The test below is unlikely because this iterator is meant
+* to be used after a failed insert.
+*/
+   if (unlikely(position == EMPTY_SLOT))
+   continue;
+
+   /* Get the entry in key table. */
+   next_key = (struct rte_hash_key *) (
+   (char *)__state->h->key_store +
+   position * __state->h->key_entry_size);
+   /* Return key and data. */
+   *key = next_key->key;
+   *data = next_key->pdata;
give the reader lock


   We'll do so.


+
+   return position - 1;
+   }
+
+   return -ENOENT;
+}
diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h index 
9e7d9315f..fdb01023e 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
  #include 
  #include 
  
+#include 

+
  #ifdef __cplusplus
  extern "C" {
  #endif
@@ -64,6 +66,16 @@ struct rte_hash_parameters {
  /** @internal A hash table structure. */  struct rte_hash;
  
+/**

+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @internal A hash table iterator state structure.
+ */
+struct rte_hash_iterator_state {
+   uint8_t space[64];
I would call this 'state'. 64 can be replaced by 'RTE_CACHE_LINE_SIZE'.


   Okay.

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-07 Thread Michel Machado
RIES;
-   idx = *next % RTE_HASH_BUCKET_ENTRIES;
+   bucket_idx = __state->next / RTE_HASH_BUCKET_ENTRIES;
+   idx = __state->next % RTE_HASH_BUCKET_ENTRIES;
}
-   __hash_rw_reader_lock(h);
+   __hash_rw_reader_lock(__state->h);
/* Get position of entry in key table */
-   position = h->buckets[bucket_idx].key_idx[idx];
-   next_key = (struct rte_hash_key *) ((char *)h->key_store +
-   position * h->key_entry_size);
+   position = __state->h->buckets[bucket_idx].key_idx[idx];
+   next_key = (struct rte_hash_key *) ((char *)__state->h->key_store +
+   position * __state->h->key_entry_size);
/* Return key and data */
*key = next_key->key;
*data = next_key->pdata;
   
-	__hash_rw_reader_unlock(h);

+   __hash_rw_reader_unlock(__state->h);
   
   	/* Increment iterator */

-   (*next)++;
+   __state->next++;
This comment is not related to this change, it is better to place this inside 
the lock.


 Even though __state->next does not depend on the lock?

It depends on if this API needs to be multi-thread safe. Interestingly, the 
documentation does not say it is multi-thread safe. If it has to be 
multi-thread safe, then the state also needs to be protected. For ex: what 
happens if the user uses a global variable for the state?


   If an application needs to share an iterator state between threads, 
it has to have a synchronization mechanism for that, as it would for any 
other shared variable. The lock above allows applications to share a 
hash table between threads; it has no semantics beyond that.



diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index 9e7d9315f..fdb01023e 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
   #include 
   #include 
   
+#include 

+
   #ifdef __cplusplus
   extern "C" {
   #endif
@@ -64,6 +66,16 @@ struct rte_hash_parameters {
   /** @internal A hash table structure. */  struct rte_hash;
   
+/**

+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @internal A hash table iterator state structure.
+ */
+struct rte_hash_iterator_state {
+   uint8_t space[64];
I would call this 'state'. 64 can be replaced by 'RTE_CACHE_LINE_SIZE'.


 Okay.


   I think we should not replace 64 with RTE_CACHE_LINE_SIZE because 
the ABI would change based on the architecture for which it's compiled.
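
   To spell out the ABI point (the macro name below is only an illustration of 
the "#define for 64" agreed later in this thread):

/* RTE_CACHE_LINE_SIZE is 64 on x86 but 128 on some Arm and POWER builds, so
 * sizing the array with it would give the struct a different layout per
 * architecture. A fixed constant keeps the ABI identical everywhere. */
#define RTE_HASH_ITERATOR_STATE_SIZE 64		/* illustrative name */

struct rte_hash_iterator_state {
	uint8_t space[RTE_HASH_ITERATOR_STATE_SIZE];
};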


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-07 Thread Michel Machado

On 09/05/2018 04:27 PM, Wang, Yipeng1 wrote:

Hmm I see, it falls back to my original thought to have malloc inside the init 
function..
Thanks for the explanation. :)

So I guess with your implementation, in future if we change the internal state 
to be larger,
the ABI will be broken.


   If that happens, yes, the ABI would need to change again. But this 
concern is overblown for two reasons. First, this event is unlikely to 
happen because struct rte_hash_iterator_state already allocates 64 
bytes, while struct rte_hash_iterator_istate and struct 
rte_hash_iterator_conflict_entries_istate consume 16 and 20 bytes, 
respectively. Thus, the complexity of the underlying hash algorithm 
would need to grow substantially to force the necessary state of these 
iterators to grow more than 4x and 3x, respectively. This is unlikely to 
happen, and, if it does, it would likely break the ABI somewhere else 
and have a high impact on applications anyway.


   Second, even if the unlikely event happens, all one would need to do 
is to increase the size of struct rte_hash_iterator_state, mark the new 
API as a new version, and applications would be ready for the new ABI 
just by recompiling.



BTW, this patch set also changes API so proper notice is needed.
People more familiar with API/ABI change policies may be able to help here.


   We'd be happy to get feedback on this aspect.


Just to confirm, is there any way, as I said, for your application to have some 
long-lived states and reuse them throughout the application so that you don’t 
have to have short-lived ones on the stack?


   Two things would need to happen for this to be possible. The init 
functions would need to accept previously allocated iterator states, 
that is, the init function would act as a reset of the state when acting 
on a previously allocated state. And applications would then need to 
carry these pre-allocated states around to avoid a malloc. In other 
words, we'd increase the complexity of the API.


   To emphasize that the cost of a malloc is not negligible: 
rte_malloc() needs to take a spinlock (see heap_alloc_on_socket()), do 
its work to allocate memory, and, if the first attempt fails, try to 
allocate the memory on other sockets (see the end of malloc_heap_alloc()). 
For an iterator that goes through the whole hash table, this cost may be 
acceptable, but for an iterator that goes through a couple of entries, 
it is a lot to add.


   This memory allocation concern is not new. Function 
rte_pktmbuf_read(), for example, lets applications pass buffers, which 
are often allocated on the execution stack, to avoid the malloc cost.
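
   For example, a typical rte_pktmbuf_read() call with a caller-provided buffer 
looks like this (sketch; the 64-byte size is arbitrary):

#include <rte_mbuf.h>

static void
peek_headers(const struct rte_mbuf *m)
{
	uint8_t scratch[64];	/* caller-provided storage on the stack */
	const void *hdr = rte_pktmbuf_read(m, 0, sizeof(scratch), scratch);

	if (hdr == NULL)
		return;		/* packet shorter than the requested length */
	/* hdr points either directly into the mbuf (contiguous case) or into
	 * scratch (the bytes were copied); no allocation happened either way. */
}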


[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH v3] hash table: add an iterator over conflicting entries

2018-09-21 Thread Michel Machado

On 09/12/2018 04:37 PM, Honnappa Nagarahalli wrote:

+int32_t
+rte_hash_iterator_init(const struct rte_hash *h,
+   struct rte_hash_iterator_state *state) {
+   struct rte_hash_iterator_istate *__state;
'__state' can be replaced by 's'.

+
+   RETURN_IF_TRUE(((h == NULL) || (state == NULL)), -EINVAL);
+
+   __state = (struct rte_hash_iterator_istate *)state;
+   __state->h = h;
+   __state->next = 0;
+   __state->total_entries = h->num_buckets * RTE_HASH_BUCKET_ENTRIES;
+
+   return 0;
+}
IMO, creating this API can be avoided if the initialization is handled in 
'rte_hash_iterate' function. The cost of doing this is very trivial (one extra 
'if' statement) in 'rte_hash_iterate' function. It will help keep the number of 
APIs to minimal.


  Applications would have to initialize struct rte_hash_iterator_state 
*state before calling rte_hash_iterate() anyway. Why not initialize the 
fields of a state only once?

My concern is about creating another API for every iterator API. You have a 
valid point on saving cycles as this API applies for data plane. Have you done 
any performance benchmarking with and without this API? May be we can guide our 
decision based on that.


 It's not just about creating one init function for each iterator because 
an iterator may have a couple of init functions. For example, someone may 
eventually find useful to add another init function for the conflicting-entry 
iterator that we are advocating in this patch. A possibility would be for this 
new init function to use the key of the new entry instead of its signature to 
initialize the state, similar to what is already done in the rte_hash_lookup*() 
functions. In spite of possibly having multiple init functions, there will be a 
single iterator function.

 About the performance benchmarking, the current API only requires 
applications to initialize a single 32-bit integer. But with the adoption of a 
struct for the state, the initialization will grow to 64 bytes.

As my tests showed, I do not see any impact of this.


   Ok, we are going to eliminate the init functions in v4.


diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index 9e7d9315f..fdb01023e 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
#include 
#include 

+#include 

+
#ifdef __cplusplus
extern "C" {
#endif
@@ -64,6 +66,16 @@ struct rte_hash_parameters {
/** @internal A hash table structure. */  struct rte_hash;

+/**

+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @internal A hash table iterator state structure.
+ */
+struct rte_hash_iterator_state {
+   uint8_t space[64];
I would call this 'state'. 64 can be replaced by 'RTE_CACHE_LINE_SIZE'.


  Okay.


 I think we should not replace 64 with RTE_CACHE_LINE_SIZE because the ABI 
would change based on the architecture for which it's compiled.

Ok. Maybe have a #define for 64?


   Ok.

[ ]'s
Michel Machado


Re: [dpdk-dev] [PATCH 3/3] examples/bpf: remove duplicate mbuf definition

2019-08-20 Thread Michel Machado
Since rte_mbuf_core.h requires generic/rte_atomic.h, shouldn't 
rte_mbuf_core.h include generic/rte_atomic.h instead of forcing users of 
rte_mbuf_core.h to be aware of this dependency?


[ ]'s
Michel Machado

On 8/16/19 8:53 AM, Konstantin Ananyev wrote:

Get rid of duplicate definitions for rte_mbuf and related macros.
Include rte_mbuf_core.h instead.

Signed-off-by: Konstantin Ananyev 
---
  examples/bpf/mbuf.h | 605 
  examples/bpf/t2.c   |   5 +-
  examples/bpf/t3.c   |   3 +-
  3 files changed, 5 insertions(+), 608 deletions(-)
  delete mode 100644 examples/bpf/mbuf.h

diff --git a/examples/bpf/mbuf.h b/examples/bpf/mbuf.h
deleted file mode 100644
index e41c3d26e..0
--- a/examples/bpf/mbuf.h
+++ /dev/null
@@ -1,605 +0,0 @@
-/* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation.
- * Copyright 2014 6WIND S.A.
- */
-
-/*
- * Snipper from dpdk.org rte_mbuf.h.
- * used to provide BPF programs information about rte_mbuf layout.
- */
-
-#ifndef _MBUF_H_
-#define _MBUF_H_
-
-#include 
-#include 
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-/*
- * Packet Offload Features Flags. It also carry packet type information.
- * Critical resources. Both rx/tx shared these bits. Be cautious on any change
- *
- * - RX flags start at bit position zero, and get added to the left of previous
- *   flags.
- * - The most-significant 3 bits are reserved for generic mbuf flags
- * - TX flags therefore start at bit position 60 (i.e. 63-3), and new flags get
- *   added to the right of the previously defined flags i.e. they should count
- *   downwards, not upwards.
- *
- * Keep these flags synchronized with rte_get_rx_ol_flag_name() and
- * rte_get_tx_ol_flag_name().
- */
-
-/**
- * RX packet is a 802.1q VLAN packet. This flag was set by PMDs when
- * the packet is recognized as a VLAN, but the behavior between PMDs
- * was not the same. This flag is kept for some time to avoid breaking
- * applications and should be replaced by PKT_RX_VLAN_STRIPPED.
- */
-#define PKT_RX_VLAN_PKT  (1ULL << 0)
-
-#define PKT_RX_RSS_HASH  (1ULL << 1)
-/**< RX packet with RSS hash result. */
-#define PKT_RX_FDIR  (1ULL << 2)
-/**< RX packet with FDIR match indicate. */
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_L4_CKSUM_MASK.
- * This flag was set when the L4 checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_L4_CKSUM_BAD  (1ULL << 3)
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_IP_CKSUM_MASK.
- * This flag was set when the IP checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_IP_CKSUM_BAD  (1ULL << 4)
-
-#define PKT_RX_EIP_CKSUM_BAD (1ULL << 5)
-/**< External IP header checksum error. */
-
-/**
- * A vlan has been stripped by the hardware and its tci is saved in
- * mbuf->vlan_tci. This can only happen if vlan stripping is enabled
- * in the RX configuration of the PMD.
- */
-#define PKT_RX_VLAN_STRIPPED (1ULL << 6)
-
-/**
- * Mask of bits used to determine the status of RX IP checksum.
- * - PKT_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- * - PKT_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- * - PKT_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid
- * - PKT_RX_IP_CKSUM_NONE: the IP checksum is not correct in the packet
- *   data, but the integrity of the IP header is verified.
- */
-#define PKT_RX_IP_CKSUM_MASK ((1ULL << 4) | (1ULL << 7))
-
-#define PKT_RX_IP_CKSUM_UNKNOWN 0
-#define PKT_RX_IP_CKSUM_BAD (1ULL << 4)
-#define PKT_RX_IP_CKSUM_GOOD(1ULL << 7)
-#define PKT_RX_IP_CKSUM_NONE((1ULL << 4) | (1ULL << 7))
-
-/**
- * Mask of bits used to determine the status of RX L4 checksum.
- * - PKT_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
- * - PKT_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
- * - PKT_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
- * - PKT_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
- *   data, but the integrity of the L4 data is verified.
- */
-#define PKT_RX_L4_CKSUM_MASK ((1ULL << 3) | (1ULL << 8))
-
-#define PKT_RX_L4_CKSUM_UNKNOWN 0
-#define PKT_RX_L4_CKSUM_BAD (1ULL << 3)
-#define PKT_RX_L4_CKSUM_GOOD(1ULL << 8)
-#define PKT_RX_L4_CKSUM_NONE((1ULL << 3) | (1ULL << 8))
-
-#define PKT_RX_IEEE1588_PTP  (1ULL << 9)
-/**< RX IEEE1588 L2 Ethernet PT Packet. */
-#define PKT_RX_IEEE1588_TMST (1ULL << 10)
-/**< RX IEEE1588 L2/L4 timestamped packet.*/
-#define PKT_RX_FDIR_ID   (1ULL << 13)
-/**< FD id reported if FDIR match. */
-#define PKT_RX_FDIR_FLX  (1ULL << 14)
-/**< Flexible bytes reported if FDIR match. */
-
-/**
- * The 2 vlans have been stripped by 

Re: [dpdk-dev] [PATCH 1/3] eal: move CACHE and IOVA related definitions

2019-08-20 Thread Michel Machado

Acked-by: Michel Machado 

[ ]'s
Michel Machado

On 8/16/19 8:53 AM, Konstantin Ananyev wrote:

Right now RTE_CACHE_ and IOVA definitions are located inside rte_memory.h
That might cause an unwanted inclusions of arch/os specific header files.
See [1] for particular problem example.
Probably the simplest way to deal with such problems -
move these definitions into rte_commmon.h

Note that this move doesn't introduce any change in functionality.

[1] https://bugs.dpdk.org/show_bug.cgi?id=321

Suggested-by: Vipin Varghese 
Signed-off-by: Konstantin Ananyev 
---
  lib/librte_eal/common/include/rte_common.h | 44 ++
  lib/librte_eal/common/include/rte_memory.h | 38 ---
  2 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_common.h 
b/lib/librte_eal/common/include/rte_common.h
index 05a3a6401..c275093d7 100644
--- a/lib/librte_eal/common/include/rte_common.h
+++ b/lib/librte_eal/common/include/rte_common.h
@@ -291,6 +291,50 @@ rte_is_aligned(void *ptr, unsigned align)
   */
  #define RTE_BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
  
+/*** RTE_CACHE related macros /

+
+#define RTE_CACHE_LINE_MASK (RTE_CACHE_LINE_SIZE-1) /**< Cache line mask. */
+
+#define RTE_CACHE_LINE_ROUNDUP(size) \
+   (RTE_CACHE_LINE_SIZE * ((size + RTE_CACHE_LINE_SIZE - 1) / \
+   RTE_CACHE_LINE_SIZE))
+/**< Return the first cache-aligned value greater or equal to size. */
+
+/**< Cache line size in terms of log2 */
+#if RTE_CACHE_LINE_SIZE == 64
+#define RTE_CACHE_LINE_SIZE_LOG2 6
+#elif RTE_CACHE_LINE_SIZE == 128
+#define RTE_CACHE_LINE_SIZE_LOG2 7
+#else
+#error "Unsupported cache line size"
+#endif
+
+#define RTE_CACHE_LINE_MIN_SIZE 64 /**< Minimum Cache line size. */
+
+/**
+ * Force alignment to cache line.
+ */
+#define __rte_cache_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
+
+/**
+ * Force minimum cache line alignment.
+ */
+#define __rte_cache_min_aligned __rte_aligned(RTE_CACHE_LINE_MIN_SIZE)
+
+/*** PA/IOVA type definitions /
+
+typedef uint64_t phys_addr_t; /**< Physical address. */
+#define RTE_BAD_PHYS_ADDR ((phys_addr_t)-1)
+/**
+ * IO virtual address type.
+ * When the physical addressing mode (IOVA as PA) is in use,
+ * the translation from an IO virtual address (IOVA) to a physical address
+ * is a direct mapping, i.e. the same value.
+ * Otherwise, in virtual mode (IOVA as VA), an IOMMU may do the translation.
+ */
+typedef uint64_t rte_iova_t;
+#define RTE_BAD_IOVA ((rte_iova_t)-1)
+
  /**
   * Combines 32b inputs most significant set bits into the least
   * significant bits to construct a value with the same MSBs as x
diff --git a/lib/librte_eal/common/include/rte_memory.h 
b/lib/librte_eal/common/include/rte_memory.h
index 4717dcb43..38e00e382 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -39,44 +39,6 @@ enum rte_page_sizes {
  };
  
  #define SOCKET_ID_ANY -1/**< Any NUMA socket. */

-#define RTE_CACHE_LINE_MASK (RTE_CACHE_LINE_SIZE-1) /**< Cache line mask. */
-
-#define RTE_CACHE_LINE_ROUNDUP(size) \
-   (RTE_CACHE_LINE_SIZE * ((size + RTE_CACHE_LINE_SIZE - 1) / 
RTE_CACHE_LINE_SIZE))
-/**< Return the first cache-aligned value greater or equal to size. */
-
-/**< Cache line size in terms of log2 */
-#if RTE_CACHE_LINE_SIZE == 64
-#define RTE_CACHE_LINE_SIZE_LOG2 6
-#elif RTE_CACHE_LINE_SIZE == 128
-#define RTE_CACHE_LINE_SIZE_LOG2 7
-#else
-#error "Unsupported cache line size"
-#endif
-
-#define RTE_CACHE_LINE_MIN_SIZE 64 /**< Minimum Cache line size. */
-
-/**
- * Force alignment to cache line.
- */
-#define __rte_cache_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
-
-/**
- * Force minimum cache line alignment.
- */
-#define __rte_cache_min_aligned __rte_aligned(RTE_CACHE_LINE_MIN_SIZE)
-
-typedef uint64_t phys_addr_t; /**< Physical address. */
-#define RTE_BAD_PHYS_ADDR ((phys_addr_t)-1)
-/**
- * IO virtual address type.
- * When the physical addressing mode (IOVA as PA) is in use,
- * the translation from an IO virtual address (IOVA) to a physical address
- * is a direct mapping, i.e. the same value.
- * Otherwise, in virtual mode (IOVA as VA), an IOMMU may do the translation.
- */
-typedef uint64_t rte_iova_t;
-#define RTE_BAD_IOVA ((rte_iova_t)-1)
  
  /**

   * Physical memory segment descriptor.



Re: [dpdk-dev] [PATCH 2/3] mbuf: move mbuf definition into a separate file

2019-08-20 Thread Michel Machado

Acked-by: Michel Machado 

[ ]'s
Michel Machado

On 8/16/19 8:53 AM, Konstantin Ananyev wrote:

Right now inclusion of rte_mbuf.h header can cause inclusion of
some arch/os specific headers.
That prevents it to be included directly by some
non-DPDK (but related) entities: KNI, BPF programs, etc.
To overcome that problem usually a separate definitions of rte_mbuf
structure is created within these entities.
That approach has a lot of drawbacks: code duplication, error prone, etc.
This patch moves rte_mbuf structure definition (and some related macros)
into a separate file that can be included by both rte_mbuf.h and
other non-DPDK entities.

Note that it doesn't introduce any change for current DPDK code.

Signed-off-by: Konstantin Ananyev 
---
  lib/librte_mbuf/Makefile|   5 +-
  lib/librte_mbuf/meson.build |   3 +-
  lib/librte_mbuf/rte_mbuf.h  | 738 +
  lib/librte_mbuf/rte_mbuf_core.h | 793 
  4 files changed, 800 insertions(+), 739 deletions(-)
  create mode 100644 lib/librte_mbuf/rte_mbuf_core.h

diff --git a/lib/librte_mbuf/Makefile b/lib/librte_mbuf/Makefile
index c8f6d2689..f3b76ad23 100644
--- a/lib/librte_mbuf/Makefile
+++ b/lib/librte_mbuf/Makefile
@@ -19,6 +19,9 @@ LIBABIVER := 5
  SRCS-$(CONFIG_RTE_LIBRTE_MBUF) := rte_mbuf.c rte_mbuf_ptype.c 
rte_mbuf_pool_ops.c
  
  # install includes

-SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include := rte_mbuf.h rte_mbuf_ptype.h 
rte_mbuf_pool_ops.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include := rte_mbuf.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include += rte_mbuf_core.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include += rte_mbuf_ptype.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include += rte_mbuf_pool_ops.h
  
  include $(RTE_SDK)/mk/rte.lib.mk

diff --git a/lib/librte_mbuf/meson.build b/lib/librte_mbuf/meson.build
index 6cc11ebb4..36bb6eb9d 100644
--- a/lib/librte_mbuf/meson.build
+++ b/lib/librte_mbuf/meson.build
@@ -3,7 +3,8 @@
  
  version = 5

  sources = files('rte_mbuf.c', 'rte_mbuf_ptype.c', 'rte_mbuf_pool_ops.c')
-headers = files('rte_mbuf.h', 'rte_mbuf_ptype.h', 'rte_mbuf_pool_ops.h')
+headers = files('rte_mbuf.h', 'rte_mbuf_core.h',
+   'rte_mbuf_ptype.h', 'rte_mbuf_pool_ops.h')
  deps += ['mempool']
  
  allow_experimental_apis = true

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 98225ec80..528b2f121 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -42,367 +42,12 @@
  #include 
  #include 
  #include 
+#include 
  
  #ifdef __cplusplus

  extern "C" {
  #endif
  
-/*

- * Packet Offload Features Flags. It also carry packet type information.
- * Critical resources. Both rx/tx shared these bits. Be cautious on any change
- *
- * - RX flags start at bit position zero, and get added to the left of previous
- *   flags.
- * - The most-significant 3 bits are reserved for generic mbuf flags
- * - TX flags therefore start at bit position 60 (i.e. 63-3), and new flags get
- *   added to the right of the previously defined flags i.e. they should count
- *   downwards, not upwards.
- *
- * Keep these flags synchronized with rte_get_rx_ol_flag_name() and
- * rte_get_tx_ol_flag_name().
- */
-
-/**
- * The RX packet is a 802.1q VLAN packet, and the tci has been
- * saved in in mbuf->vlan_tci.
- * If the flag PKT_RX_VLAN_STRIPPED is also present, the VLAN
- * header has been stripped from mbuf data, else it is still
- * present.
- */
-#define PKT_RX_VLAN  (1ULL << 0)
-
-#define PKT_RX_RSS_HASH  (1ULL << 1)  /**< RX packet with RSS hash result. 
*/
-#define PKT_RX_FDIR  (1ULL << 2)  /**< RX packet with FDIR match 
indicate. */
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_L4_CKSUM_MASK.
- * This flag was set when the L4 checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_L4_CKSUM_BAD  (1ULL << 3)
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_IP_CKSUM_MASK.
- * This flag was set when the IP checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_IP_CKSUM_BAD  (1ULL << 4)
-
-#define PKT_RX_EIP_CKSUM_BAD (1ULL << 5)  /**< External IP header checksum 
error. */
-
-/**
- * A vlan has been stripped by the hardware and its tci is saved in
- * mbuf->vlan_tci. This can only happen if vlan stripping is enabled
- * in the RX configuration of the PMD.
- * When PKT_RX_VLAN_STRIPPED is set, PKT_RX_VLAN must also be set.
- */
-#define PKT_RX_VLAN_STRIPPED (1ULL << 6)
-
-/**
- * Mask of bits used to determine the status of RX IP checksum.
- * - PKT_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- * - PKT_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- * - PKT_R

Re: [dpdk-dev] [PATCH v2 1/3] eal: move CACHE and IOVA related definitions

2019-10-02 Thread Michel Machado

Acked-by: Michel Machado 

[ ]'s
Michel Machado

On 9/27/19 9:50 AM, Konstantin Ananyev wrote:

Right now RTE_CACHE_ and IOVA definitions are located inside rte_memory.h
That might cause an unwanted inclusions of arch/os specific header files.
See [1] for particular problem example.
Probably the simplest way to deal with such problems -
move these definitions into rte_commmon.h

Note that this move doesn't introduce any change in functionality.

[1] https://bugs.dpdk.org/show_bug.cgi?id=321

Suggested-by: Vipin Varghese 
Signed-off-by: Konstantin Ananyev 
---
  lib/librte_eal/common/include/rte_common.h | 44 ++
  lib/librte_eal/common/include/rte_memory.h | 38 ---
  2 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_common.h 
b/lib/librte_eal/common/include/rte_common.h
index 05a3a6401..c275093d7 100644
--- a/lib/librte_eal/common/include/rte_common.h
+++ b/lib/librte_eal/common/include/rte_common.h
@@ -291,6 +291,50 @@ rte_is_aligned(void *ptr, unsigned align)
   */
  #define RTE_BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
  
+/*** RTE_CACHE related macros /

+
+#define RTE_CACHE_LINE_MASK (RTE_CACHE_LINE_SIZE-1) /**< Cache line mask. */
+
+#define RTE_CACHE_LINE_ROUNDUP(size) \
+   (RTE_CACHE_LINE_SIZE * ((size + RTE_CACHE_LINE_SIZE - 1) / \
+   RTE_CACHE_LINE_SIZE))
+/**< Return the first cache-aligned value greater or equal to size. */
+
+/**< Cache line size in terms of log2 */
+#if RTE_CACHE_LINE_SIZE == 64
+#define RTE_CACHE_LINE_SIZE_LOG2 6
+#elif RTE_CACHE_LINE_SIZE == 128
+#define RTE_CACHE_LINE_SIZE_LOG2 7
+#else
+#error "Unsupported cache line size"
+#endif
+
+#define RTE_CACHE_LINE_MIN_SIZE 64 /**< Minimum Cache line size. */
+
+/**
+ * Force alignment to cache line.
+ */
+#define __rte_cache_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
+
+/**
+ * Force minimum cache line alignment.
+ */
+#define __rte_cache_min_aligned __rte_aligned(RTE_CACHE_LINE_MIN_SIZE)
+
+/*** PA/IOVA type definitions /
+
+typedef uint64_t phys_addr_t; /**< Physical address. */
+#define RTE_BAD_PHYS_ADDR ((phys_addr_t)-1)
+/**
+ * IO virtual address type.
+ * When the physical addressing mode (IOVA as PA) is in use,
+ * the translation from an IO virtual address (IOVA) to a physical address
+ * is a direct mapping, i.e. the same value.
+ * Otherwise, in virtual mode (IOVA as VA), an IOMMU may do the translation.
+ */
+typedef uint64_t rte_iova_t;
+#define RTE_BAD_IOVA ((rte_iova_t)-1)
+
  /**
   * Combines 32b inputs most significant set bits into the least
   * significant bits to construct a value with the same MSBs as x
diff --git a/lib/librte_eal/common/include/rte_memory.h 
b/lib/librte_eal/common/include/rte_memory.h
index 4717dcb43..38e00e382 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -39,44 +39,6 @@ enum rte_page_sizes {
  };
  
  #define SOCKET_ID_ANY -1/**< Any NUMA socket. */

-#define RTE_CACHE_LINE_MASK (RTE_CACHE_LINE_SIZE-1) /**< Cache line mask. */
-
-#define RTE_CACHE_LINE_ROUNDUP(size) \
-   (RTE_CACHE_LINE_SIZE * ((size + RTE_CACHE_LINE_SIZE - 1) / 
RTE_CACHE_LINE_SIZE))
-/**< Return the first cache-aligned value greater or equal to size. */
-
-/**< Cache line size in terms of log2 */
-#if RTE_CACHE_LINE_SIZE == 64
-#define RTE_CACHE_LINE_SIZE_LOG2 6
-#elif RTE_CACHE_LINE_SIZE == 128
-#define RTE_CACHE_LINE_SIZE_LOG2 7
-#else
-#error "Unsupported cache line size"
-#endif
-
-#define RTE_CACHE_LINE_MIN_SIZE 64 /**< Minimum Cache line size. */
-
-/**
- * Force alignment to cache line.
- */
-#define __rte_cache_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
-
-/**
- * Force minimum cache line alignment.
- */
-#define __rte_cache_min_aligned __rte_aligned(RTE_CACHE_LINE_MIN_SIZE)
-
-typedef uint64_t phys_addr_t; /**< Physical address. */
-#define RTE_BAD_PHYS_ADDR ((phys_addr_t)-1)
-/**
- * IO virtual address type.
- * When the physical addressing mode (IOVA as PA) is in use,
- * the translation from an IO virtual address (IOVA) to a physical address
- * is a direct mapping, i.e. the same value.
- * Otherwise, in virtual mode (IOVA as VA), an IOMMU may do the translation.
- */
-typedef uint64_t rte_iova_t;
-#define RTE_BAD_IOVA ((rte_iova_t)-1)
  
  /**

   * Physical memory segment descriptor.



Re: [dpdk-dev] [PATCH v2 2/3] mbuf: move mbuf definition into a separate file

2019-10-02 Thread Michel Machado

Acked-by: Michel Machado 

[ ]'s
Michel Machado

On 9/27/19 9:50 AM, Konstantin Ananyev wrote:

Right now inclusion of rte_mbuf.h header can cause inclusion of
some arch/os specific headers.
That prevents it to be included directly by some
non-DPDK (but related) entities: KNI, BPF programs, etc.
To overcome that problem usually a separate definitions of rte_mbuf
structure is created within these entities.
That approach has a lot of drawbacks: code duplication, error prone, etc.
This patch moves rte_mbuf structure definition (and some related macros)
into a separate file that can be included by both rte_mbuf.h and
other non-DPDK entities.

Note that it doesn't introduce any change for current DPDK code.

Signed-off-by: Konstantin Ananyev 
---
  lib/librte_mbuf/Makefile|   5 +-
  lib/librte_mbuf/meson.build |   3 +-
  lib/librte_mbuf/rte_mbuf.h  | 738 +
  lib/librte_mbuf/rte_mbuf_core.h | 793 
  4 files changed, 800 insertions(+), 739 deletions(-)
  create mode 100644 lib/librte_mbuf/rte_mbuf_core.h

diff --git a/lib/librte_mbuf/Makefile b/lib/librte_mbuf/Makefile
index c8f6d2689..f3b76ad23 100644
--- a/lib/librte_mbuf/Makefile
+++ b/lib/librte_mbuf/Makefile
@@ -19,6 +19,9 @@ LIBABIVER := 5
  SRCS-$(CONFIG_RTE_LIBRTE_MBUF) := rte_mbuf.c rte_mbuf_ptype.c 
rte_mbuf_pool_ops.c
  
  # install includes

-SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include := rte_mbuf.h rte_mbuf_ptype.h 
rte_mbuf_pool_ops.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include := rte_mbuf.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include += rte_mbuf_core.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include += rte_mbuf_ptype.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_MBUF)-include += rte_mbuf_pool_ops.h
  
  include $(RTE_SDK)/mk/rte.lib.mk

diff --git a/lib/librte_mbuf/meson.build b/lib/librte_mbuf/meson.build
index 6cc11ebb4..36bb6eb9d 100644
--- a/lib/librte_mbuf/meson.build
+++ b/lib/librte_mbuf/meson.build
@@ -3,7 +3,8 @@
  
  version = 5

  sources = files('rte_mbuf.c', 'rte_mbuf_ptype.c', 'rte_mbuf_pool_ops.c')
-headers = files('rte_mbuf.h', 'rte_mbuf_ptype.h', 'rte_mbuf_pool_ops.h')
+headers = files('rte_mbuf.h', 'rte_mbuf_core.h',
+   'rte_mbuf_ptype.h', 'rte_mbuf_pool_ops.h')
  deps += ['mempool']
  
  allow_experimental_apis = true

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 98225ec80..528b2f121 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -42,367 +42,12 @@
  #include 
  #include 
  #include 
+#include 
  
  #ifdef __cplusplus

  extern "C" {
  #endif
  
-/*

- * Packet Offload Features Flags. It also carry packet type information.
- * Critical resources. Both rx/tx shared these bits. Be cautious on any change
- *
- * - RX flags start at bit position zero, and get added to the left of previous
- *   flags.
- * - The most-significant 3 bits are reserved for generic mbuf flags
- * - TX flags therefore start at bit position 60 (i.e. 63-3), and new flags get
- *   added to the right of the previously defined flags i.e. they should count
- *   downwards, not upwards.
- *
- * Keep these flags synchronized with rte_get_rx_ol_flag_name() and
- * rte_get_tx_ol_flag_name().
- */
-
-/**
- * The RX packet is a 802.1q VLAN packet, and the tci has been
- * saved in in mbuf->vlan_tci.
- * If the flag PKT_RX_VLAN_STRIPPED is also present, the VLAN
- * header has been stripped from mbuf data, else it is still
- * present.
- */
-#define PKT_RX_VLAN  (1ULL << 0)
-
-#define PKT_RX_RSS_HASH  (1ULL << 1)  /**< RX packet with RSS hash result. 
*/
-#define PKT_RX_FDIR  (1ULL << 2)  /**< RX packet with FDIR match 
indicate. */
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_L4_CKSUM_MASK.
- * This flag was set when the L4 checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_L4_CKSUM_BAD  (1ULL << 3)
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_IP_CKSUM_MASK.
- * This flag was set when the IP checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_IP_CKSUM_BAD  (1ULL << 4)
-
-#define PKT_RX_EIP_CKSUM_BAD (1ULL << 5)  /**< External IP header checksum 
error. */
-
-/**
- * A vlan has been stripped by the hardware and its tci is saved in
- * mbuf->vlan_tci. This can only happen if vlan stripping is enabled
- * in the RX configuration of the PMD.
- * When PKT_RX_VLAN_STRIPPED is set, PKT_RX_VLAN must also be set.
- */
-#define PKT_RX_VLAN_STRIPPED (1ULL << 6)
-
-/**
- * Mask of bits used to determine the status of RX IP checksum.
- * - PKT_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- * - PKT_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- * - PKT_R

Re: [dpdk-dev] [PATCH v2 3/3] examples/bpf: remove duplicate mbuf definition

2019-10-02 Thread Michel Machado

Acked-by: Michel Machado 

[ ]'s
Michel Machado

On 9/27/19 9:50 AM, Konstantin Ananyev wrote:

Get rid of duplicate definitions for rte_mbuf and related macros.
Include rte_mbuf_core.h instead.

Signed-off-by: Konstantin Ananyev 
---
  examples/bpf/mbuf.h | 605 
  examples/bpf/t2.c   |   4 +-
  examples/bpf/t3.c   |   2 +-
  3 files changed, 3 insertions(+), 608 deletions(-)
  delete mode 100644 examples/bpf/mbuf.h

diff --git a/examples/bpf/mbuf.h b/examples/bpf/mbuf.h
deleted file mode 100644
index e41c3d26e..0
--- a/examples/bpf/mbuf.h
+++ /dev/null
@@ -1,605 +0,0 @@
-/* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation.
- * Copyright 2014 6WIND S.A.
- */
-
-/*
- * Snipper from dpdk.org rte_mbuf.h.
- * used to provide BPF programs information about rte_mbuf layout.
- */
-
-#ifndef _MBUF_H_
-#define _MBUF_H_
-
-#include 
-#include 
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-/*
- * Packet Offload Features Flags. It also carry packet type information.
- * Critical resources. Both rx/tx shared these bits. Be cautious on any change
- *
- * - RX flags start at bit position zero, and get added to the left of previous
- *   flags.
- * - The most-significant 3 bits are reserved for generic mbuf flags
- * - TX flags therefore start at bit position 60 (i.e. 63-3), and new flags get
- *   added to the right of the previously defined flags i.e. they should count
- *   downwards, not upwards.
- *
- * Keep these flags synchronized with rte_get_rx_ol_flag_name() and
- * rte_get_tx_ol_flag_name().
- */
-
-/**
- * RX packet is a 802.1q VLAN packet. This flag was set by PMDs when
- * the packet is recognized as a VLAN, but the behavior between PMDs
- * was not the same. This flag is kept for some time to avoid breaking
- * applications and should be replaced by PKT_RX_VLAN_STRIPPED.
- */
-#define PKT_RX_VLAN_PKT  (1ULL << 0)
-
-#define PKT_RX_RSS_HASH  (1ULL << 1)
-/**< RX packet with RSS hash result. */
-#define PKT_RX_FDIR  (1ULL << 2)
-/**< RX packet with FDIR match indicate. */
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_L4_CKSUM_MASK.
- * This flag was set when the L4 checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_L4_CKSUM_BAD  (1ULL << 3)
-
-/**
- * Deprecated.
- * Checking this flag alone is deprecated: check the 2 bits of
- * PKT_RX_IP_CKSUM_MASK.
- * This flag was set when the IP checksum of a packet was detected as
- * wrong by the hardware.
- */
-#define PKT_RX_IP_CKSUM_BAD  (1ULL << 4)
-
-#define PKT_RX_EIP_CKSUM_BAD (1ULL << 5)
-/**< External IP header checksum error. */
-
-/**
- * A vlan has been stripped by the hardware and its tci is saved in
- * mbuf->vlan_tci. This can only happen if vlan stripping is enabled
- * in the RX configuration of the PMD.
- */
-#define PKT_RX_VLAN_STRIPPED (1ULL << 6)
-
-/**
- * Mask of bits used to determine the status of RX IP checksum.
- * - PKT_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- * - PKT_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- * - PKT_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid
- * - PKT_RX_IP_CKSUM_NONE: the IP checksum is not correct in the packet
- *   data, but the integrity of the IP header is verified.
- */
-#define PKT_RX_IP_CKSUM_MASK ((1ULL << 4) | (1ULL << 7))
-
-#define PKT_RX_IP_CKSUM_UNKNOWN 0
-#define PKT_RX_IP_CKSUM_BAD (1ULL << 4)
-#define PKT_RX_IP_CKSUM_GOOD(1ULL << 7)
-#define PKT_RX_IP_CKSUM_NONE((1ULL << 4) | (1ULL << 7))
-
-/**
- * Mask of bits used to determine the status of RX L4 checksum.
- * - PKT_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
- * - PKT_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
- * - PKT_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
- * - PKT_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
- *   data, but the integrity of the L4 data is verified.
- */
-#define PKT_RX_L4_CKSUM_MASK ((1ULL << 3) | (1ULL << 8))
-
-#define PKT_RX_L4_CKSUM_UNKNOWN 0
-#define PKT_RX_L4_CKSUM_BAD (1ULL << 3)
-#define PKT_RX_L4_CKSUM_GOOD(1ULL << 8)
-#define PKT_RX_L4_CKSUM_NONE((1ULL << 3) | (1ULL << 8))
-
-#define PKT_RX_IEEE1588_PTP  (1ULL << 9)
-/**< RX IEEE1588 L2 Ethernet PT Packet. */
-#define PKT_RX_IEEE1588_TMST (1ULL << 10)
-/**< RX IEEE1588 L2/L4 timestamped packet.*/
-#define PKT_RX_FDIR_ID   (1ULL << 13)
-/**< FD id reported if FDIR match. */
-#define PKT_RX_FDIR_FLX  (1ULL << 14)
-/**< Flexible bytes reported if FDIR match. */
-
-/**
- * The 2 vlans have been stripped by the hardware and their tci are
- * saved in mbuf->vlan_tci (inner) and mbuf->vlan_tci_outer (outer).
- * This can only happen if vlan stripping is enable

Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

2020-10-15 Thread Michel Machado

On 10/14/20 7:57 PM, Honnappa Nagarahalli wrote:

On 13/10/2020 18:46, Michel Machado wrote:

On 10/13/20 11:41 AM, Medvedkin, Vladimir wrote:

Hi Michel,

Could you please describe a condition when LPM gets inconsistent?
As I can see if there is no free tbl8 it will return -ENOSPC.


     Consider this simple example: we need to add the following two
prefixes with different next hops: 10.99.0.0/16, 18.99.99.128/25. If
the LPM table is out of tbl8s, the second prefix is not added and
Gatekeeper will make decisions in violation of the policy. The data
structure of the LPM table is consistent, but its content is
inconsistent with the policy.
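
   For concreteness, a sketch of that failure mode using the standard rte_lpm
calls (the next-hop type and the RTE_IPV4 macro follow recent DPDK releases
and may differ in older ones):

#include <errno.h>
#include <stdint.h>
#include <rte_ip.h>
#include <rte_lpm.h>

/* Returns 0 when both prefixes are installed, a negative errno otherwise. */
static int
add_policy_prefixes(struct rte_lpm *lpm, uint32_t nh_a, uint32_t nh_b)
{
	int ret;

	/* A /16 only touches tbl24; no tbl8 group is needed. */
	ret = rte_lpm_add(lpm, RTE_IPV4(10, 99, 0, 0), 16, nh_a);
	if (ret < 0)
		return ret;

	/*
	 * A /25 needs a free tbl8 group. When none is left this fails with
	 * -ENOSPC: the table stays internally consistent, but it no longer
	 * reflects the intended policy.
	 */
	return rte_lpm_add(lpm, RTE_IPV4(18, 99, 99, 128), 25, nh_b);
}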

max_rules and number_tbl8s in 'struct rte_lpm' contain the config information. 
These 2 fields do not change based on the routes added and do not indicate the 
amount of space left. So, you cannot use this information to decide if there is 
enough space to add more routes.


   We are aware that those fields hold the config information, not a 
status of the LPM table.


   Before updating an LPM table that holds network prefixes derived from 
threat intelligence, we compute the minimum values for max_rules and 
number_tbl8s. Here is an example of how we do it: 
https://github.com/AltraMayor/gatekeeper/blob/95d1d6e8201861a0d0c698bfd06ad606674f1e07/lua/examples/policy.lua#L135-L166


   Once these minimum values are available, we get the parameters of 
the LPM table to be updated and check if we can update it, or have to 
recreate it.



Aha, thanks. So do I understand correctly that you need to add a set
of routes atomically (either the entire set is installed or nothing)?


     Yes.


If so, then I would suggest having 2 lpm and switching them
atomically after a successful addition. As for now, even if you have
enough tbl8's, routes are installed non atomically, i.e. there will
be a time gap between adding two routes, so in this time interval the
table will be inconsistent with the policy.
Also, if new lpm algorithms are added to the DPDK, they won't have
such a thing as tbl8.
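
(For readers of the thread: a double-buffered switch along these lines could
look roughly like the sketch below. The helper names are hypothetical, and
reclaiming the old table is left to whatever quiescence mechanism the
application already uses.)

#include <rte_lpm.h>

/* Readers always go through this single pointer. */
static struct rte_lpm *active_lpm;

static inline struct rte_lpm *
lpm_reader_get(void)
{
	return __atomic_load_n(&active_lpm, __ATOMIC_ACQUIRE);
}

/* Writer: build the replacement table on the side, then publish it in one step. */
static void
lpm_publish(struct rte_lpm *new_lpm)
{
	struct rte_lpm *old_lpm =
		__atomic_exchange_n(&active_lpm, new_lpm, __ATOMIC_ACQ_REL);

	/* Free old_lpm only after all readers have quiesced (RCU, epochs, ...). */
	(void)old_lpm;
}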


     Our code already deals with synchronization.

If the application code already deals with synchronization, is it possible to 
revert back (i.e. delete the routes that got added so far) when the addition of 
the route-set fails?


   The way the code is structured, this would require a significant 
rewrite because the code assumes that it will succeed since the capacity 
of the LPM tables was already checked.



On 13/10/2020 15:58, Michel Machado wrote:

Hi Kevin,

     We do need fields max_rules and number_tbl8s of struct
rte_lpm, so the removal would force us to have another patch to
our local copy of DPDK. We'd rather avoid this new local patch
because we wish to eventually be in sync with the stock DPDK.

     Those fields are needed in Gatekeeper because we found a
condition in an ongoing deployment in which the entries of some
LPM tables may suddenly change a lot to reflect policy changes. To
avoid getting into a state in which the LPM table is inconsistent
because it cannot fit all the new entries, we compute the needed
parameters to support the new entries, and compare with the
current parameters. If the current table doesn't fit everything,
we have to replace it with a new LPM table.

     If there were a way to obtain the struct rte_lpm_config of a
given LPM table, it would cleanly address our need. We have the
same need in IPv6 and have a local patch to work around it (see
https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f).

I do not see why such an API is not possible, we could add one API that returns 
max_rules and number_tbl8s (essentially, the config that was passed to 
rte_lpm_create API).
But, is there a possibility to store that info in the application as that data 
was passed to rte_lpm from the application?


   A suggestion for what this API could look like:

void rte_lpm_get_config(const struct rte_lpm *lpm, struct rte_lpm_config *config);
void rte_lpm6_get_config(const struct rte_lpm6 *lpm, struct rte_lpm6_config *config);
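
   To make the intended use concrete, a sketch of how such a getter could be
used, assuming the prototype above were adopted as-is (it is only a proposal
at this point, not part of DPDK):

#include <stdbool.h>
#include <rte_lpm.h>

/* Hypothetical check: can the existing table absorb the new policy? */
static bool
lpm_can_hold(const struct rte_lpm *lpm, uint32_t need_rules, uint32_t need_tbl8s)
{
	struct rte_lpm_config cfg;

	rte_lpm_get_config(lpm, &cfg);	/* proposed API, not in DPDK today */
	return need_rules <= cfg.max_rules && need_tbl8s <= cfg.number_tbl8s;
}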


   If the final choice is for not supporting a way to retrieve the 
config information on the API, we'll look for a place to keep a copy of 
the parameters in our code.


Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

2020-10-16 Thread Michel Machado

  If the final choice is for not supporting a way to retrieve the
config information on the API, we'll look for a place to keep a copy
of the parameters in our code.

IMO, this is not a performance critical path and it is not a difficult solution
to store these values in the application. My suggestion is to skip adding the API
and store the values in the application.

Vladimir, what's your opinion?


Agree. Global vars or part of a global configuration could be used here.
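
   A minimal sketch of that alternative: a hypothetical application-side
wrapper that keeps the creation parameters next to the table (rte_lpm_create()
and struct rte_lpm_config are the standard DPDK API):

#include <rte_lpm.h>

struct app_lpm {
	struct rte_lpm *lpm;
	struct rte_lpm_config config;	/* copy of what was passed at creation */
};

static struct rte_lpm *
app_lpm_create(struct app_lpm *a, const char *name, int socket_id,
		const struct rte_lpm_config *config)
{
	a->config = *config;	/* remember max_rules/number_tbl8s ourselves */
	a->lpm = rte_lpm_create(name, socket_id, config);
	return a->lpm;
}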

Thank you. I think we are fine to go ahead with merging this patch.



Michel, I know it's a bit of churn so not zero effort, but is that
solution workable for you? The change in DPDK would not be backported,
so gatekeeper code would only need an update on updating to use DPDK 20.11.


   It's workable.

   Thank you for organizing this discussion, Kevin.


P.S. As you are using DPDK 19.08, we should talk about DPDK LTS but
let's leave that for another thread :-)


   Sure.


Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

2020-10-13 Thread Michel Machado

On 10/13/20 11:41 AM, Medvedkin, Vladimir wrote:

Hi Michel,

Could you please describe a condition when LPM gets inconsistent? As I 
can see if there is no free tbl8 it will return -ENOSPC.


   Consider this simple example: we need to add the following two 
prefixes with different next hops: 10.99.0.0/16, 18.99.99.128/25. If the 
LPM table is out of tbl8s, the second prefix is not added and Gatekeeper 
will make decisions in violation of the policy. The data structure of 
the LPM table is consistent, but its content is inconsistent with the policy.


   We minimize the need to replace an LPM table by allocating LPM 
tables with double the capacity we need (see example here 
https://github.com/AltraMayor/gatekeeper/blob/95d1d6e8201861a0d0c698bfd06ad606674f1e07/lua/examples/policy.lua#L172-L183), 
but the code must be ready for unexpected needs that may arise in 
production.




On 13/10/2020 15:58, Michel Machado wrote:

Hi Kevin,

    We do need fields max_rules and number_tbl8s of struct rte_lpm, so 
the removal would force us to have another patch to our local copy of 
DPDK. We'd rather avoid this new local patch because we wish to 
eventually be in sync with the stock DPDK.


    Those fields are needed in Gatekeeper because we found a condition 
in an ongoing deployment in which the entries of some LPM tables may 
suddenly change a lot to reflect policy changes. To avoid getting into 
a state in which the LPM table is inconsistent because it cannot fit 
all the new entries, we compute the needed parameters to support the 
new entries, and compare with the current parameters. If the current 
table doesn't fit everything, we have to replace it with a new LPM table.


    If there were a way to obtain the struct rte_lpm_config of a given 
LPM table, it would cleanly address our need. We have the same need in 
IPv6 and have a local patch to work around it (see 
https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). 
Thus, an IPv4 and IPv6 solution would be best.


    PS: I've added Qiaobin Fu, another Gatekeeper maintainer, to this 
discussion.


[ ]'s
Michel Machado

On 10/13/20 9:53 AM, Kevin Traynor wrote:

Hi Gatekeeper maintainers (I think),

fyi - there is a proposal to remove some members of a struct in DPDK LPM
API that Gatekeeper is using [1]. It would be only from DPDK 20.11 but
as it's an LTS I guess it would probably hit Debian in a few months.

The full thread is here:
http://inbox.dpdk.org/dev/20200907081518.46350-1-ruifeng.w...@arm.com/

Maybe you can take a look and tell us if they are needed in Gatekeeper
or if you can work around it?

thanks,
Kevin.

[1]
https://github.com/AltraMayor/gatekeeper/blob/master/gt/lua_lpm.c#L235-L248 



On 09/10/2020 07:54, Ruifeng Wang wrote:



-Original Message-
From: Kevin Traynor 
Sent: Wednesday, September 30, 2020 4:46 PM
To: Ruifeng Wang ; Medvedkin, Vladimir
; Bruce Richardson

Cc: dev@dpdk.org; Honnappa Nagarahalli
; nd 
Subject: Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

On 16/09/2020 04:17, Ruifeng Wang wrote:



-Original Message-
From: Medvedkin, Vladimir 
Sent: Wednesday, September 16, 2020 12:28 AM
To: Bruce Richardson ; Ruifeng Wang

Cc: dev@dpdk.org; Honnappa Nagarahalli
; nd 
Subject: Re: [PATCH 2/2] lpm: hide internal data

Hi Ruifeng,

On 15/09/2020 17:02, Bruce Richardson wrote:

On Mon, Sep 07, 2020 at 04:15:17PM +0800, Ruifeng Wang wrote:

Fields except tbl24 and tbl8 in rte_lpm structure have no need to
be exposed to the user.
Hide the unneeded exposure of structure fields for better ABI
maintainability.

Suggested-by: David Marchand 
Signed-off-by: Ruifeng Wang 
Reviewed-by: Phil Yang 
---
   lib/librte_lpm/rte_lpm.c | 152 +++---
   lib/librte_lpm/rte_lpm.h |   7 --
   2 files changed, 91 insertions(+), 68 deletions(-)




diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
index 03da2d37e..112d96f37 100644
--- a/lib/librte_lpm/rte_lpm.h
+++ b/lib/librte_lpm/rte_lpm.h
@@ -132,17 +132,10 @@ struct rte_lpm_rule_info {

   /** @internal LPM structure. */
   struct rte_lpm {
-    /* LPM metadata. */
-    char name[RTE_LPM_NAMESIZE];    /**< Name of the lpm. */
-    uint32_t max_rules; /**< Max. balanced rules per lpm. */
-    uint32_t number_tbl8s; /**< Number of tbl8s. */
-    struct rte_lpm_rule_info rule_info[RTE_LPM_MAX_DEPTH]; /**< Rule info table. */
-
   /* LPM Tables. */
   struct rte_lpm_tbl_entry tbl24[RTE_LPM_TBL24_NUM_ENTRIES]
   __rte_cache_aligned; /**< LPM tbl24 table. */
   struct rte_lpm_tbl_entry *tbl8; /**< LPM tbl8 table. */
-    struct rte_lpm_rule *rules_tbl; /**< LPM rules. */
   };



Since this changes the ABI, does it not need advance notice?

[Basically the return value pointer from rte_lpm_create() will be
different, and that return value could be used by rte_lpm_lookup()
which as a static inline function will be in the binary and u

Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

2020-10-13 Thread Michel Machado

Hi Kevin,

   We do need fields max_rules and number_tbl8s of struct rte_lpm, so 
the removal would force us to have another patch to our local copy of 
DPDK. We'd rather avoid this new local patch because we wish to 
eventually be in sync with the stock DPDK.


   Those fields are needed in Gatekeeper because we found a condition 
in an ongoing deployment in which the entries of some LPM tables may 
suddenly change a lot to reflect policy changes. To avoid getting into a 
state in which the LPM table is inconsistent because it cannot fit all 
the new entries, we compute the needed parameters to support the new 
entries, and compare with the current parameters. If the current table 
doesn't fit everything, we have to replace it with a new LPM table.


   If there were a way to obtain the struct rte_lpm_config of a given 
LPM table, it would cleanly address our need. We have the same need in 
IPv6 and have a local patch to work around it (see 
https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). 
Thus, an IPv4 and IPv6 solution would be best.


   PS: I've added Qiaobin Fu, another Gatekeeper maintainer, to this 
discussion.


[ ]'s
Michel Machado

On 10/13/20 9:53 AM, Kevin Traynor wrote:

Hi Gatekeeper maintainers (I think),

fyi - there is a proposal to remove some members of a struct in DPDK LPM
API that Gatekeeper is using [1]. It would be only from DPDK 20.11 but
as it's an LTS I guess it would probably hit Debian in a few months.

The full thread is here:
http://inbox.dpdk.org/dev/20200907081518.46350-1-ruifeng.w...@arm.com/

Maybe you can take a look and tell us if they are needed in Gatekeeper
or if you can work around it?

thanks,
Kevin.

[1]
https://github.com/AltraMayor/gatekeeper/blob/master/gt/lua_lpm.c#L235-L248

On 09/10/2020 07:54, Ruifeng Wang wrote:



-Original Message-
From: Kevin Traynor 
Sent: Wednesday, September 30, 2020 4:46 PM
To: Ruifeng Wang ; Medvedkin, Vladimir
; Bruce Richardson

Cc: dev@dpdk.org; Honnappa Nagarahalli
; nd 
Subject: Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

On 16/09/2020 04:17, Ruifeng Wang wrote:



-Original Message-
From: Medvedkin, Vladimir 
Sent: Wednesday, September 16, 2020 12:28 AM
To: Bruce Richardson ; Ruifeng Wang

Cc: dev@dpdk.org; Honnappa Nagarahalli
; nd 
Subject: Re: [PATCH 2/2] lpm: hide internal data

Hi Ruifeng,

On 15/09/2020 17:02, Bruce Richardson wrote:

On Mon, Sep 07, 2020 at 04:15:17PM +0800, Ruifeng Wang wrote:

Fields except tbl24 and tbl8 in rte_lpm structure have no need to
be exposed to the user.
Hide the unneeded exposure of structure fields for better ABI
maintainability.

Suggested-by: David Marchand 
Signed-off-by: Ruifeng Wang 
Reviewed-by: Phil Yang 
---
   lib/librte_lpm/rte_lpm.c | 152 +++---
   lib/librte_lpm/rte_lpm.h |   7 --
   2 files changed, 91 insertions(+), 68 deletions(-)




diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
index 03da2d37e..112d96f37 100644
--- a/lib/librte_lpm/rte_lpm.h
+++ b/lib/librte_lpm/rte_lpm.h
@@ -132,17 +132,10 @@ struct rte_lpm_rule_info {

   /** @internal LPM structure. */
   struct rte_lpm {
-   /* LPM metadata. */
-   char name[RTE_LPM_NAMESIZE];/**< Name of the lpm. */
-   uint32_t max_rules; /**< Max. balanced rules per lpm. */
-   uint32_t number_tbl8s; /**< Number of tbl8s. */
-   struct rte_lpm_rule_info rule_info[RTE_LPM_MAX_DEPTH]; /**< Rule info table. */
-
/* LPM Tables. */
struct rte_lpm_tbl_entry tbl24[RTE_LPM_TBL24_NUM_ENTRIES]
__rte_cache_aligned; /**< LPM tbl24 table. */
struct rte_lpm_tbl_entry *tbl8; /**< LPM tbl8 table. */
-   struct rte_lpm_rule *rules_tbl; /**< LPM rules. */
   };



Since this changes the ABI, does it not need advance notice?

[Basically the return value pointer from rte_lpm_create() will be
different, and that return value could be used by rte_lpm_lookup()
which as a static inline function will be in the binary and using
the old structure offsets.]



Agree with Bruce, this patch breaks ABI, so it can't be accepted
without prior notice.


So if the change wants to happen in 20.11, a deprecation notice should
have been added in 20.08.
I should have added a deprecation notice. This change will have to wait for
the next ABI update window.




Do you plan to extend? or is this just speculative?

It is speculative.



A quick scan shows there seem to be several projects using some of these
members that you are proposing to hide, e.g. BESS, NFF-Go, DPVS,
gatekeeper. I didn't look at the details to see if they are really needed.

Not sure how much notice they'd need or if they update DPDK much, but I
think it's worth having a closer look at how they use lpm and what the
impact on them is.

Checked the projects listed above. BESS, NFF-Go and DPVS don't access the 
members to be hidden.
They 

Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

2020-10-13 Thread Michel Machado

On 10/13/20 3:06 PM, Medvedkin, Vladimir wrote:



On 13/10/2020 18:46, Michel Machado wrote:

On 10/13/20 11:41 AM, Medvedkin, Vladimir wrote:

Hi Michel,

Could you please describe a condition when LPM gets inconsistent? As 
I can see if there is no free tbl8 it will return -ENOSPC.


    Consider this simple example: we need to add the following two 
prefixes with different next hops: 10.99.0.0/16, 18.99.99.128/25. If 
the LPM table is out of tbl8s, the second prefix is not added and 
Gatekeeper will make decisions in violation of the policy. The data 
structure of the LPM table is consistent, but its content is 
inconsistent with the policy.


Aha, thanks. So do I understand correctly that you need to add a set of 
routes atomically (either the entire set is installed or nothing)?


   Yes.

If so, then I would suggest having 2 lpm and switching them atomically 
after a successful addition. As for now, even if you have enough tbl8's, 
routes are installed non atomically, i.e. there will be a time gap 
between adding two routes, so in this time interval the table will be 
inconsistent with the policy.
Also, if new lpm algorithms are added to the DPDK, they won't have such 
a thing as tbl8.


   Our code already deals with synchronization.

    We minimize the need to replace an LPM table by allocating LPM 
tables with double the capacity we need (see example here 
https://github.com/AltraMayor/gatekeeper/blob/95d1d6e8201861a0d0c698bfd06ad606674f1e07/lua/examples/policy.lua#L172-L183), 
but the code must be ready for unexpected needs that may arise in 
production.




Usually, the table is initialized with a number of entries large enough 
to hold the expected number of routes. One tbl8 group takes up 1 KB of 
memory, which is negligible compared to the size of tbl24, which is 64 MB.
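
(For scale, assuming 4-byte table entries as in the current rte_lpm: tbl24 has 
2^24 entries, i.e. 2^24 * 4 B = 64 MB, while one tbl8 group has 2^8 entries, 
i.e. 2^8 * 4 B = 1 KB.)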


   When the prefixes come from BGP, initializing a large enough table 
is fine. But when prefixes come from threat intelligence, the number of 
prefixes can vary wildly, and prefixes longer than 24 bits are 
far more common.


P.S. consider using rte_fib library, it has a number of advantages over 
LPM. You can replace the loop in __lookup_fib_bulk() with a bulk lookup 
call and this will probably increase the speed.
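
   For reference, a rough sketch of such a bulk lookup (rte_fib API as
introduced around DPDK 19.11; the exact signature should be checked against
the installed headers):

#include <stdint.h>
#include <rte_fib.h>

/*
 * One bulk call instead of a per-packet lookup loop:
 * fills next_hops[i] for each dst_ips[i].
 */
static int
lookup_burst(struct rte_fib *fib, uint32_t *dst_ips, uint64_t *next_hops, int n)
{
	return rte_fib_lookup_bulk(fib, dst_ips, next_hops, n);
}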


   I'm not aware of the rte_fib library. The only documentation that I 
found on Google was https://doc.dpdk.org/api/rte__fib_8h.html and it 
just says "FIB (Forwarding information base) implementation for IPv4 
Longest Prefix Match".




On 13/10/2020 15:58, Michel Machado wrote:

Hi Kevin,

    We do need fields max_rules and number_tbl8s of struct rte_lpm, 
so the removal would force us to have another patch to our local 
copy of DPDK. We'd rather avoid this new local patch because we wish 
to eventually be in sync with the stock DPDK.


    Those fields are needed in Gatekeeper because we found a 
condition in an ongoing deployment in which the entries of some LPM 
tables may suddenly change a lot to reflect policy changes. To avoid 
getting into a state in which the LPM table is inconsistent because 
it cannot fit all the new entries, we compute the needed parameters 
to support the new entries, and compare with the current parameters. 
If the current table doesn't fit everything, we have to replace it 
with a new LPM table.


    If there were a way to obtain the struct rte_lpm_config of a 
given LPM table, it would cleanly address our need. We have the same 
need in IPv6 and have a local patch to work around it (see 
https://github.com/cjdoucette/dpdk/commit/3eaf124a781349b8ec8cd880db26a78115cb8c8f). 
Thus, an IPv4 and IPv6 solution would be best.


    PS: I've added Qiaobin Fu, another Gatekeeper maintainer, to 
this discussion.


[ ]'s
Michel Machado

On 10/13/20 9:53 AM, Kevin Traynor wrote:

Hi Gatekeeper maintainers (I think),

fyi - there is a proposal to remove some members of a struct in 
DPDK LPM

API that Gatekeeper is using [1]. It would be only from DPDK 20.11 but
as it's an LTS I guess it would probably hit Debian in a few months.

The full thread is here:
http://inbox.dpdk.org/dev/20200907081518.46350-1-ruifeng.w...@arm.com/

Maybe you can take a look and tell us if they are needed in Gatekeeper
or if you can work around it?

thanks,
Kevin.

[1]
https://github.com/AltraMayor/gatekeeper/blob/master/gt/lua_lpm.c#L235-L248 



On 09/10/2020 07:54, Ruifeng Wang wrote:



-Original Message-
From: Kevin Traynor 
Sent: Wednesday, September 30, 2020 4:46 PM
To: Ruifeng Wang ; Medvedkin, Vladimir
; Bruce Richardson

Cc: dev@dpdk.org; Honnappa Nagarahalli
; nd 
Subject: Re: [dpdk-dev] [PATCH 2/2] lpm: hide internal data

On 16/09/2020 04:17, Ruifeng Wang wrote:



-Original Message-
From: Medvedkin, Vladimir 
Sent: Wednesday, September 16, 2020 12:28 AM
To: Bruce Richardson ; Ruifeng Wang

Cc: dev@dpdk.org; Honnappa Nagarahalli
; nd 
Subject: Re: [PATCH 2/2] lpm: hide internal data

Hi Ruifeng,

On 15/09/2020 17:02, Bruce Ric