Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/rte_acl_osdep_alone.h | 47 ++--
1 file changed, 45 insertions(+), 2 deletions(-)
diff --git a/lib/librte_acl/rte_acl_osdep_alone.h
b/lib/librte_acl/rte_acl_osdep_alone.h
index a84b6f9..58c4f6a 100644
--- a/lib
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl.h | 39 ++-
lib/librte_acl/acl_run.h | 1 -
2 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 61b849a..e65e079 100644
--- a/lib/librte_
If we don't do any trie splitting at the build phase,
then the temporary build structures and the resulting RT structure might be
much bigger than the current ones.
On the other side, having just one trie instead of multiple tries can speed up
search quite significantly.
From my measurements on rule-sets with ~10K rules:
R
Vector code reorganisation/deduplication:
To avoid maintaining two nearly identical implementations of calc_addr()
(one for SSE, another for AVX2), replace it with a new macro that suits
both SSE and AVX2 code-paths.
Also remove the MM_* macros, which are no longer needed.
Signed-off-by: Konstantin Ananyev
-
Reorganise the SSE code-path a bit by moving the lo/hi dwords shuffle
out of calc_addr().
That makes calc_addr() for SSE and AVX2 practically identical
and opens an opportunity for further code deduplication.
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl_run_sse.h | 38 +++
Previous improvements made the scalar method the fastest one
for a tiny bunch of packets (< 4).
That allows us to remove the vector code-path specific to small numbers of packets
(search_sse_2)
and always use the scalar method in such cases.
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl_run_avx2.c
In test-acl, replace the command-line option "--scalar" with a new one:
"--alg=scalar|sse|avx2".
This allows the user to manually select the preferred classify() method.
Signed-off-by: Konstantin Ananyev
---
app/test-acl/main.c | 93 ++---
1 file changed, 75 insertions(+
Introduce new classify() method that uses AVX2 instructions.
From my measurements:
on HSW boards, when processing >= 16 packets per call,
the AVX2 method outperforms its SSE counterpart by 10-25%
(depending on the ruleset).
At runtime, this method is selected as the default one on HW that supports AVX2.
Add a new data type to manipulate 256-bit AVX values.
Rename a field in rte_xmm to keep naming consistent across SSE/AVX fields.
Signed-off-by: Konstantin Ananyev
---
examples/l3fwd/main.c | 2 +-
lib/librte_acl/acl_run_sse.c| 88 -
Move the common check for input parameters up into rte_acl_classify_alg().
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl_run_scalar.c | 4
lib/librte_acl/acl_run_sse.c| 4
lib/librte_acl/rte_acl.c| 19 ---
3 files changed, 12 insertions(+), 15 delet
Make classify_scalar behave in the same way as its vector counterpart:
move the match check out of the inner loop, etc.
That makes the scalar and vector code look more alike.
Plus it improves scalar code performance.
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl_run_scalar.c | 23
Right now we allocate indexes for all types of nodes, except MATCH,
at the 'gen final RT table' stage.
For MATCH type nodes we do it at the 'build temporary tree' stage.
This is totally unnecessary and makes the code more complex and error prone.
Rework the code and make MATCH indexes be allocated a
Introduce a division of the whole 256 child transition entries
into 4 sub-groups (64 kids per group).
That way, 2 groups within the same node with identical children
can use one set of transition entries.
That allows compacting some DFA nodes and saves space in the RT table,
without any negative performance impact.
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl_bld.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 8bf4a54..22f7934 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1907,7 +1907,7 @@
The current rule-wildness based heuristic can cause unnecessary splits of
the ruleset.
That might have a negative performance effect:
more tries to traverse, bigger RT tables.
After removing it, I observed a ~50% speedup on some test-cases with big
rulesets (~10K rules).
No difference for smaller rulesets.
Signed-off-by: Konstantin Ananyev
Make data_indexes long enough to survive idle transitions.
That allows simplifying the match processing code.
Also fix incorrect size calculations for the data indexes.
Signed-off-by: Konstantin Ananyev
---
lib/librte_acl/acl_bld.c | 5 +++--
lib/librte_acl/acl_run.h | 4
2 files changed, 3 inserti
Make sure that test_acl does not ignore error conditions.
Run classify() with all possible values.
Signed-off-by: Konstantin Ananyev
---
app/test/test_acl.c | 8 ++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 356d620..7119a
This patch series contains several fixes and enhancements for the ACL library.
See the complete list below.
Two main changes that are externally visible:
- Introduce a new classify method: RTE_ACL_CLASSIFY_AVX2.
It uses AVX2 instructions and 256-bit wide data types
to perform internal trie traversal.
That he