[dpdk-dev] [PATCH 17/17] librte_acl: fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y.

2014-12-14 Thread Konstantin Ananyev
Signed-off-by: Konstantin Ananyev --- lib/librte_acl/rte_acl_osdep_alone.h | 47 ++-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/lib/librte_acl/rte_acl_osdep_alone.h b/lib/librte_acl/rte_acl_osdep_alone.h index a84b6f9..58c4f6a 100644 --- a/lib

[dpdk-dev] [PATCH 16/17] librte_acl: remove unused macros.

2014-12-14 Thread Konstantin Ananyev
Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl.h | 39 ++- lib/librte_acl/acl_run.h | 1 - 2 files changed, 38 insertions(+), 2 deletions(-) diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h index 61b849a..e65e079 100644 --- a/lib/librte_

[dpdk-dev] [PATCH 15/17] librte_acl: introduce max_size into rte_acl_config.

2014-12-14 Thread Konstantin Ananyev
If we don't do any trie splitting at the build phase, the temporary build structures and the resulting RT structure might be much bigger than they are currently. On the other hand, having just one trie instead of multiple can speed up the search quite significantly. From my measurements on rule-sets with ~10K rules: R

[dpdk-dev] [PATCH 14/17] librte_acl: make calc_addr a define to deduplicate the code.

2014-12-14 Thread Konstantin Ananyev
Vector code reorganisation/deduplication: To avoid maintaining two nearly identical implementations of calc_addr() (one for SSE, another for AVX2), replace it with a new macro that suits both the SSE and AVX2 code-paths. Also remove the MM_* macros that are no longer needed. Signed-off-by: Konstantin Ananyev -

[dpdk-dev] [PATCH 13/17] librte_acl: move lo/hi dwords shuffle out from calc_addr

2014-12-14 Thread Konstantin Ananyev
Reorganise the SSE code-path a bit by moving the lo/hi dwords shuffle out of calc_addr(). That makes calc_addr() for SSE and AVX2 practically identical and opens an opportunity for further code deduplication. Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_run_sse.h | 38 +++

[dpdk-dev] [PATCH 12/17] librte_acl: Remove search_sse_2 and relatives.

2014-12-14 Thread Konstantin Ananyev
Previous improvements made the scalar method the fastest one for tiny bursts of packets (< 4). That allows us to remove the specific vector code-path for a small number of packets (search_sse_2) and always use the scalar method in such cases. Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_run_avx2.c

[dpdk-dev] [PATCH 11/17] test-acl: add ability to manually select RT method.

2014-12-14 Thread Konstantin Ananyev
In test-acl, replace the command-line option "--scalar" with a new one: "--alg=scalar|sse|avx2". This allows the user to manually select the preferred classify() method. Signed-off-by: Konstantin Ananyev --- app/test-acl/main.c | 93 ++--- 1 file changed, 75 insertions(+
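As an illustration of the option described above, here is a minimal sketch of how a "--alg=NAME" argument could be mapped to a classify method. The enum, table and function names are hypothetical stand-ins, not the identifiers used in app/test-acl/main.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the library's classify method identifiers. */
enum classify_alg { ALG_SCALAR, ALG_SSE, ALG_AVX2, ALG_INVALID };

struct alg_name {
	const char *name;
	enum classify_alg alg;
};

/* Names accepted after "--alg=", as listed in the patch description. */
static const struct alg_name alg_names[] = {
	{ "scalar", ALG_SCALAR },
	{ "sse", ALG_SSE },
	{ "avx2", ALG_AVX2 },
};

/* Return the method selected by "--alg=NAME", or ALG_INVALID. */
static enum classify_alg
parse_alg_option(const char *arg)
{
	static const char prefix[] = "--alg=";
	const size_t plen = sizeof(prefix) - 1;
	size_t i;

	if (arg == NULL || strncmp(arg, prefix, plen) != 0)
		return ALG_INVALID;

	for (i = 0; i != sizeof(alg_names) / sizeof(alg_names[0]); i++) {
		if (strcmp(arg + plen, alg_names[i].name) == 0)
			return alg_names[i].alg;
	}
	return ALG_INVALID;
}
```

A table-driven lookup keeps adding a further method to a one-line change; unknown names and the removed "--scalar" spelling both fall through to ALG_INVALID.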

[dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method

2014-12-14 Thread Konstantin Ananyev
Introduce a new classify() method that uses AVX2 instructions. From my measurements: on HSW boards, when processing >= 16 packets per call, the AVX2 method outperforms its SSE counterpart by 10-25% (depending on the ruleset). At runtime, this method is selected as the default one on HW that supports AVX2.

[dpdk-dev] [PATCH 09/17] EAL: introduce rte_ymm and relatives in rte_common_vect.h.

2014-12-14 Thread Konstantin Ananyev
New data type to manipulate 256-bit AVX values. Rename a field in rte_xmm to keep common naming across SSE/AVX fields. Signed-off-by: Konstantin Ananyev --- examples/l3fwd/main.c | 2 +- lib/librte_acl/acl_run_sse.c| 88 -

[dpdk-dev] [PATCH 08/17] librte_acl: a bit of RT code deduplication.

2014-12-14 Thread Konstantin Ananyev
Move common check for input parameters up into rte_acl_classify_alg(). Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_run_scalar.c | 4 lib/librte_acl/acl_run_sse.c| 4 lib/librte_acl/rte_acl.c| 19 --- 3 files changed, 12 insertions(+), 15 delet
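The idea of hoisting a shared parameter check into the common entry point can be sketched as follows. This is an assumption-laden illustration: the function-pointer type, the MAX_CATEGORIES limit and the exact validity rule are hypothetical and do not reproduce the check in rte_acl.c.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on result categories. */
#define MAX_CATEGORIES 16

/* Hypothetical signature shared by all per-method classify routines. */
typedef int (*classify_fn)(const uint8_t **data, uint32_t *results,
	uint32_t num, uint32_t categories);

/* Stand-in for a real scalar/SSE/AVX2 routine; does no work. */
static int
noop_classify(const uint8_t **data, uint32_t *results,
	uint32_t num, uint32_t categories)
{
	(void)data; (void)results; (void)num; (void)categories;
	return 0;
}

/*
 * Common entry point: validate the input once here, so that the
 * individual per-method routines need no copies of the same check.
 */
static int
classify_checked(classify_fn fn, const uint8_t **data, uint32_t *results,
	uint32_t num, uint32_t categories)
{
	if (fn == NULL || categories == 0 || categories > MAX_CATEGORIES)
		return -EINVAL;
	return fn(data, results, num, categories);
}
```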

[dpdk-dev] [PATCH 07/17] librte_acl: make scalar RT code to be more similar to vector one.

2014-12-14 Thread Konstantin Ananyev
Make classify_scalar behave in the same way as its vector counterpart: move the match check out of the inner loop, etc. That makes the scalar and vector code look more alike. It also improves scalar code performance. Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_run_scalar.c | 23

[dpdk-dev] [PATCH 06/17] librte_acl: build/gen phase - simplify the way match nodes are allocated.

2014-12-14 Thread Konstantin Ananyev
Right now we allocate indexes for all types of nodes, except MATCH, at the 'gen final RT table' stage. For MATCH type nodes we do it at the temporary tree building stage. This is totally unnecessary and makes the code more complex and error-prone. Rework the code and make MATCH indexes being allocated a

[dpdk-dev] [PATCH 05/17] librte_acl: introduce DFA nodes compression (group64) for identical entries.

2014-12-14 Thread Konstantin Ananyev
Introduce division of all 256 child transition entries into 4 sub-groups (64 kids per group). So 2 groups within the same node with identical children can use one set of transition entries. That allows compacting some DFA nodes and getting space savings in the RT table, without any negative performa
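The grouping described above can be sketched as follows: a node's 256 transitions are split into 4 groups of 64, and a group whose contents match an earlier group reuses that group's storage slot. The names and the flat uint32_t layout are assumptions made for illustration; the actual RT table encoding in librte_acl is not shown here.

```c
#include <stdint.h>
#include <string.h>

#define NODE_FANOUT 256
#define GROUP_SIZE  64
#define NUM_GROUPS  (NODE_FANOUT / GROUP_SIZE)

/*
 * For each of the 4 groups of a node, record in group_idx[] the slot
 * of the stored group that holds its transitions. Identical groups
 * share one slot. Returns the number of distinct groups to store.
 */
static unsigned
dedup_groups(const uint32_t trans[NODE_FANOUT], uint8_t group_idx[NUM_GROUPS])
{
	unsigned first[NUM_GROUPS]; /* first[s]: original group kept in slot s */
	unsigned g, s, stored = 0;

	for (g = 0; g != NUM_GROUPS; g++) {
		/* Look for an already stored group with the same children. */
		for (s = 0; s != stored; s++) {
			if (memcmp(trans + g * GROUP_SIZE,
					trans + first[s] * GROUP_SIZE,
					GROUP_SIZE * sizeof(trans[0])) == 0)
				break;
		}
		if (s == stored)
			first[stored++] = g; /* no match: keep a new slot */
		group_idx[g] = (uint8_t)s;
	}
	return stored;
}
```

With this scheme a node whose four groups are all identical needs only 64 stored entries instead of 256, and a lookup for input byte b would go through group_idx[b / 64] before indexing with b % 64.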

[dpdk-dev] [PATCH 04/17] librte_acl: fix a bug at build phase that can cause matches being overwritten.

2014-12-14 Thread Konstantin Ananyev
Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_bld.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c index 8bf4a54..22f7934 100644 --- a/lib/librte_acl/acl_bld.c +++ b/lib/librte_acl/acl_bld.c @@ -1907,7 +1907,7 @@

[dpdk-dev] [PATCH 03/17] librte_acl: remove build phase heuristic with negative performance effect.

2014-12-14 Thread Konstantin Ananyev
The current rule-wildness based heuristics can cause unnecessary splits of the ruleset. That might have a negative performance effect: more tries to traverse, bigger RT tables. After removing it, a ~50% speedup was observed on some test cases with big rulesets (~10K). No difference for smaller rulesets. Sign

[dpdk-dev] [PATCH 02/17] librte_acl: make data_indexes long enough to survive idle transitions.

2014-12-14 Thread Konstantin Ananyev
Make data_indexes long enough to survive idle transitions. That allows simplifying the match processing code. Also fix incorrect size calculations for data indexes. Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_bld.c | 5 +++-- lib/librte_acl/acl_run.h | 4 2 files changed, 3 inserti

[dpdk-dev] [PATCH 01/17] app/test: few small fixes for test_acl.c

2014-12-14 Thread Konstantin Ananyev
Make sure that test_acl would not ignore error conditions. Run classify() with all possible values. Signed-off-by: Konstantin Ananyev --- app/test/test_acl.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/app/test/test_acl.c b/app/test/test_acl.c index 356d620..7119a

[dpdk-dev] [PATCH 00/17] ACL: New AVX2 classify method and several other enhancements.

2014-12-14 Thread Konstantin Ananyev
This patch series contains several fixes and enhancements for the ACL library. See the complete list below. Two main changes that are externally visible: - Introduce a new classify method: RTE_ACL_CLASSIFY_AVX2. It uses AVX2 instructions and 256-bit wide data types to perform internal trie traversal. That he