dequeue

Jia He Thu, 12 Oct 2017 20:24:03 -0700

Hi Jerin


On 10/13/2017 9:49 AM, Jerin Jacob Wrote:

-----Original Message-----

Date: Fri, 13 Oct 2017 09:16:31 +0800
From: Jia He <[email protected]>
To: Jerin Jacob <[email protected]>, "Ananyev, Konstantin"
  <[email protected]>
Cc: Olivier MATZ <[email protected]>, "[email protected]" <[email protected]>,
  "[email protected]" <[email protected]>,
  "[email protected]" <[email protected]>,
  "[email protected]" <[email protected]>
Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
  doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.3.0

Hi


On 10/13/2017 9:02 AM, Jia He Wrote:

Hi Jerin


On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-----Original Message-----

Date: Thu, 12 Oct 2017 17:05:50 +0000

[...]

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any
of the
arm/ppc target? arm64 will have performance impact with this change.

sorry, miss one important information
Our platform is an aarch64 server with 46 cpus.

Is this an OOO(Out of order execution) aarch64 CPU implementation?

I think so, it is a server cpu (ARMv8-A), but do you know how to confirm it?
cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 40.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
CPU implementer : 0x51
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x800
CPU revision    : 0

If we reduced the involved cpu numbers, the bug occurred less frequently.

Yes, mb barrier impact the performance, but correctness is more important,
isn't it ;-)

Yes.

Maybe we can  find any other lightweight barrier here?

Yes, Regarding the lightweight barrier, arm64 has native support for acquire 
and release
semantics, which is exposed through gcc as architecture agnostic
functions.
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
http://preshing.com/20130922/acquire-and-release-fences/

Good to know,
1) How much overhead this patch in your platform? Just relative
numbers are enough

I create a *standalone* test case for test_mbuf
Attached the debug patch

It is hard to believe but the truth is that the performance after addingrmb barrier

is better than no adding.

With this patch (4 times running)
time ./test_good --no-huge -l 1-20
real    0m23.311s
user    7m21.870s
sys     0m0.021s

time ./test_bad --no-huge -l 1-20
Without this patch
real    0m38.972s
user    12m35.271s
sys     0m0.030s

Cheers,
Jia

2) As a prototype, Is Changing to acquire and release schematics
reduces the overhead in your platform?

Reference FreeBSD ring/DPDK style ring implementation through acquire
and release schematics
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

I will also spend on cycles on this.

Cheers,
Jia

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia

Konstantin

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..23168e7 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -517,6 +517,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
                n = max;
 
                *old_head = r->cons.head;
+               rte_smp_rmb();
                const uint32_t prod_tail = r->prod.tail;
                /* The subtraction is done between two unsigned 32bits value
                 * (the result is always modulo 32 bits even if we have
diff --git a/test/test/test.c b/test/test/test.c
index 9accbd1..6910b06 100644
--- a/test/test/test.c
+++ b/test/test/test.c
@@ -61,6 +61,7 @@ extern cmdline_parse_ctx_t main_ctx[];
 
 #include "test.h"
 
+extern int test_mbuf(void);
 #define RTE_LOGTYPE_APP RTE_LOGTYPE_USER1
 
 const char *prgname; /* to be set to argv[0] */
@@ -135,6 +136,9 @@ main(int argc, char **argv)
                RTE_LOG(INFO, APP,
                                "HPET is not enabled, using TSC as default 
timer\n");
 
+       test_mbuf();
+       printf("test_mbuf done\n");
+       return 0;
 
 #ifdef RTE_LIBRTE_CMDLINE
        cl = cmdline_stdin_new(main_ctx, "RTE>>");
diff --git a/test/test/test_mbuf.c b/test/test/test_mbuf.c
index 3396b4a..7277260 100644
--- a/test/test/test_mbuf.c
+++ b/test/test/test_mbuf.c
@@ -59,7 +59,6 @@
 #include <rte_cycles.h>
 
 #include "test.h"
-
 #define MBUF_DATA_SIZE          2048
 #define NB_MBUF                 128
 #define MBUF_TEST_DATA_LEN      1464
@@ -85,7 +84,7 @@
 
 static volatile uint32_t refcnt_stop_slaves;
 static unsigned refcnt_lcore[RTE_MAX_LCORE];
-
+int test_mbuf(void);
 #endif
 
 /*
@@ -718,7 +717,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
        for (i = 0, n = rte_mempool_avail_count(refcnt_pool);
            i != n && (m = rte_pktmbuf_alloc(refcnt_pool)) != NULL;
            i++) {
-               ref = RTE_MAX(rte_rand() % REFCNT_MAX_REF, 1UL);
+               ref = REFCNT_MAX_REF;
                tref += ref;
                if ((ref & 1) != 0) {
                        rte_pktmbuf_refcnt_update(m, ref);
@@ -745,14 +744,17 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
        for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
                if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {
                        refcnt_lcore[lcore] += tref;
-                       printf("%s(lcore=%u, iter=%u) completed, "
+                       /*printf("%s(lcore=%u, iter=%u) completed, "
                            "%u references processed\n",
-                           __func__, lcore, iter, tref);
+                           __func__, lcore, iter, tref);*/
                        return;
                }
                rte_delay_ms(100);
        }
 
+        rte_mempool_dump(stdout, refcnt_pool);
+       rte_ring_dump(stdout, refcnt_pool->pool_data);
+        rte_ring_dump(stdout, refcnt_mbuf_ring);
        rte_panic("(lcore=%u, iter=%u): after %us only "
                  "%u of %u mbufs left free\n", lcore, iter, wn, i, n);
 }
@@ -766,7 +768,7 @@ test_refcnt_master(struct rte_mempool *refcnt_pool,
        lcore = rte_lcore_id();
        printf("%s started at lcore %u\n", __func__, lcore);
 
-       for (i = 0; i != REFCNT_MAX_ITER; i++)
+       for (i = 0; i != 10*REFCNT_MAX_ITER; i++)
                test_refcnt_iter(lcore, i, refcnt_pool, refcnt_mbuf_ring);
 
        refcnt_stop_slaves = 1;
@@ -1058,8 +1060,7 @@ test_mbuf_linearize_check(struct rte_mempool 
*pktmbuf_pool)
        return 0;
 }
 
-static int
-test_mbuf(void)
+int test_mbuf(void)
 {
        int ret = -1;
        struct rte_mempool *pktmbuf_pool = NULL;

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

Reply via email to