Hi Jerin
On 10/23/2017 6:06 PM, Jerin Jacob Wrote:
-----Original Message-----
Date: Mon, 23 Oct 2017 16:49:01 +0800
From: Jia He <hejia...@gmail.com>
To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
Cc: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Zhao, Bing"
<iloveth...@163.com>, Olivier MATZ <olivier.m...@6wind.com>,
"dev@dpdk.org" <dev@dpdk.org>, "jia...@hxt-semitech.com"
<jia...@hxt-semitech.com>, "jie2....@hxt-semitech.com"
<jie2....@hxt-semitech.com>, "bing.z...@hxt-semitech.com"
<bing.z...@hxt-semitech.com>, "Richardson, Bruce"
<bruce.richard...@intel.com>
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.4.0
Hi Jerin
On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
-----Original Message-----
[...]
dependent on each other.
Thus a memory barrier is necessary.
Yes. The barrier is necessary.
In fact, upstream freebsd fixed this issue for arm64. DPDK ring
implementation is derived from freebsd's buf_ring.h.
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
I think the only outstanding issue is how to reduce the performance
impact for arm64. I believe using acquire/release semantics instead
of rte_smp_rmb() will reduce the performance overhead, as in the similar ring
implementations below:
freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
odp:
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
Jia,
1) Can you verify that the use of acquire/release semantics fixes the problem
on your platform? e.g. the use of atomic_load_acq* in the referenced code.
2) If so, what is the overhead between acquire/release and plain smp_rmb()
barriers? Based on that, we need to decide which path to take.
I've tested 3 cases. The new 3rd case uses the load_acquire barrier
(half barrier) you mentioned at the above link.
The patch seems like:
@@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,

		/* Reset n to the initial burst count */
		n = max;

-		*old_head = r->prod.head;
-		const uint32_t cons_tail = r->cons.tail;
+		*old_head = atomic_load_acq_32(&r->prod.head);
+		const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);

@@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s

		/* Restore n as it may change every loop */
		n = max;

-		*old_head = r->cons.head;
-		const uint32_t prod_tail = r->prod.tail;
+		*old_head = atomic_load_acq_32(&r->cons.head);
+		const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);

		/* The subtraction is done between two unsigned 32bits value
		 * (the result is always modulo 32 bits even if we have
		 * cons_head > prod_tail). So 'entries' is always between 0
		 * and size(ring)-1. */
The half barrier patch passed the functional test.
As for the performance comparison on *arm64* (the debug patch is at
http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the
test results below:
[case 1] old codes, no barrier
============================================
Performance counter stats for './test --no-huge -l 1-10':
689275.001200 task-clock (msec) # 9.771 CPUs utilized
6223 context-switches # 0.009 K/sec
10 cpu-migrations # 0.000 K/sec
653 page-faults # 0.001 K/sec
1721190914583 cycles # 2.497 GHz
3363238266430 instructions # 1.95 insn per cycle
<not supported> branches
27804740 branch-misses # 0.00% of all branches
70.540618825 seconds time elapsed
[case 2] full barrier with rte_smp_rmb()
============================================
Performance counter stats for './test --no-huge -l 1-10':
582557.895850 task-clock (msec) # 9.752 CPUs utilized
5242 context-switches # 0.009 K/sec
10 cpu-migrations # 0.000 K/sec
665 page-faults # 0.001 K/sec
1454360730055 cycles # 2.497 GHz
587197839907 instructions # 0.40 insn per cycle
<not supported> branches
27799687 branch-misses # 0.00% of all branches
59.735582356 seconds time elapsed
[case 3] half barrier with load_acquire
============================================
Performance counter stats for './test --no-huge -l 1-10':
660758.877050 task-clock (msec) # 9.764 CPUs utilized
5982 context-switches # 0.009 K/sec
11 cpu-migrations # 0.000 K/sec
657 page-faults # 0.001 K/sec
1649875318044 cycles # 2.497 GHz
591583257765 instructions # 0.36 insn per cycle
<not supported> branches
27994903 branch-misses # 0.00% of all branches
67.672855107 seconds time elapsed
Please note the context-switches in the perf results.
The test results sorted by running time are:
full barrier < half barrier < no barrier
AFAICT, in this case, the CPU reordering adds to the possibility of
context switching and increases the running time.
Any ideas?
Regarding the performance test, it is better to use the ring perf test case
on _isolated_ cores to measure the impact on the number of enqueue/dequeue
operations.
example:
./build/app/test -c 0xff -n 4
ring_perf_autotest
It seems that on our arm64 server, ring_perf_autotest finishes in a few
seconds. Is there anything wrong with the configuration or environment setup?
root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
EAL: Detected 44 lcore(s)
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>per_lcore_autotest
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 2
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0
### Testing empty dequeue ###
SC empty dequeue: 0.02
MC empty dequeue: 0.04
### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.31
SP/SC bulk enq/dequeue (size: 32): 0.05
MP/MC bulk enq/dequeue (size: 32): 0.09
### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.39
SP/SC bulk enq/dequeue (size: 32): 0.04
MP/MC bulk enq/dequeue (size: 32): 0.12
### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 0.37
MP/MC bulk enq/dequeue (size: 8): 0.92
SP/SC bulk enq/dequeue (size: 32): 0.12
MP/MC bulk enq/dequeue (size: 32): 0.26
Test OK
RTE>>
Cheers,
Jia
By default, arm64+dpdk will be using the el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50MHz or 100MHz. So, you can
follow the scheme below to get an accurate cycle measurement:
See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
Check section 44.2.2, "High-resolution cycle counter"