Hi Nelio > -----Original Message----- > From: Nélio Laranjeiro [mailto:nelio.laranje...@6wind.com] > Sent: Thursday, October 26, 2017 4:44 PM > To: Matan Azrad <ma...@mellanox.com> > Cc: Ophir Munk <ophi...@mellanox.com>; Adrien Mazarguil > <adrien.mazarg...@6wind.com>; dev@dpdk.org; Thomas Monjalon > <tho...@monjalon.net>; Olga Shern <ol...@mellanox.com>; Mordechay > Haimovsky <mo...@mellanox.com> > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions > > On Thu, Oct 26, 2017 at 12:30:54PM +0000, Matan Azrad wrote: > > Hi Nelio > > Please see my comments below (3). > > > > > > > -----Original Message----- > > > From: Nélio Laranjeiro [mailto:nelio.laranje...@6wind.com] > > > Sent: Thursday, October 26, 2017 3:12 PM > > > To: Matan Azrad <ma...@mellanox.com> > > > Cc: Ophir Munk <ophi...@mellanox.com>; Adrien Mazarguil > > > <adrien.mazarg...@6wind.com>; dev@dpdk.org; Thomas Monjalon > > > <tho...@monjalon.net>; Olga Shern <ol...@mellanox.com>; > Mordechay > > > Haimovsky <mo...@mellanox.com> > > > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path > > > functions > > > > > > On Thu, Oct 26, 2017 at 10:31:06AM +0000, Matan Azrad wrote: > > > > Hi Nelio > > > > > > > > I think the memory barrier discussion is not relevant for this > > > > patch (if it will be relevant I will create new one). > > > > Please see my comments inline. > > > > > > It was not my single comment. There is also useless code like > > > having null segments in the packets which is not allowed on DPDK. > > > > Sorry, but I can't find comments in the previous mails. > > You should search in the series, > > > Moreover this comment(first time I see it) is not relevant to this patch > > and > asking something else. > > All what this patch does is to merge 2 functions to prevent double > > asking about WQ remain space... > > Again in the series itself. > > The point, this series embed 7 patches for "performance improvement", > whereas the single improvement is avoiding to call an outside function by > copy/pasting it into the PMD. > In fact it will save few cycles, but this improvements could have been much > more if the it was not a bare copy/paste. >
This simple merge improves 0.2MPPS in my setup. If you have more improvements (other than reduce if statement) regarding this merge please suggest. > The real question is what is the improvement? If the improvement is > significant, it worse having this series, otherwise it does not as it may also > bring some bugs which may be resolve from its original source whereas this > one will remain. > Each commit in this series improves performance - all of them improve performance significantly and brought us to our target. By the way, I think series discussion should be in patch 0 :) > > Remove memory\compiler barriers or dealing with null segments are not in > the scope here. > > > > > > > > > Regarding this specific patch, I didn't see any comment from you, > > > > Are you agree with it? > > > > > > > > > -----Original Message----- > > > > > From: Nélio Laranjeiro [mailto:nelio.laranje...@6wind.com] > > > > > Sent: Wednesday, October 25, 2017 10:50 AM > > > > > To: Ophir Munk <ophi...@mellanox.com> > > > > > Cc: Adrien Mazarguil <adrien.mazarg...@6wind.com>; dev@dpdk.org; > > > > > Thomas Monjalon <tho...@monjalon.net>; Olga Shern > > > > > <ol...@mellanox.com>; Matan Azrad <ma...@mellanox.com> > > > > > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path > > > > > functions > > > > > > > > > > On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote: > > > > > > Hi, > > > > > > > > > > > > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote: > > > > > > > > > > > > > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote: > > > > > > > > From: Matan Azrad <ma...@mellanox.com> > > > > > > > > > > > > > > > > Merge tx_burst and mlx4_post_send functions to prevent > > > > > > > > double asking about WQ remain space. > > > > > > > > > > > > > > > > This should improve performance. > > > > > > > > > > > > > > > > Signed-off-by: Matan Azrad <ma...@mellanox.com> > > > > > > > > --- > > > > > > > > drivers/net/mlx4/mlx4_rxtx.c | 353 > > > > > > > > +++++++++++++++++++++---------------------- > > > > > > > > 1 file changed, 170 insertions(+), 183 deletions(-) > > > > > > > > > > > > > > What are the real expectation you have on the remaining > > > > > > > patches of the series? > > > > > > > > > > > > > > According to the comment of this commit log "This should > > > > > > > improve performance" there are too many barriers at each > > > > > > > packet/segment level to improve something. > > > > > > > > > > > > > > The point is, mlx4_burst_tx() should write all the WQE > > > > > > > without any barrier as it is processing a burst of packets > > > > > > > (whereas Verbs functions which may only process a single > packet). > > > > > > > > > > > > > The lonely barrier which should be present is the one to > > > > > > > ensure that all the host memory is flushed before triggering the > > > > > > > Tx > doorbell. > > > > > > > > > > > > > > > > > > > There is a known ConnectX-3 HW limitation: the first 4 bytes > > > > > > of every TXWBB (64 bytes chunks) should be written in a > > > > > > reversed order (from last TXWBB to first TXWBB). > > > > > > > > > > This means the first WQE filled by the burst function is the doorbell. > > > > > In such situation, the first four bytes of it can be written > > > > > before leaving the burst function and after a write memory barrier. > > > > > > > > > > Until this first WQE is not complete, the NIC won't start > > > > > processing the packets. Memory barriers per packets becomes > useless. > > > > > > > > I think this is not true, Since mlx4 HW can prefetch advanced > > > > TXbbs if their first 4 bytes are valid in spite of the first WQE > > > > is still not valid (please > > > read the spec). > > > > > > A compiler barrier is enough on x86 to forbid the CPU to re-order > > > the instructions, on arm you need a memory barrier, there is a macro > > > in DPDK for that, rte_io_wmb(). > > > > > We are also using compiler barrier here. > > > > > Before triggering the doorbell you must flush the case, this is the > > > only place where the rte_wmb() should be used. > > > > > > > We are also using memory barrier only for this reason. > > > > > > > It gives something like: > > > > > > > > > > uint32_t tx_bb_db = 0; > > > > > void *first_wqe = NULL; > > > > > > > > > > /* > > > > > * Prepare all Packets by writing the WQEs without the 4 first bytes > > > > > of > > > > > * the first WQE. > > > > > */ > > > > > for () { > > > > > if (!wqe) { > > > > > first_wqe = wqe; > > > > > tx_bb_db = foo; > > > > > } > > > > > } > > > > > /* Leaving. */ > > > > > rte_wmb(); > > > > > *(uin32_t*)wqe = tx_bb_db; > > > > > return n; > > > > > > > > > > > > > I will take care to check if we can do 2 loops: > > > > Write all last 60B per TXbb. > > > > Memory barrier. > > > > Write all first 4B per TXbbs. > > > > > > > > > > The last 60 bytes of any TXWBB can be written in any order > > > > > > (before writing the first 4 bytes). > > > > > > Is your last statement (using lonely barrier) is in accordance > > > > > > with this limitation? Please explain. > > > > > > > > > > > > > There is also too many cases handled which are useless in > > > > > > > bursts > > > > > situation, > > > > > > > this function needs to be re-written to its minimal use case i.e. > > > > > processing a > > > > > > > valid burst of packets/segments and triggering at the end of > > > > > > > the burst the > > > > > Tx > > > > > > > doorbell. > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > -- > > > > > Nélio Laranjeiro > > > > > 6WIND > > > > > > Regards, > > > > > > -- > > > Nélio Laranjeiro > > > 6WIND > > -- > Nélio Laranjeiro > 6WIND