Hi Nelio I think the memory barrier discussion is not relevant for this patch (if it will be relevant I will create new one). Please see my comments inline.
Regarding this specific patch, I didn't see any comment from you, Are you agree with it? > -----Original Message----- > From: Nélio Laranjeiro [mailto:nelio.laranje...@6wind.com] > Sent: Wednesday, October 25, 2017 10:50 AM > To: Ophir Munk <ophi...@mellanox.com> > Cc: Adrien Mazarguil <adrien.mazarg...@6wind.com>; dev@dpdk.org; > Thomas Monjalon <tho...@monjalon.net>; Olga Shern > <ol...@mellanox.com>; Matan Azrad <ma...@mellanox.com> > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions > > On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote: > > Hi, > > > > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote: > > > > > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote: > > > > From: Matan Azrad <ma...@mellanox.com> > > > > > > > > Merge tx_burst and mlx4_post_send functions to prevent double > > > > asking about WQ remain space. > > > > > > > > This should improve performance. > > > > > > > > Signed-off-by: Matan Azrad <ma...@mellanox.com> > > > > --- > > > > drivers/net/mlx4/mlx4_rxtx.c | 353 > > > > +++++++++++++++++++++---------------------- > > > > 1 file changed, 170 insertions(+), 183 deletions(-) > > > > > > What are the real expectation you have on the remaining patches of > > > the series? > > > > > > According to the comment of this commit log "This should improve > > > performance" there are too many barriers at each packet/segment > > > level to improve something. > > > > > > The point is, mlx4_burst_tx() should write all the WQE without any > > > barrier as it is processing a burst of packets (whereas Verbs > > > functions which may only process a single packet). > > > > > The lonely barrier which should be present is the one to ensure that > > > all the host memory is flushed before triggering the Tx doorbell. > > > > > > > There is a known ConnectX-3 HW limitation: the first 4 bytes of every > > TXWBB (64 bytes chunks) should be > > written in a reversed order (from last TXWBB to first TXWBB). > > This means the first WQE filled by the burst function is the doorbell. > In such situation, the first four bytes of it can be written before > leaving the burst function and after a write memory barrier. > > Until this first WQE is not complete, the NIC won't start processing the > packets. Memory barriers per packets becomes useless. I think this is not true, Since mlx4 HW can prefetch advanced TXbbs if their first 4 bytes are valid in spite of the first WQE is still not valid (please read the spec). > > It gives something like: > > uint32_t tx_bb_db = 0; > void *first_wqe = NULL; > > /* > * Prepare all Packets by writing the WQEs without the 4 first bytes of > * the first WQE. > */ > for () { > if (!wqe) { > first_wqe = wqe; > tx_bb_db = foo; > } > } > /* Leaving. */ > rte_wmb(); > *(uin32_t*)wqe = tx_bb_db; > return n; > I will take care to check if we can do 2 loops: Write all last 60B per TXbb. Memory barrier. Write all first 4B per TXbbs. > > The last 60 bytes of any TXWBB can be written in any order (before > > writing the first 4 bytes). > > Is your last statement (using lonely barrier) is in accordance with > > this limitation? Please explain. > > > > > There is also too many cases handled which are useless in bursts > situation, > > > this function needs to be re-written to its minimal use case i.e. > processing a > > > valid burst of packets/segments and triggering at the end of the burst the > Tx > > > doorbell. > > > > > Regards, > > -- > Nélio Laranjeiro > 6WIND