On Wed, Dec 06, 2017 at 11:29:38AM +0000, Matan Azrad wrote:
> Hi Adrien
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarg...@6wind.com]
> > Sent: Wednesday, December 6, 2017 12:59 PM
> > To: Matan Azrad <ma...@mellanox.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
> > 
> > On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote:
> > > mlx4 Tx block can handle up to 4 data segments or control segment + up
> > > to 3 data segments. The first data segment in each not first Tx block
> > > must validate Tx queue wraparound and must use IO memory barrier
> > > before writing the byte count.
> > >
> > > The previous multi-segment code used "for" loop to iterate over all
> > > packet segments and separated first Tx block data case by "if"
> > > statments.
> > 
> > statments => statements
> > 
> > > Use switch case and unconditional branches instead of "for" loop can
> > > optimize the case and prevents the unnecessary checks for each data
> > > segment; This hints to compiler to create opitimized jump table.
> > 
> > opitimized => optimized
> > 
> > > Optimize this case by switch case and unconditional branches usage.
> > >
> > > Signed-off-by: Matan Azrad <ma...@mellanox.com>
> > > ---
> > >  drivers/net/mlx4/mlx4_rxtx.c | 165 +++++++++++++++++++++++++------------------
> > >  drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
> > >  2 files changed, 128 insertions(+), 70 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
<snip>
> > > +	/*
> > > +	 * Fill the data segments with buffer information.
> > > +	 * First WQE TXBB head segment is always control segment,
> > > +	 * so jump to tail TXBB data segments code for the first
> > > +	 * WQE data segments filling.
> > > +	 */
> > > +	goto txbb_tail_segs;
> > > +txbb_head_seg:
> > 
> > I'm not fundamentally opposed to "goto" unlike a lot of people out there,
> > but this doesn't look good.
> > It's OK to use goto for error cases and to extricate yourself when
> > trapped in an inner loop, also in some optimization scenarios where it
> > sometimes makes sense, but not when the same can be achieved through
> > standard loop constructs and keywords.
> > 
> > In this case I'm under the impression you could have managed with a
> > do { ... } while (...) construct. You need to try harder to reorganize
> > these changes or prove it can't be done without negatively impacting
> > performance.
> > 
> > Doing so should make this patch shorter as well.
> 
> I noticed this could be done with a loop and without unconditional
> branches, but when I checked it I found a nice performance improvement
> this way. When I used the loop, performance degraded.
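For reference, the shape being compared boils down to something like the
following standalone sketch (hypothetical, not the actual mlx4 code: the
names and the plain int "segments" are stand-ins). The switch on the
remaining segment count is what invites the compiler to emit a jump table,
and the backward goto replaces the per-segment checks a generic for loop
would carry:

```c
#include <assert.h>
#include <stddef.h>

#define SEGS_PER_BLOCK 4

/*
 * Copy `n` "segments" in blocks of up to SEGS_PER_BLOCK entries.
 * The first block's head slot is taken by a control segment, so the
 * first iteration jumps straight to the tail handling (up to 3 entries);
 * every later block writes a head segment (the slot that would need the
 * wraparound check in the real driver) and then falls into the same tail.
 */
static void
fill_segments(int *out, const int *src, size_t n)
{
	size_t left = n;

	goto tail;	/* first block: head slot is the control segment */
next_block:
	*out++ = *src++;	/* head segment of a subsequent block */
	left--;
tail:
	switch (left > SEGS_PER_BLOCK - 1 ? SEGS_PER_BLOCK - 1 : left) {
	case 3:
		*out++ = *src++;
		left--;
		/* fall through */
	case 2:
		*out++ = *src++;
		left--;
		/* fall through */
	case 1:
		*out++ = *src++;
		left--;
		/* fall through */
	case 0:
		break;
	}
	if (left)
		goto next_block;
}
```

The equivalent do { ... } while (...) form folds the head-segment special
case back into the loop body as a conditional, which is exactly the
per-segment check the switch version avoids.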
OK, you can leave it as is in the meantime then, we'll keep that in mind
for the next refactoring iteration.

<snip>
> > > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
> > > index 463df2b..8207232 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > > @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
> > >  	return mlx4_txq_add_mr(txq, mp, i);
> > >  }
> > > 
> > > +/**
> > > + * Write Tx data segment to the SQ.
> > > + *
> > > + * @param dseg
> > > + *   Pointer to data segment in SQ.
> > > + * @param lkey
> > > + *   Memory region lkey.
> > > + * @param addr
> > > + *   Data address.
> > > + * @param byte_count
> > > + *   Big Endian bytes count of the data to send.
> > 
> > Big Endian => Big endian
> > 
> > How about using the dedicated type to properly document it?
> > See rte_be32_t from rte_byteorder.h.
> 
> Learned something new, thanks, will check it.
> 
> > > + */
> > > +static inline void
> > > +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
> > > +		      uint32_t lkey, uintptr_t addr, uint32_t byte_count)
> > > +{
> > > +	dseg->addr = rte_cpu_to_be_64(addr);
> > > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > > +#if RTE_CACHE_LINE_SIZE < 64
> > > +	/*
> > > +	 * Need a barrier here before writing the byte_count
> > > +	 * fields to make sure that all the data is visible
> > > +	 * before the byte_count field is set.
> > > +	 * Otherwise, if the segment begins a new cacheline,
> > > +	 * the HCA prefetcher could grab the 64-byte chunk and
> > > +	 * get a valid (!= 0xffffffff) byte count but stale
> > > +	 * data, and end up sending the wrong data.
> > > +	 */
> > > +	rte_io_wmb();
> > > +#endif /* RTE_CACHE_LINE_SIZE */
> > > +	dseg->byte_count = byte_count;
> > > +}
> > > +
> > 
> > No need to expose this function in a header file. Note that rte_cpu_*() and
> > rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h
> > respectively.
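To illustrate the dedicated-type suggestion, here is a rough standalone
sketch. The `be32_t` and `cpu_to_be_32()` below are stand-ins I wrote for
this example; the real definitions are rte_be32_t and rte_cpu_to_be_32()
from rte_byteorder.h:

```c
#include <stdint.h>

/* Stand-in for rte_be32_t: a 32-bit value known to be big-endian. */
typedef uint32_t be32_t;

/* Stand-in for rte_cpu_to_be_32(): CPU order to big-endian. */
static inline be32_t
cpu_to_be_32(uint32_t v)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return __builtin_bswap32(v);
#else
	return v;
#endif
}

/* With the dedicated type, the prototype itself documents that the
 * caller must pass an already-converted big-endian value, instead of
 * relying on a "Big endian" note in the comment block. */
struct data_seg {
	be32_t byte_count;	/* big-endian on the wire */
};

static inline void
set_byte_count(struct data_seg *dseg, be32_t byte_count)
{
	dseg->byte_count = byte_count;
}
```

The point is purely documentation: the typedef is still a plain uint32_t,
so this costs nothing at runtime, but callers can no longer mistake which
byte order the parameter expects without it being visible in the signature.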
> 
> Shouldn't inline functions be in header files?

Not necessarily, same as for other kinds of functions actually. They have
to be declared where needed: in a header file only if several files depend
on them, otherwise they bring namespace pollution for no apparent reason.

-- 
Adrien Mazarguil
6WIND