On Thu, May 19, 2016 at 12:18:57PM +0000, Ananyev, Konstantin wrote:
>
> Hi everyone,
>
> > On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> > > On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> > > > On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> > > > > To avoid multiple stores on fast path, Ethernet drivers
> > > > > aggregate the writes to data_off, refcnt, nb_segs and port
> > > > > to an uint64_t data and write the data in one shot
> > > > > with uint64_t* at &mbuf->rearm_data address.
> > > > >
> > > > > Some of the non-IA platforms have store operation overhead
> > > > > if the store address is not naturally aligned. This patch
> > > > > fixes the performance issue on those targets.
> > > > >
> > > > > Signed-off-by: Jerin Jacob <jerin.jacob at caviumnetworks.com>
> > > > > ---
> > > > >
> > > > > Tested this patch on IA and non-IA(ThunderX) platforms.
> > > > > This patch shows 400Kpps/core improvement on ThunderX + ixgbe +
> > > > > vector environment.
> > > > > and this patch does not have any overhead on IA platform.
> > > > >
> > > > > Have tried an another similar approach by replacing "buf_len" with
> > > > > "pad" (in this patch context),
> > > > > Since it has additional overhead on read and then mask to keep
> > > > > "buf_len" intact, not much improvement is not shown.
> > > > > ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> > > > >
> > > > > ---
> > > > While this will work and from your tests doesn't seem to have a
> > > > performance impact, I'm not sure I particularly like it. It's
> > > > extending out the end of cacheline0 of the mbuf by 16 bytes, though
> > > > I suppose it's not technically using up any more space of it.
> > >
> > > Extending by 2 bytes. Right ?. Yes, I guess, Now we using only 56 out of
> > > 64 bytes in the first 64-byte cache line.
> > > >
> > > > What I'm wondering about though, is do we have any usecases where we
> > > > need a variable buf_len for packets for RX. These mbufs come directly
> > > > from a mempool, which is generally understood to be a set of
> > > > fixed-sized buffers. I realise that this change was made in the past
> > > > after some discussion, but one of the key points there [at least to my
> > > > reading] was that - even though nobody actually made a concrete case
> > > > where they had variable-sized buffers - having support for them made
> > > > no performance difference.
>
> I was going to point to vhost zcp support, but as Thomas pointed out
> that functionality was removed from dpdk.org recently.
> So I am not aware does such case exist right now in the 'real world' or not.
> Though I still think RX function should leave buf_len field intact.
>
> > > >
> > > > The latter part of that has now changed, and supporting variable-sized
> > > > mbufs from an mbuf pool has a perf impact. Do we definitely need that
> > > > functionality, because the easiest fix here is just to move the
> > > > rxrearm marker back above mbuf_len as it was originally in releases
> > > > like 1.8?
> > >
> > > And initialize the buf_len with mp->elt_size - sizeof(struct rte_mbuf).
> > > Right?
> > >
> > > I don't have a strong opinion on this, I can do this if there is no
> > > objection on this. Let me know.
> > >
> > > However, I do see in future, "buf_len" may belong at the end of the first
> > > 64 byte cache line as currently "port" is defined as uint8_t, IMO, that
> > > is less. We may need to increase that uint16_t. The reason why I think
> > > that because, Currently in ThunderX HW, we do have 128VFs per socket for
> > > built-in NIC, So, the two node configuration and one external PCIe NW
> > > card configuration can easily go beyond 256 ports.
>
> I wonder does anyone really use mbuf port field?
> My thought was - could we drop it completely?
> Actually, after discussing it with Bruce offline, an interesting idea came
> out:
> if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
> we can reduce RX rearm_data to 4B. So with that layout:
>
> struct rte_mbuf {
>
>         MARKER cacheline0;
>
>         void *buf_addr;
>         phys_addr_t buf_physaddr;
>         uint16_t buf_len;
>         uint8_t nb_segs;
>         uint8_t reserved_1byte; /* former port */
>
>         MARKER32 rearm_data;
>         uint16_t data_off;
>         uint16_t refcnt;
>
>         uint64_t ol_flags;
>         ...
>
> We can keep buf_len at its place and avoid 2B gap, while making rearm_data
> 4B long and 4B aligned.
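
To make the offsets in the layout above concrete, here is a minimal sketch that checks them at compile time. It is only an illustration under stated assumptions: a 64-bit target, phys_addr_t taken to be 64 bits wide, and the MARKER fields left out. The names `struct mbuf_layout_a` and `layout_a_rearm_offset()` are hypothetical and are not part of DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed layout; NOT the real rte_mbuf. */
struct mbuf_layout_a {
	void    *buf_addr;        /* 8 bytes on the assumed 64-bit target */
	uint64_t buf_physaddr;    /* assumption: phys_addr_t is 64 bits */
	uint16_t buf_len;
	uint8_t  nb_segs;
	uint8_t  reserved_1byte;  /* former port */
	/* the 4-byte rearm region would start here */
	uint16_t data_off;
	uint16_t refcnt;
	uint64_t ol_flags;
};

/* Byte offset of the rearm region (data_off followed by refcnt). */
static inline size_t layout_a_rearm_offset(void)
{
	return offsetof(struct mbuf_layout_a, data_off);
}

static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(struct mbuf_layout_a, data_off) % 4 == 0,
	      "rearm region should be 4B aligned");
static_assert(offsetof(struct mbuf_layout_a, ol_flags) ==
	      offsetof(struct mbuf_layout_a, refcnt) + sizeof(uint16_t),
	      "no padding between the rearm region and ol_flags");
```

Under those assumptions buf_len stays where it is, the rearm region begins on a 4-byte boundary, and ol_flags follows it with no gap.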
Couple of comments,
- IMO, it is good if nb_segs can move under rearm_data, as some drivers
  (maybe not ixgbe) can also write nb_segs in one shot in the
  segmented rx handler case.
- I think it makes sense to keep port in mbuf so that applications can make
  use of it (not sure what real application developers think of this).
- If writing 4B and 8B consumes the same cycles (at least on arm64), then I
  think it makes sense to make it 8B wide, with as many pre-built
  constants as possible.

>
> Another similar alternative, is to make mbuf_prefree() to set refcnt=1
> (as it update it anyway). Then we can remove refcnt from the RX rearm_data,
> and again make rearm_data 4B long and 4B aligned:
>
> struct rte_mbuf {
>
>         MARKER cacheline0;
>
>         void *buf_addr;
>         phys_addr_t buf_physaddr;
>         uint16_t buf_len;
>         uint16_t refcnt;
>
>         MARKER32 rearm_data;
>         uint16_t data_off;
>         uint8_t nb_segs;
>         uint8_t port;

The only problem I see with this approach is that the port data type cannot
be extended to uint16_t in future.

>
>         uint64_t ol_flags;
>         ..
>
> As additional plus, __rte_mbuf_raw_alloc() wouldn't need to modify mbuf
> contents at all - which probably is a good thing.
> As a drawback - we'll have a free mbufs in pool with refcnt==1, which
> probably reduce debug ability of the mbuf code.
>
> Konstantin
>
> > Ok, good point. If you think it's needed, and if we are changing the mbuf
> > structure, it might be a good time to extend that field while you are at
> > it, save a second ABI break later on.
> >
> > /Bruce
> >
> > > > Regards,
> > > > /Bruce
> > > >
> > > > Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
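
For reference, the one-shot rearm store discussed in this thread could look roughly like the sketch below, using the second layout (refcnt moved out, 4B rearm region holding data_off, nb_segs and port). This is an illustration only, assuming a 64-bit target and a 64-bit phys_addr_t. The names `struct mbuf_layout_b`, `struct rearm_b_fields`, `make_rearm_b()` and `rearm_b()` are hypothetical and do not exist in DPDK. The copy goes through memcpy() to stay within C aliasing rules; a compiler will typically turn it into a single 32-bit store, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the second proposed layout; NOT the real rte_mbuf. */
struct mbuf_layout_b {
	void    *buf_addr;
	uint64_t buf_physaddr;  /* assumption: phys_addr_t is 64 bits */
	uint16_t buf_len;
	uint16_t refcnt;        /* left at 1 by prefree in this alternative */
	uint16_t data_off;      /* 4-byte rearm region starts here */
	uint8_t  nb_segs;
	uint8_t  port;
	uint64_t ol_flags;
};

/* The fields covered by the rearm region, in the same order as above. */
struct rearm_b_fields {
	uint16_t data_off;
	uint8_t  nb_segs;
	uint8_t  port;
};

static_assert(sizeof(struct rearm_b_fields) == sizeof(uint32_t),
	      "rearm fields must pack into 4 bytes");
static_assert(offsetof(struct mbuf_layout_b, data_off) % sizeof(uint32_t) == 0,
	      "rearm region should be naturally aligned for a 4B store");

/* Build the rearm constant once (e.g. per RX queue), outside the fast path. */
static inline uint32_t make_rearm_b(uint16_t data_off, uint8_t nb_segs,
				    uint8_t port)
{
	struct rearm_b_fields f = { data_off, nb_segs, port };
	uint32_t v;

	memcpy(&v, &f, sizeof(v));
	return v;
}

/* Fast path: overwrite data_off, nb_segs and port with one 4-byte copy. */
static inline void rearm_b(struct mbuf_layout_b *m, uint32_t rearm)
{
	memcpy((char *)m + offsetof(struct mbuf_layout_b, data_off),
	       &rearm, sizeof(rearm));
}
```

The same pattern extends to an 8B region if more fields are folded into the pre-built constant, as long as the region starts on an 8-byte boundary.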