> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, October 10, 2016 2:58 PM
> To: Wang, Zhihong <zhihong.wang at intel.com>
> Cc: Yuanhan Liu <yuanhan.liu at linux.intel.com>; Maxime Coquelin
> <maxime.coquelin at redhat.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 10 October 2016 at 14:22, Wang, Zhihong <zhihong.wang at intel.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Monday, October 10, 2016 1:32 PM
> >> To: Yuanhan Liu <yuanhan.liu at linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang at intel.com>; Maxime Coquelin
> >> <maxime.coquelin at redhat.com>; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu at linux.intel.com> wrote:
> >> > On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> >> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> >> > > > size (bytes)    improvement (%)
> >> >> > > > 64              4.12
> >> >> > > > 128             6
> >> >> > > > 256             2.65
> >> >> > > > 512             -1.12
> >> >> > > > 1024            -7.02
> >> >> > >
> >> >> > > There is a difference between Zhihong's code and the old code that I
> >> >> > > spotted the first time: Zhihong removed the avail_idx prefetch. I
> >> >> > > understand the prefetch becomes a bit tricky when the mrg-rx code
> >> >> > > path is considered; thus, I didn't comment on that.
> >> >> > >
> >> >> > > That's one of the differences that, IMO, could cause a regression.
> >> >> > > I then finally got a chance to add it back.
> >> >> > >
> >> >> > > A rough test shows it greatly improves the performance for the 1400B
> >> >> > > packet size in the "txonly in host and rxonly in guest" case: +33%
> >> >> > > is the number I get with my test server (Ivybridge).
> >> >> >
> >> >> > Thanks Yuanhan! I'll validate this on x86.
> >> >>
> >> >> Hi Yuanhan,
> >> >>
> >> >> It seems your code doesn't perform correctly. I wrote a new version
> >> >> of the avail idx prefetch but didn't see any perf benefit.
> >> >>
> >> >> To be honest, I doubt the benefit of this idea. The previous mrg_off
> >> >> code had this method but didn't give any benefit.
> >> >
> >> > Good point. I thought of that before, too. But you know I made it
> >> > in a rush, so I didn't think further or test more.
> >> >
> >> > I looked at the code a bit closer this time, and spotted a bug: the
> >> > prefetch actually didn't happen, due to the following code piece:
> >> >
> >> >         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >> >                 prefetch_avail_idx(vq);
> >> >                 ...
> >> >         }
> >> >
> >> > Since vq->next_avail_idx is set to 0 at the entrance of the enqueue path,
> >> > prefetch_avail_idx() will not be called. The fix is easy though: just put
> >> > prefetch_avail_idx before invoking enqueue_packet.
> >> >
> >> > In summary, Zhihong is right, I see no more gains with that fix :(
> >> >
> >> > However, as stated, that's kind of the only difference I found between
> >> > yours and the old code, so maybe it's still worthwhile to have a test
> >> > on ARM, Jianbo?
> >> >
> >> I haven't tested it, but I think there could be no improvement for ARM either.
> >>
> >> A small suggestion for enqueue_packet:
> >>
> >> .....
> >> +       /* start copy from mbuf to desc */
> >> +       while (mbuf_avail || mbuf->next) {
> >> .....
> >>
> >> Considering that pkt_len is in the first cache line (same as data_len),
> >> while the next pointer is in the second cache line, would it be better to
> >> check the total packet length, instead of the last mbuf's next pointer, to
> >> jump out of the while loop and avoid a possible cache miss?
> >
> > Jianbo,
> >
> > Thanks for the reply!
> >
> > This idea sounds good, but it won't help the general perf in my
> > opinion, since the 2nd cache line is accessed anyway earlier, in
> > virtio_enqueue_offload.
> >
> Yes, you are right. I'm thinking of prefetching it beforehand.
> And if it's a chained mbuf, virtio_enqueue_offload will not be called
> in the next loop iteration.
>
> > Also, this would bring a NULL check when we actually access mbuf->next.
> >
> > BTW, could you please publish the numbers for:
> >
> > 1. mrg_rxbuf=on, comparison between original and original + this patch
> >
> > 2. mrg_rxbuf=off, comparison between original and original + this patch
> >
> > So we can have a whole picture of how this patch impacts the ARM platform.
> >
> I think you have already got many results in my previous emails.
> Sorry, I can't test right now; I'm busy with other things.
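
Just to check that I read the loop-termination idea the same way, below is a
minimal sketch of it (copy_to_desc() and the local variable names are
hypothetical, not the actual patch): drive the copy loop off pkt_len, which
sits in the mbuf's first cache line, so mbuf->next is only dereferenced once
another segment is known to exist, and no extra NULL check is needed.

    #include <rte_mbuf.h>

    /* Assumed to be in scope: struct rte_mbuf *m (the packet being enqueued),
     * the virtqueue vq, and a hypothetical helper copy_to_desc() that copies
     * up to 'len' bytes from the mbuf into the guest descriptors and returns
     * the number of bytes it actually copied.
     */
    uint32_t pkt_left    = rte_pktmbuf_pkt_len(m);   /* 1st cache line */
    uint32_t mbuf_avail  = rte_pktmbuf_data_len(m);  /* 1st cache line */
    uint32_t mbuf_offset = 0;

    while (pkt_left > 0) {
            if (mbuf_avail == 0) {
                    /* bytes are still pending, so a next segment must
                     * exist: no NULL check before the dereference */
                    m = m->next;
                    mbuf_avail  = rte_pktmbuf_data_len(m);
                    mbuf_offset = 0;
            }

            uint32_t cpy_len = copy_to_desc(vq, m, mbuf_offset, mbuf_avail);

            mbuf_avail  -= cpy_len;
            mbuf_offset += cpy_len;
            pkt_left    -= cpy_len;
    }

As noted above, though, whether this saves anything in practice depends on
virtio_enqueue_offload already touching the 2nd cache line.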
We're still missing the mrg_rxbuf=on data.
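
For reference, here is a rough sketch of the kind of avail idx prefetch
discussed earlier in the thread, with the prefetch issued before each
enqueue_packet() call as Yuanhan suggested. The helper body, the
NR_AVAIL_IDX_PREFETCH value and the loop variables are illustrative
assumptions, not the actual patch:

    #include <rte_prefetch.h>

    /* struct vhost_virtqueue (vq->avail, vq->size, vq->last_avail_idx) is the
     * library-internal vhost type; NR_AVAIL_IDX_PREFETCH is the look-ahead
     * distance named in the thread, with an arbitrary value picked here.
     */
    #define NR_AVAIL_IDX_PREFETCH 32

    static inline void
    prefetch_avail_idx(struct vhost_virtqueue *vq, uint16_t avail_idx)
    {
            /* warm the avail ring entry we will consume a few packets later;
             * vq->size is a power of two, so masking wraps the index */
            rte_prefetch0(&vq->avail->ring[(avail_idx + NR_AVAIL_IDX_PREFETCH) &
                                           (vq->size - 1)]);
    }

    /* in the enqueue burst loop */
    for (i = 0; i < count; i++) {
            prefetch_avail_idx(vq, vq->last_avail_idx + i); /* before, not after */
            enqueue_packet(dev, vq, pkts[i]);
    }

As discussed above, this did not bring measurable gains on x86 in either
Yuanhan's test or mine, so the ARM numbers remain the open question.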