> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> Sent: Friday, September 06, 2019 5:12 PM
> To: Liu, Yong <yong....@intel.com>
> Cc: Ilya Maximets <i.maxim...@samsung.com>; Bie, Tiwei
> <tiwei....@intel.com>; maxime.coque...@redhat.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 02/14] vhost: add burst enqueue function
> for packed ring
>
> On Fri, Sep 06, 2019 at 01:42:44AM +0000, Liu, Yong wrote:
> >
> >
> > > -----Original Message-----
> > > From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> > > Sent: Thursday, September 05, 2019 6:31 PM
> > > To: Liu, Yong <yong....@intel.com>; Bie, Tiwei <tiwei....@intel.com>;
> > > maxime.coque...@redhat.com; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v1 02/14] vhost: add burst enqueue
> function
> > > for packed ring
> > >
> > > On 05.09.2019 19:14, Marvin Liu wrote:
> > > > Burst enqueue function will first check whether descriptors are cache
> > > > aligned. It will also check prerequisites in the beginning. Burst
> > > > enqueue function not support chained mbufs, single packet enqueue
> > > > function will handle it.
> > > >
> > > > Signed-off-by: Marvin Liu <yong....@intel.com>
> > >
> > > Hi.
> > >
> > > Can we rely on loop unrolling by compiler instead of repeating each
> > > command 4 times?
> > >
> > > For example:
> > >
> > > uint64_t len[PACKED_DESCS_BURST];
> > >
> > > for (i = 0; i < PACKED_DESCS_BURST; i++)
> > > len[i] = descs[avail_idx + i].len;
> > >
> > >
> > > For 'if's:
> > >
> > > res = false;
> > > for (i = 0; i < PACKED_DESCS_BURST; i++)
> > > res |= pkts[i]->next != NULL;
> > > if (unlikely(res))
> > > return -1;
> > >
> > > or just
> > >
> > > for (i = 0; i < PACKED_DESCS_BURST; i++)
> > > if (unlikely(pkts[i]->next != NULL))
> > > return -1;
> > >
> > > Since PACKED_DESCS_BURST is a fairly small constant, loops should be
> > > unrolled by compiler producing almost same code.
> > >
> > > This will significantly reduce code size and will also allow to
> > > play with PACKED_DESCS_BURST value without massive code changes.
> > >
> > > Same is applicable to other patches in the series.
> > >
> > > What do you think?
> > >
> >
> > Hi Ilya,
> > I did some test with the unroll availability of various compilers before.
> > All listed compilers will cause loopback performance drop compared to
> repeating code version, especially GCC7.4 and ICC.
> > Newer compilers will have much less impact (around 3%) on the throughput.
> > If we can accept that, repeating code can be replaced with small loop
> function.
> >
> > |----------------|---------------|-------------|------|
> > | Compiler | Auto unrolled | Fixed batch | Gap |
> > |----------------|---------------|-------------|------|
> > | Clang6.0.0 | 13.1M | 13.5M | 0.4M |
> > |----------------|---------------|-------------|------|
> > | GCC 8.3.0 | 13.9M | 14.4M | 0.5M |
> > |----------------|---------------|-------------|------|
> > | GCC 7.4.0 | 12.6M | 13.5M | 0.9M |
> > |----------------|---------------|-------------|------|
> > | ICC 19.0.4.243 | 11.0M | 12.3M | 1.3M |
> > |----------------|---------------|-------------|------|
> >
> > Thanks,
> > Marvin
> >
> Did you verify that the compiler was actually unrolling the loops? You may
> need to put __attribute__((optimize("unroll-loops"))) in the function
> definition.
Thanks for note, Bruce. I only checked GCC compiled binaries, loop have been
unrolled.
Will double check clang and ICC compiling result.
Regards,
Marvin
>
> /Bruce