Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:
> On 30 March 2017 at 12:46, Naveen N. Rao
> <naveen.n....@linux.vnet.ibm.com> wrote:
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
>> are the results:
>>   generic:   0.245315533 seconds time elapsed ( +- 1.83% )
>> optimized:   0.169282701 seconds time elapsed ( +- 1.96% )
> Wondering what makes gcc not produce efficient assembly code. Can you
> please post the disassembly of the C implementation of memset64? Just
> for information.
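
For reference, such a test module is essentially just something along
these lines (a rough sketch, not the exact module: the fill pattern is
arbitrary, and the time itself would be measured externally, e.g. with
perf stat around the module load):

/* Sketch of a test module: vmalloc() a 1GB buffer and fill it with
 * memset64() (assumes the memset64() helper from this series). */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/vmalloc.h>
#include <linux/string.h>

#define BUF_SIZE	(1UL << 30)	/* 1GB */

static int __init memset64_test_init(void)
{
	u64 *buf = vmalloc(BUF_SIZE);

	if (!buf)
		return -ENOMEM;

	/* One 64-bit store per element across the whole buffer. */
	memset64(buf, 0x0102030405060708ULL, BUF_SIZE / sizeof(u64));

	vfree(buf);
	return 0;
}

static void __exit memset64_test_exit(void)
{
}

module_init(memset64_test_init);
module_exit(memset64_test_exit);
MODULE_LICENSE("GPL");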
As for the disassembly: it's largely the same as what Christophe posted
for powerpc32. Others will have better insights, but AFAICS gcc only
unrolls the loop when -funroll-loops is passed (which we don't use).
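
For reference, the generic version is essentially just a simple 64-bit
store loop along these lines (a sketch, minus the usual lib/string.c
documentation and export boilerplate):

#include <linux/types.h>

/* Generic fallback: one 64-bit store per iteration. Without
 * -funroll-loops, gcc keeps this as a plain loop. */
void *memset64(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *xs = s;

	while (count--)
		*xs++ = v;

	return s;
}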
As an aside, it looks like gcc recently picked up an optimization in v7
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to
adjacent memory locations into fewer, wider, stores. It is enabled by
the -fstore-merging option and at the -O2 optimization level or higher
(and -Os)."
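
As a rough illustration (my own example, not from the gcc notes), the
below is the sort of pattern that pass targets: with gcc 7 at -O2 (or
with -fstore-merging), the four adjacent byte stores can be combined
into a single 32-bit store.

struct flags {
	char a, b, c, d;
};

/* Four adjacent constant byte stores; gcc 7's store merging pass can
 * emit these as one wider store. */
void init_flags(struct flags *f)
{
	f->a = 1;
	f->b = 2;
	f->c = 3;
	f->d = 4;
}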
- Naveen