Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:
> On 30 March 2017 at 12:46, Naveen N. Rao
> <naveen.n....@linux.vnet.ibm.com> wrote:
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
>> are the results:
>>   generic:   0.245315533 seconds time elapsed ( +- 1.83% )
>> optimized:   0.169282701 seconds time elapsed ( +- 1.96% )
> Wondering what makes gcc not produce efficient assembly code. Can you
> please post the disassembly of the C implementation of memset64? Just
> for information.
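
For reference, such a test module is essentially just something along
these lines (a rough sketch, not the exact module: the fill pattern is
arbitrary, and the time itself would be measured externally, e.g. with
perf stat around the module load):

/* Sketch of a test module: vmalloc() a 1GB buffer and fill it with
 * memset64() (assumes the memset64() helper from this series). */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/vmalloc.h>
#include <linux/string.h>

#define BUF_SIZE	(1UL << 30)	/* 1GB */

static int __init memset64_test_init(void)
{
	u64 *buf = vmalloc(BUF_SIZE);

	if (!buf)
		return -ENOMEM;

	/* One 64-bit store per element across the whole buffer. */
	memset64(buf, 0x0102030405060708ULL, BUF_SIZE / sizeof(u64));

	vfree(buf);
	return 0;
}

static void __exit memset64_test_exit(void)
{
}

module_init(memset64_test_init);
module_exit(memset64_test_exit);
MODULE_LICENSE("GPL");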
As for the disassembly: it's largely the same as what Christophe posted
for powerpc32. Others will have better insights, but AFAICS gcc only
unrolls the loop when -funroll-loops is passed (which we don't use).
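
For reference, the generic version is essentially just a simple 64-bit
store loop along these lines (a sketch, minus the usual lib/string.c
documentation and export boilerplate):

#include <linux/types.h>

/* Generic fallback: one 64-bit store per iteration. Without
 * -funroll-loops, gcc keeps this as a plain loop. */
void *memset64(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *xs = s;

	while (count--)
		*xs++ = v;

	return s;
}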
As an aside, it looks like gcc recently picked up an optimization in v7
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to
adjacent memory locations into fewer, wider, stores. It is enabled by
the -fstore-merging option and at the -O2 optimization level or higher
(and -Os)."
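
As a rough illustration (my own example, not from the gcc notes), the
below is the sort of pattern that pass targets: with gcc 7 at -O2 (or
with -fstore-merging), the four adjacent byte stores can be combined
into a single 32-bit store.

struct flags {
	char a, b, c, d;
};

/* Four adjacent constant byte stores; gcc 7's store merging pass can
 * emit these as one wider store. */
void init_flags(struct flags *f)
{
	f->a = 1;
	f->b = 2;
	f->c = 3;
	f->d = 4;
}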
- Naveen