As I commented on the libc-alpha list, for string functions a header expansion is better than builtins from a maintenance perspective, and a header is also a lot easier to write and review than doing the same in gcc. Jeff said that this belongs in gcc. When I asked about the benefits he couldn't name any, so I ask again: what are the benefits of doing it there?
I will split the issues. That most string builtins cause a performance regression and shouldn't be used unless both arguments are constant I will cover in a separate thread. Then there are possibilities to partially expand a function to improve performance. Here I cover the problem of reductions, i.e. that you should use foo instead of bar. The problem is that these depend on the quality of the implementation, so it is questionable how to check for them in gcc.

The first issue is memcpy versus mempcpy. The current glibc consensus is that you should use memcpy and convert mempcpy in a header. The reason is cache pressure: it isn't worth adding 1000 bytes to the instruction cache just to save a few cycles on spilling an extra register. I could use an alternative that avoids the spill on some architectures, but gcc would need to be aware of it. It would mean adding an additional function that returns both start and end; on x64 it would be identical to memcpy and would return the end in rdx due to the calling convention. So if you want to exploit both return values you would need to use the following function (sketched below):

struct { void *start; void *end; } memcpcpy (void *, void *, size_t);

Then there is memmove. I have on my todo list to alias memcpy to memmove: you don't need to check for n < 128 before a loop where you do loads before stores, and you can do the check in parallel with preparing the loop, so it costs maybe an extra cycle. A second possibility is to use a generalized tail-call variant that takes the return value as an additional parameter. It could be used internally, so the question is whether converting calls to tail calls, when the only difference is the return value, is worth the effort of a compiler pass:

__memcpy_tail (void *dst, void *src, size_t n, void *ret)

A problem with that approach is branch predictability: for memmove you need to do a backward copy 1/3 of the time. A possible optimization would be a header check for whether you could always use __forward_copy or __backward_copy. A __forward_copy would be aliased to memcpy, but that could confuse gcc. Again, the compiler would need to be aware of these.

Then you have memset/bzero. Here, on most platforms, you should convert every

memset (x, 0, z) -> __bzero (x, z)

This saves several cycles on creating the zero mask and has one less parameter; right now on x64 it is a 13-byte stub that jumps into memset. If gcc started calling the bzero variant we could make memset the stub instead, as it is very rare to use a c that isn't a zero constant. There is the problem that __bzero doesn't return anything while memset does. We could introduce a variant, say __bzero_ret, that does. A second possibility would be to add a tail-call variant if that would be useful. A possible header redirect is sketched below.

We have a similar problem with strcpy/stpcpy in that they increase cache pressure with identical code. To unify these you need to go in the strcpy => stpcpy direction, as getting the length again is expensive; gcc does the opposite conversion. Also, stpcpy is potentially faster than strcpy due to register pressure, as you don't need to save the start as the return value. We could merge these with the same trick as memcpcpy to return both results. So which way do you prefer?

There could also be gcc projects to optimize using the end of strings. Right now it doesn't recognize that in

char *r = stpcpy (x, y);
size_t l = strlen (x);

it could compute l as r - x. With end information you could do other optimizations, like optimizing strcat (strcpy (x, y), z) or converting a quadratic strcat loop into a linear one (sketched below). That would also allow optimizing strfoo (x...) into memfoo (x..., endx - x). The biggest win would be the strrchr => memrchr reduction.

For memcmp the problems with expansion are not reductions and are covered in a different thread.
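For reference, a minimal sketch of what that memcpcpy interface could look like; the struct tag and the portable fallback body are only my illustration, not an existing glibc function:

#include <string.h>

/* Both results of the copy: what memcpy would return (start) and what
   mempcpy would return (end).  */
struct memcpcpy_result
{
  void *start;
  void *end;
};

static inline struct memcpcpy_result
memcpcpy (void *dst, const void *src, size_t n)
{
  struct memcpcpy_result r;
  r.start = memcpy (dst, src, n);
  r.end = (char *) dst + n;
  return r;
}

On x64 a two-pointer struct returned by value comes back in rax/rdx, which is the point above about the asm being identical to memcpy plus returning the end in rdx.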
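A minimal sketch of the memset (x, 0, z) -> __bzero (x, z) header redirect, assuming a __bzero symbol like the one discussed above; the guard and macro show what a libc header could do and are not the actual glibc headers:

#include <string.h>

extern void __bzero (void *s, size_t n);

#if defined __GNUC__ && !defined __cplusplus
/* Redirect memset to __bzero only when the fill byte is a constant zero;
   otherwise fall through to the real memset.  */
# define memset(s, c, n)                              \
  (__extension__ ({                                   \
    void *__dst = (s);                                \
    __builtin_constant_p (c) && (c) == 0              \
      ? (__bzero (__dst, (n)), __dst)                 \
      : memset (__dst, (c), (n));                     \
  }))
#endif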
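And to make the quadratic versus linear strcat point concrete, a small example; the function names are mine and dst is assumed to be large enough:

#define _GNU_SOURCE   /* for stpcpy on older glibc */
#include <string.h>

/* Quadratic: every strcat rescans the whole destination to find its end.  */
void
join_quadratic (char *dst, char **words, size_t n)
{
  dst[0] = '\0';
  for (size_t i = 0; i < n; i++)
    strcat (dst, words[i]);
}

/* Linear: keep the running end that stpcpy already computed.  */
void
join_linear (char *dst, char **words, size_t n)
{
  char *end = dst;
  *end = '\0';
  for (size_t i = 0; i < n; i++)
    end = stpcpy (end, words[i]);
}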
Builtins currently miss the reduction

memchr (x, 0, n) => x + strnlen (x, n)

Again, is it worth adding that to gcc, or is it simpler to just expand it in a header?

Then I want to use the conversion strchr -> strchrnul. The main benefit is that you save one unpredictable branch per call (strchr finds c 70.8% of the time on a profile trace). The reason is that the fastest way to implement strchr is equivalent to

char *ret = strchrnul (s, c); // inline.
return (*ret == c) ? ret : NULL;

As a programmer almost always needs to check whether strchr returned NULL (and if he knows it can't happen he should use rawmemchr), code like

if (strchr (s, 'a')) ...

expands to

char *ret = strchrnul (s, 'a');
ret = (*ret == 'a') ? ret : NULL;
if (ret) ...

If one adds the expansion in a header/builtin then gcc will simplify that to the following:

char *ret = strchrnul (s, 'a');
if (*ret == 'a')

A secondary benefit would be an icache saving when a program uses both strchr and strchrnul, which is rare. A transition period would also temporarily increase cache usage. A self-contained expansion is sketched below.

Then there is a missed optimization by gcc for the following reduction chain:

strchr (x, 0) -> strchrnul (x, 0) -> rawmemchr (x, 0) -> x + strlen (x)

Do you want to add that as a separate pass, or isn't adding a few lines in a header simpler?

The next missed reduction is

strdup (x) => s = strlen (x); memcpy (malloc (s + 1), x, s + 1)

This leads to suboptimal code, as gcc doesn't optimize out strdup ("x")[0] == 'x'. Again, is this worth a gcc pass? (This one is also sketched below.)

The next one is identical to the strchrnul case: one should expand

strpbrk (x, y) => i = strcspn (x, y); x[i] ? x + i : NULL;

which also reduces icache pressure (sketched below as well).
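For completeness, the strchr-on-strchrnul expansion as a self-contained inline; the function name is mine, and strchrnul is a GNU extension, so this assumes _GNU_SOURCE:

#define _GNU_SOURCE
#include <string.h>
#include <stddef.h>

static inline char *
strchr_via_strchrnul (const char *s, int c)
{
  char *ret = strchrnul (s, c);   /* points at c or at the terminating 0 */
  return *ret == (char) c ? ret : NULL;
}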
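The strdup reduction written out; I keep it a faithful one-liner of the expansion above, so there is no malloc failure check, which real code would need:

#include <stdlib.h>
#include <string.h>

static inline char *
strdup_expanded (const char *x)
{
  size_t s = strlen (x);
  /* Copies the terminating 0 as well; undefined if malloc returns NULL.  */
  return memcpy (malloc (s + 1), x, s + 1);
}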
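And the strpbrk-on-strcspn expansion as an inline, needing only standard <string.h>; the name is again mine:

#include <string.h>
#include <stddef.h>

static inline char *
strpbrk_via_strcspn (const char *x, const char *y)
{
  size_t i = strcspn (x, y);   /* length of the prefix with no byte from y */
  return x[i] ? (char *) x + i : NULL;
}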