As I commented on the libc-alpha list, for string functions a header expansion is better than builtins from a maintenance perspective, and a header is also a lot easier to write and review than doing the same in gcc. Jeff said that this belongs in gcc. When I asked about the benefits he couldn't name any, so I ask again: what are the benefits of doing it there?
I will split the issues. That most string builtins cause a performance regression and shouldn't be used unless both arguments are constant I will cover in a separate thread. Then there are possibilities to partially expand a function to improve performance. Here I cover the problem of reductions, i.e. that you should use foo instead of bar. The problem is that these depend on the quality of the implementation, so it is questionable how to check for them in gcc.

The first issue is memcpy versus mempcpy. The current glibc consensus is that you should use memcpy and convert mempcpy in a header. The reason is cache pressure: it isn't worth adding 1000 bytes to the instruction cache just to save a few cycles on spilling an extra register. I could use an alternative that avoids the spill on some architectures, but gcc would need to be aware of it. It would mean adding an additional function that returns both start and end; on x64 it would be identical to memcpy and would return the end in rdx due to the calling convention. So if you want to exploit both return values you would need to use the following function (sketched below):

struct { void *start; void *end; } memcpcpy (void *, void *, size_t);

Then there is memmove. I have on my todo list to alias memcpy to memmove: you don't need to check for n < 128 before a loop where you do loads before stores, and you can do the check in parallel with preparing the loop, so it costs maybe an extra cycle. A second possibility is to use a generalized tail-call variant that takes the return value as an additional parameter. It could be used internally, so the question is whether converting calls to tail calls, when the only difference is the return value, is worth the effort of a compiler pass:

__memcpy_tail (void *dst, void *src, size_t n, void *ret)

A problem with that approach is branch predictability: for memmove you need to do a backward copy 1/3 of the time. A possible optimization would be a header check for whether you could always use __forward_copy or __backward_copy. A __forward_copy would be aliased to memcpy, but that could confuse gcc. Again, the compiler would need to be aware of these.

Then you have memset/bzero. Here, on most platforms, you should convert every

memset (x, 0, z) -> __bzero (x, z)

This saves several cycles on creating the zero mask and has one less parameter; right now on x64 it is a 13-byte stub that jumps into memset. If gcc started calling the bzero variant we could make memset the stub instead, as it is very rare to use a c that isn't a zero constant. There is the problem that __bzero doesn't return anything while memset does. We could introduce a variant, say __bzero_ret, that does. A second possibility would be to add a tail-call variant if that would be useful. A possible header redirect is sketched below.

We have a similar problem with strcpy/stpcpy in that they increase cache pressure with identical code. To unify these you need to go in the strcpy => stpcpy direction, as getting the length again is expensive; gcc does the opposite conversion. Also, stpcpy is potentially faster than strcpy due to register pressure, as you don't need to save the start as the return value. We could merge these with the same trick as memcpcpy to return both results. So which way do you prefer?

There could also be gcc projects to optimize using the end of strings. Right now it doesn't recognize that in

char *r = stpcpy (x, y);
size_t l = strlen (x);

it could compute l as r - x. With end information you could do other optimizations, like optimizing strcat (strcpy (x, y), z) or converting a quadratic strcat loop into a linear one (sketched below). That would also allow optimizing strfoo (x...) into memfoo (x..., endx - x). The biggest win would be the strrchr => memrchr reduction.

For memcmp the problems with expansion are not reductions and are covered in a different thread.
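For reference, a minimal sketch of what that memcpcpy interface could look like; the struct tag and the portable fallback body are only my illustration, not an existing glibc function:

#include <string.h>

/* Both results of the copy: what memcpy would return (start) and what
   mempcpy would return (end).  */
struct memcpcpy_result
{
  void *start;
  void *end;
};

static inline struct memcpcpy_result
memcpcpy (void *dst, const void *src, size_t n)
{
  struct memcpcpy_result r;
  r.start = memcpy (dst, src, n);
  r.end = (char *) dst + n;
  return r;
}

On x64 a two-pointer struct returned by value comes back in rax/rdx, which is the point above about the asm being identical to memcpy plus returning the end in rdx.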
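A minimal sketch of the memset (x, 0, z) -> __bzero (x, z) header redirect, assuming a __bzero symbol like the one discussed above; the guard and macro show what a libc header could do and are not the actual glibc headers:

#include <string.h>

extern void __bzero (void *s, size_t n);

#if defined __GNUC__ && !defined __cplusplus
/* Redirect memset to __bzero only when the fill byte is a constant zero;
   otherwise fall through to the real memset.  */
# define memset(s, c, n)                              \
  (__extension__ ({                                   \
    void *__dst = (s);                                \
    __builtin_constant_p (c) && (c) == 0              \
      ? (__bzero (__dst, (n)), __dst)                 \
      : memset (__dst, (c), (n));                     \
  }))
#endif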
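And to make the quadratic versus linear strcat point concrete, a small example; the function names are mine and dst is assumed to be large enough:

#define _GNU_SOURCE   /* for stpcpy on older glibc */
#include <string.h>

/* Quadratic: every strcat rescans the whole destination to find its end.  */
void
join_quadratic (char *dst, char **words, size_t n)
{
  dst[0] = '\0';
  for (size_t i = 0; i < n; i++)
    strcat (dst, words[i]);
}

/* Linear: keep the running end that stpcpy already computed.  */
void
join_linear (char *dst, char **words, size_t n)
{
  char *end = dst;
  *end = '\0';
  for (size_t i = 0; i < n; i++)
    end = stpcpy (end, words[i]);
}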
Builtins currently miss the reduction

memchr (x, 0, n) => x + strnlen (x, n)

Again, is it worth adding that to gcc, or is it simpler to just expand it in a header?

Then I want to use the conversion strchr -> strchrnul. The main benefit is that you save one unpredictable branch per call (strchr finds c 70.8% of the time on a profile trace). The reason is that the fastest way to implement strchr is equivalent to

char *ret = strchrnul (s, c); // inline.
return (*ret == c) ? ret : NULL;

As a programmer almost always needs to check whether strchr returned NULL (and if he knows it can't happen he should use rawmemchr), code like

if (strchr (s, 'a')) ...

expands to

char *ret = strchrnul (s, 'a');
ret = (*ret == 'a') ? ret : NULL;
if (ret) ...

If one adds the expansion in a header/builtin then gcc will simplify that to the following:

char *ret = strchrnul (s, 'a');
if (*ret == 'a')

A secondary benefit would be an icache saving when a program uses both strchr and strchrnul, which is rare. A transition period would also temporarily increase cache usage. A self-contained expansion is sketched below.

Then there is a missed optimization by gcc for the following reduction chain:

strchr (x, 0) -> strchrnul (x, 0) -> rawmemchr (x, 0) -> x + strlen (x)

Do you want to add that as a separate pass, or isn't adding a few lines in a header simpler?

The next missed reduction is

strdup (x) => s = strlen (x); memcpy (malloc (s + 1), x, s + 1)

This leads to suboptimal code, as gcc doesn't optimize out strdup ("x")[0] == 'x'. Again, is this worth a gcc pass? (This one is also sketched below.)

The next one is identical to the strchrnul case: one should expand

strpbrk (x, y) => i = strcspn (x, y); x[i] ? x + i : NULL;

which also reduces icache pressure (sketched below as well).
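For completeness, the strchr-on-strchrnul expansion as a self-contained inline; the function name is mine, and strchrnul is a GNU extension, so this assumes _GNU_SOURCE:

#define _GNU_SOURCE
#include <string.h>
#include <stddef.h>

static inline char *
strchr_via_strchrnul (const char *s, int c)
{
  char *ret = strchrnul (s, c);   /* points at c or at the terminating 0 */
  return *ret == (char) c ? ret : NULL;
}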
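The strdup reduction written out; I keep it a faithful one-liner of the expansion above, so there is no malloc failure check, which real code would need:

#include <stdlib.h>
#include <string.h>

static inline char *
strdup_expanded (const char *x)
{
  size_t s = strlen (x);
  /* Copies the terminating 0 as well; undefined if malloc returns NULL.  */
  return memcpy (malloc (s + 1), x, s + 1);
}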
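And the strpbrk-on-strcspn expansion as an inline, needing only standard <string.h>; the name is again mine:

#include <string.h>
#include <stddef.h>

static inline char *
strpbrk_via_strcspn (const char *x, const char *y)
{
  size_t i = strcspn (x, y);   /* length of the prefix with no byte from y */
  return x[i] ? (char *) x + i : NULL;
}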