Jakub Jelinek wrote:
> On Thu, Apr 12, 2018 at 03:52:09PM +0200, Richard Biener wrote:
>> Not sure if I missed some important part of the discussion but
>> for the testcase we want to preserve the tailcall, right? So
>> it would be enough to set avoid_libcall to
>> endp != 0 && CALL_EXPR_TAILCALL (exp) (and thus also handle
>> stpcpy)?

The tailcall issue is just a distraction. Historically the handling of mempcpy
has been horribly inefficient in both GCC and GLIBC for practically all targets.
This is why it was decided to defer to memcpy.

For example, small constant mempcpy was not expanded inline like memcpy until
PR70140 was fixed. Except for a few targets which have added an optimized mempcpy,
the default mempcpy implementation in almost all released GLIBCs is much slower
than memcpy (due to using a badly written C implementation).

Recent GLIBCs now call the optimized memcpy - this is better but still adds extra
call/return overhead. To improve on that, the GLIBC headers have an inline that
changes any call to mempcpy into memcpy (this is the default but can be disabled
on a per-target basis).

Obviously it is best to do this optimization in GCC, which is what we finally do
in GCC 8. Inlining mempcpy means you sometimes miss a tailcall, but this is not
common - in all of GLIBC the inlining on AArch64 adds 166 extra instructions and
12 callee-save registers. This is a small codesize cost to avoid the overhead of
calling the generic C version.

> My preference would be to have non-lame mempcpy etc. on all targets, but the
> aarch64 folks disagree.

The question is who is going to write the 30+ mempcpy implementations for all
those targets which don't have one? And who says doing this is actually going to
improve performance? Having both mempcpy and memcpy typically means more Icache
misses in code that uses both.

So generally it's a good idea to change mempcpy into memcpy by default. It's not
slower than calling mempcpy even if you have a fast implementation, it's faster
if you use an up-to-date GLIBC which calls memcpy, and it's significantly better
when using an old GLIBC.

Wilco
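
For illustration, the mempcpy-to-memcpy rewrite discussed above can be sketched
in C. This is only a minimal sketch: it is not the actual GLIBC header inline nor
the GCC expansion, and the helper name mempcpy_via_memcpy is made up for this
example. It relies only on the fact that memcpy returns its destination pointer,
while mempcpy returns a pointer just past the last byte written.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name invented for this example): expresses mempcpy
   in terms of memcpy, so only the memcpy implementation needs to be fast. */
static inline void *mempcpy_via_memcpy(void *dst, const void *src, size_t n)
{
    /* memcpy returns dst; adding n gives the byte just past the copied data,
       which is the value mempcpy is defined to return. */
    return (char *)memcpy(dst, src, n) + n;
}
```

Because the result is computed after memcpy returns, a call in tail position is
no longer a tailcall, which is the small cost mentioned above.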