https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77610
Bug ID: 77610
Summary: [sh] memcpy is wrongly inlined even for large copies
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bugdal at aerifal dot cx
Target Milestone: ---
The logic in sh-mem.cc does not suppress inlining of memcpy when the size is
constant but large. This prevents use of a library memcpy, which may be much
faster than the inline version once the threshold of function call overhead is
passed.
At present the reason this is problematic on J2 is that the cache is direct
mapped, so that when source and dest are aligned mod large powers of two
(typical when page-aligned), each write to dest evicts src from the cache,
making memcpy 4-5x slower than it should be. A library memcpy can handle this
by copying cache line size or larger at a time, but the inline memcpy can't.
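
To illustrate the aliasing described above, here is a minimal sketch of how a
direct-mapped cache picks a line for an address. The geometry used in the
usage note (8 KiB cache, 32-byte lines) and the helper names cache_index and
lines_conflict are assumptions for illustration only, not J2's documented
parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the cache line an address maps to in a direct-mapped cache.
 * cache_size and line_size are hypothetical parameters, not J2's real ones. */
static size_t cache_index(uintptr_t addr, size_t cache_size, size_t line_size)
{
    assert(line_size != 0 && cache_size % line_size == 0);
    return (size_t)(addr % cache_size) / line_size;
}

/* Two addresses conflict when they belong to different memory lines that
 * map to the same cache line: loading one evicts the other. */
static int lines_conflict(uintptr_t a, uintptr_t b,
                          size_t cache_size, size_t line_size)
{
    return a / line_size != b / line_size &&
           cache_index(a, cache_size, line_size) ==
           cache_index(b, cache_size, line_size);
}
```

With these assumed numbers, a src at 0x10000 and a dest at 0x24000 differ by a
multiple of the cache size, so lines_conflict(0x10000, 0x24000, 8192, 32) is
true, and it stays true for every matching offset as the copy advances. A
word-at-a-time inline copy therefore alternates between the two lines on each
load/store pair.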
Even if we have a set-associative cache on J-core in the future, I plan to have
Linux provide a vdso memcpy function that can use DMA transfers, which are
several times faster than what you can achieve with any cpu-driven memcpy and
which free up the cpu for other work. However, it's impossible for such a
function to get called as long as gcc is inlining it.
Using -fno-builtin-memcpy is not desirable because we certainly want inline
memcpy for small transfers that would be dominated by function call time (or
where the actual memory accesses can be optimized out entirely and the copy
performed in registers, like memcpy for type punning), just not for large
copies.
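
A rough sketch of the two cases being contrasted. The function names and the
64 KiB size are hypothetical, chosen only for illustration. The first is the
kind of small memcpy that should stay inline; the second is a constant-size
but large copy, which is the case where a call to the library memcpy would be
preferable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "example assumes 32-bit float");

/* Small fixed-size copy used for type punning. A compiler can usually
 * perform this entirely in registers, so inlining is what we want. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical large size; any "constant but large" value would do. */
enum { BIG_COPY_SIZE = 65536 };

/* Constant but large copy. The size is known at compile time, yet the
 * function call overhead is negligible next to the copy itself. */
static void copy_big(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, BIG_COPY_SIZE);
}
```

Building with -fno-builtin-memcpy would turn both of these into library
calls, which is why it is not a satisfactory workaround.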