https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88751
Bug ID: 88751
Summary: Performance regression reload vs lra
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: krebbel at gcc dot gnu.org
Target Milestone: ---
OpenJ9 shows a large performance drop after its build was updated from GCC
4.8.5 to GCC 7.3.0.
- The performance regression disappears after compiling the byte code
interpreter loop with -mno-lra.
https://github.com/eclipse/openj9/blob/master/runtime/vm/BytecodeInterpreter.hpp
- The problem comes from the frequently accessed _pc and _sp variables being
assigned to stack slots instead of registers. With GCC 4.8 both variables end
up in hard regs.
- The problem can be seen on x86 as well as on S/390.
- In LRA the root cause of the problem is a threshold which prevents LRA from
running the full register coloring step (ira.c):
  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));
For the huge run() function in the byte code interpreter the numbers are:
(gdb) p max_reg_num()
$6 = 27089
(gdb) p last_basic_block_for_fn(cfun)
$7 = 4799
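To make the arithmetic explicit, here is a small standalone sketch of the same comparison. The function names are made up for illustration and are not GCC code. It assumes ira_use_lra_p is true and takes the two counts as plain arguments instead of reading them from the compiler.

```c
#include <stdbool.h>

/* Illustrative stand-in for the check in ira.c.  In GCC the two inputs
   come from max_reg_num () and last_basic_block_for_fn (cfun).  */

/* Number of pseudos at which the simplified algorithms kick in.  */
int
simple_lra_limit (int num_blocks)
{
  return (1 << 26) / num_blocks;
}

/* True if the simplified LRA path would be chosen (assuming LRA is on).  */
bool
uses_simple_lra (int num_pseudos, int num_blocks)
{
  return num_pseudos >= simple_lra_limit (num_blocks);
}
```

With 4799 basic blocks the limit is (1 << 26) / 4799 = 13983 pseudos. run() has 27089, roughly twice that, so lra_simple_p is set.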
Forcing GCC to run the full coloring pass gets the _pc and _sp variables
assigned to hard regs again.
As a quick workaround we might want to turn this threshold into a parameter.
Long-term it would be good if we could either enable the heuristic to estimate
whether full coloring would be beneficial or improve the fallback coloring to
cover such important cases.
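A rough sketch of what the parameterized threshold could look like, written as standalone C rather than as a patch. The threshold_shift argument is hypothetical: it stands for whatever knob would replace the hard-coded 26, and it is not an existing GCC parameter.

```c
#include <stdbool.h>

/* Hypothetical: same check as in ira.c, but with the shift amount passed
   in instead of being fixed at 26.  Assumes 0 <= threshold_shift < 63.  */
bool
uses_simple_lra_with_shift (int num_pseudos, int num_blocks,
                            int threshold_shift)
{
  /* Guard against a degenerate block count; this sketch assumes at
     least one basic block.  */
  if (num_blocks <= 0)
    return false;

  /* Use a 64-bit value so that larger shifts do not overflow int.  */
  long long limit = (1LL << threshold_shift) / num_blocks;
  return num_pseudos >= limit;
}
```

With the numbers from run(), a shift of 26 selects the simplified path as today, while a shift of 28 raises the limit to 55935 pseudos and lets the full coloring run.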