https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66010
--- Comment #2 from vries at gcc dot gnu.org ---
Created attachment 35460
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35460&action=edit
Demonstrator patch

I tracked down which passes are responsible for making the code for f1 and f2
look the same in the optimized dump.

pass_dominator rewrites the va_arg memory access to use the same argument as
the va_start:
...
  # .MEM_2 = VDEF <.MEM_1(D)>
  # USE = nonlocal escaped
  # CLB = nonlocal escaped { D.1844 } (escaped)
  __builtin_va_startD.1030 (&apD.1844, 0);
  # .MEM_3 = VDEF <.MEM_2>
  apD.1859 = &apD.1844;
  ap.7_10 = &apD.1844;
  # VUSE <.MEM_3>
  _11 = MEM[(struct *)&apD.1844].gp_offsetD.2;
...

pass_sra manages to get rid of the vdef on 'ap_3 = &apD.1844' (previously
'apD.1859 = &apD.1844'):
...
  # .MEM_2 = VDEF <.MEM_1(D)>
  # USE = nonlocal escaped
  # CLB = nonlocal escaped { D.1844 } (escaped)
  __builtin_va_startD.1030 (&apD.1844, 0);
  ap_3 = &apD.1844;
  ap.7_10 = &apD.1844;
  # VUSE <.MEM_2>
  _11 = MEM[(struct *)&apD.1844].gp_offsetD.2;
...

and pass_dce removes the dead code, leaving us with:
...
  # .MEM_2 = VDEF <.MEM_1(D)>
  # USE = nonlocal escaped
  # CLB = nonlocal escaped { D.1836 } (escaped)
  __builtin_va_startD.1030 (&apD.1836, 0);
  # VUSE <.MEM_2>
  _9 = apD.1836.gp_offsetD.2;
...

By squeezing these 3 passes in between the lowering and optimization parts of
pass_stdarg, we manage to get:
...
f2: va_list escapes 0, needs to save 8 GPR units and 0 FPR units.
...
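For context, a testcase along the following lines is consistent with the dumps above. This is a hypothetical reconstruction: the comment does not show the bodies of f1 and f2, and the helper name f2_1 is an assumption. The idea is that f1 applies va_arg to its local va_list directly, while f2 hands the va_list to an inlined helper. On x86_64, va_list is an array type, so the helper's parameter is really a pointer to the caller's va_list, which matches the 'apD.1859 = &apD.1844' copy seen in the pass_dominator dump.

```c
#include <stdarg.h>

/* Hypothetical f1: va_arg is applied directly to the local va_list.  */
int
f1 (int n, ...)
{
  va_list ap;
  int res;

  va_start (ap, n);
  res = va_arg (ap, int);
  va_end (ap);
  return res;
}

/* Hypothetical helper (name assumed).  On x86_64, va_list is an array
   type, so this parameter decays to a pointer to the caller's va_list.  */
static inline int
f2_1 (va_list ap)
{
  return va_arg (ap, int);
}

/* Hypothetical f2: same as f1, but va_arg happens in the inlined helper.  */
int
f2 (int n, ...)
{
  va_list ap;
  int res;

  va_start (ap, n);
  res = f2_1 (ap);
  /* After f2_1 used va_arg on it, ap is indeterminate here; va_end is
     still required and is the only permitted further use.  */
  va_end (ap);
  return res;
}
```

Once the va_arg access in f2 goes through &ap directly, as in the pass_dce dump, the stdarg analysis can presumably treat f2 like f1, which is what the 'va_list escapes 0' line reports.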