Re: [go-nuts] gc: optimize JMP to RET instructions

Arseny Samoylov Wed, 14 Aug 2024 09:46:37 -0700

> Won’t the speculative/parallel execution by most processors make the JMP 
> essentially a no-op?
I guess you are right, but this holds only when the JMP destination is already 
in the instruction buffer. In most of these cases the JMP leads to a RET 
inside the same function, so this optimization will indeed have almost zero 
effect. But if the RET instruction is far enough away, I think this 
optimization can be meaningful.
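
To make the trade-off concrete, here is a minimal sketch in Go of a distance-based heuristic. It is a toy model, not the real cmd/internal/obj code: the `Inst` type, the `rewriteFarJumps` function, and the `minDist` threshold are hypothetical names introduced only for illustration. The one figure taken from this thread is that a RET can expand to three machine instructions on arm64 (ldp, add, ret).

```go
package main

import "fmt"

// Op is a toy opcode; these are not the real cmd/internal/obj constants.
type Op int

const (
	OpOther Op = iota
	OpJMP
	OpRET
)

// Inst is a toy stand-in for a Prog: an opcode plus an optional branch
// target, given as an index into the instruction slice (-1 means none).
type Inst struct {
	Op     Op
	Target int
}

// retWords is the number of machine instructions a RET expands to when the
// function has a frame (ldp, add, ret on arm64, as described in the thread).
const retWords = 3

// rewriteFarJumps returns a copy of prog in which every JMP whose target is
// a RET at least minDist instructions away is replaced by a RET. It also
// returns the resulting change in code size, counted in machine instructions.
// minDist is a hypothetical tuning knob, not an existing compiler setting.
func rewriteFarJumps(prog []Inst, minDist int) ([]Inst, int) {
	out := make([]Inst, len(prog))
	copy(out, prog)
	delta := 0
	for i, in := range prog {
		if in.Op != OpJMP || in.Target < 0 || in.Target >= len(prog) {
			continue
		}
		if prog[in.Target].Op != OpRET {
			continue
		}
		dist := in.Target - i
		if dist < 0 {
			dist = -dist
		}
		if dist < minDist {
			continue // near jump: keep it, the target is likely already fetched
		}
		out[i] = Inst{Op: OpRET, Target: -1}
		delta += retWords - 1 // one JMP becomes retWords instructions
	}
	return out, delta
}

func main() {
	prog := []Inst{
		{OpJMP, 5}, // far from the RET
		{OpOther, -1},
		{OpOther, -1},
		{OpOther, -1},
		{OpJMP, 5}, // right next to the RET
		{OpRET, -1},
	}
	out, delta := rewriteFarJumps(prog, 4)
	for i, in := range out {
		fmt.Printf("%d: op=%d target=%d\n", i, in.Op, in.Target)
	}
	fmt.Printf("size delta: %+d instructions\n", delta)
}
```

With `minDist` set to 4, only the first JMP is rewritten and the toy function grows by two instructions; the near JMP is left alone. Whether any such threshold pays off would still have to be shown by benchmarks.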


On Wednesday 14 August 2024 at 19:40:22 UTC+3 robert engels wrote:

> Won’t the speculative/parallel execution by most processors make the JMP 
> essentially a no-op?
>
> See 
> https://stackoverflow.com/questions/5127833/meaningful-cost-of-the-jump-instruction
>
> On Aug 14, 2024, at 11:31 AM, Arseny Samoylov <samoylo...@gmail.com> 
> wrote:
>
> Thank you for your answer!
>
> > We generally don't do optimizations like that directly on assembly.
> I definitely agree. But this pattern also appears in generated code.
>
> > and concerns about debuggability (can you set a breakpoint on each 
> > return in the source?) also matter
> This is an interesting problem that I haven't thought about, thank you!
>
> > That is a JMP to the LDP instruction, not directly to the RET.
> Yes, but in the Prog representation it is. I mentioned this when I pointed 
> out the problem of increased code size (RET translates to multiple instructions).
>
> >  There's some discussion here https://github.com/golang/go/issues/24936
> I am grateful for the link to the discussion. In that discussion, you 
> mentioned your abandoned CL 
> <https://github.com/golang/go/issues/24936#issuecomment-383253003> that 
> actually does the opposite of my optimization =).
>
> >  It would need benchmarks demonstrating it is worth it
> Can you please provide some suggestions for benchmarks? I tried bent, but 
> I would like to test on some other benchmarks. 
>
> Thank you in advance!
> On Wednesday 14 August 2024 at 03:59:55 UTC+3 Keith Randall wrote:
>
>> We generally don't do optimizations like that directly on assembly. In 
>> fact, we used to do some like that but they have been removed.
>> We want the generated machine code to faithfully mirror the assembly 
>> input. People writing assembly have all kinds of reasons for laying out 
>> instructions in particular ways (better for various caches, etc) that we 
>> don't want to disrupt.
>>
>> If the Go compiler is generating such a pattern, we can optimize that. 
>> There's some discussion here https://github.com/golang/go/issues/24936 
>> but nothing substantive came of it. It would need benchmarks demonstrating 
>> it is worth it, and concerns about debuggability (can you set a breakpoint 
>> on each return in the source?) also matter.
>>
>> > Ps: example of JMP to RET from runtime:
>>
>> That is a JMP to the LDP instruction, not directly to the RET.
>> On Tuesday, August 13, 2024 at 10:10:58 AM UTC-7 Arseny Samoylov wrote:
>>
>>> Hello community, recently I found that gc generates a lot of JMP to RET 
>>> instructions and there is no optimization for that. Consider this example:
>>>
>>> ```
>>> // asm_arm64.s
>>> #include "textflag.h"
>>>
>>>  
>>> TEXT ·jmp_to_ret(SB), NOSPLIT, $0-0
>>>     JMP ret
>>> ret:
>>>     RET
>>> ```
>>> This compiles to:
>>> ```
>>> TEXT main.jmp_to_ret.abi0(SB) asm_arm64.s
>>>   asm_arm64.s:4         0x77530                 14000001                JMP 1(PC)
>>>   asm_arm64.s:6         0x77534                 d65f03c0                RET
>>> ```
>>>
>>> Obviously, this can be optimized to just a RET instruction.
>>> So I made a patch that replaces a JMP to RET with a RET instruction (on 
>>> the Prog representation):
>>> ```
>>> diff --git a/src/cmd/internal/obj/pass.go b/src/cmd/internal/obj/pass.go
>>> index 066b779539..87f1121641 100644
>>> --- a/src/cmd/internal/obj/pass.go
>>> +++ b/src/cmd/internal/obj/pass.go
>>> @@ -174,8 +174,16 @@ func linkpatch(ctxt *Link, sym *LSym, newprog ProgAlloc) {
>>>                         continue
>>>                 }
>>>                 p.To.SetTarget(brloop(p.To.Target()))
>>> -               if p.To.Target() != nil && p.To.Type == TYPE_BRANCH {
>>> -                       p.To.Offset = p.To.Target().Pc
>>> +               if p.To.Target() != nil {
>>> +                       if p.As == AJMP && p.To.Target().As == ARET {
>>> +                               p.As = ARET
>>> +                               p.To = p.To.Target().To
>>> +                               continue
>>> +                       }
>>> +
>>> +                       if p.To.Type == TYPE_BRANCH {
>>> +                               p.To.Offset = p.To.Target().Pc
>>> +                       }
>>>                 }
>>>         }
>>>  }
>>> ```
>>> You can find this patch on my GH 
>>> <https://github.com/ArsenySamoylov/go/tree/obj-linkpatch-jmp-to-ret>.
>>>
>>> I encountered a few problems:
>>> * Increase in code size, because a RET instruction can translate into 
>>> multiple instructions (ldp, add, and ret on arm64, for example): the 
>>> .text section of a simple Go program that calls the function above 
>>> grows by 0x3D0 bytes; the go binary itself grows by 0x2570 bytes (almost 
>>> 10KB) in .text section size (this is for arm64 binaries).
>>> * Optimization on the Prog representation is too late, and the example 
>>> above translates to:
>>> ```
>>> TEXT main.jmp_to_ret.abi0(SB) asm_arm64.s
>>>   asm_arm64.s:4         0x77900                 d65f03c0                RET
>>>   asm_arm64.s:6         0x77904                 d65f03c0                RET
>>> ```
>>> (no dead code elimination was done =( )
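
The leftover second RET is unreachable: nothing falls through a RET, and no branch targets it any more. A minimal sketch of the reachability check that would find it, written in Go over a toy instruction list, follows. The `Inst` type and the `reachable` function are hypothetical and model only unconditional jumps, returns, and fall-through; this is not how cmd/internal/obj represents or processes code, and nothing here implies the assembler should perform such a pass.

```go
package main

import "fmt"

// Op is a toy opcode; these are not the real cmd/internal/obj constants.
type Op int

const (
	OpOther Op = iota
	OpJMP
	OpRET
)

// Inst is a toy instruction. Target is the index of the branch target for
// OpJMP and -1 otherwise.
type Inst struct {
	Op     Op
	Target int
}

// reachable reports, for each instruction, whether control can reach it when
// execution starts at index 0. OpOther falls through to the next instruction,
// OpJMP transfers control only to its target, and OpRET has no successors.
func reachable(prog []Inst) []bool {
	seen := make([]bool, len(prog))
	work := []int{0}
	for len(work) > 0 {
		i := work[len(work)-1]
		work = work[:len(work)-1]
		if i < 0 || i >= len(prog) || seen[i] {
			continue
		}
		seen[i] = true
		switch prog[i].Op {
		case OpJMP:
			work = append(work, prog[i].Target)
		case OpRET:
			// A return ends this path.
		default:
			work = append(work, i+1)
		}
	}
	return seen
}

func main() {
	// The example from this thread after the patch: two consecutive RETs.
	patched := []Inst{{OpRET, -1}, {OpRET, -1}}
	for i, ok := range reachable(patched) {
		fmt.Printf("inst %d reachable: %v\n", i, ok)
	}
}
```

On the patched example only the first RET is marked reachable, while on the original `JMP ret; ret: RET` sequence both instructions are.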
>>>
>>> So I am looking for some ideas. Maybe this optimization should be done 
>>> on SSA form and needs some heuristics (to avoid an increase in code size).
>>> I would also like suggestions on where to benchmark my optimization. 
>>> The bent benchmark suite takes too long to run =(.
>>>
>>> Ps: example of JMP to RET from runtime:
>>> ```
>>> TEXT runtime.strequal(SB) a/go/src/runtime/alg.go
>>> …
>>>   alg.go:378            0x12eac                 14000004                JMP 4(PC) // JMP to RET in Prog
>>>   alg.go:378            0x12eb0                 f9400000                MOVD (R0), R0
>>>   alg.go:378            0x12eb4                 f9400021                MOVD (R1), R1
>>>   alg.go:378            0x12eb8                 97fffc72                CALL runtime.memequal(SB)
>>>   alg.go:378            0x12ebc                 a97ffbfd                LDP -8(RSP), (R29, R30)
>>>   alg.go:378            0x12ec0                 9100c3ff                ADD $48, RSP, RSP
>>>   alg.go:378            0x12ec4                 d65f03c0                RET
>>> ...
>>> ```
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/1e280aca-1ccc-4aca-9d32-83ecddce50c3n%40googlegroups.com.
