Thank you for your answer!

> We generally don't do optimizations like that directly on assembly.

I definitely agree. But this is also a pattern that shows up in generated code.
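To make "a pattern in generated code" concrete, here is a minimal sketch (my own illustration, not taken from the runtime; the function names are made up). Depending on the architecture, block layout, and compiler version, the early return below may be lowered to a branch whose Prog-level target is the RET of the shared epilogue, the same shape as the runtime.strequal listing in the Ps at the end of this message:

```
// jmp_to_ret_example.go — hypothetical example, names are made up.
package main

//go:noinline
func expensive(n int) int {
	// A real call forces maybeCompute to keep a stack frame, so its epilogue
	// on arm64 is more than a bare RET (LDP + ADD + RET).
	s := 0
	for i := 0; i < n; i++ {
		s += i
	}
	return s
}

//go:noinline
func maybeCompute(n int) int {
	if n <= 0 {
		return 0 // this early return may become a jump to the shared epilogue
	}
	return expensive(n)
}

func main() {
	println(maybeCompute(10))
}
```

Whether a JMP actually appears can be checked with `go build -gcflags=-S` or `go tool objdump -s maybeCompute` on the resulting binary; I am not claiming this exact function always reproduces it, it is only a sketch of the kind of source I have in mind.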
> and concerns about debuggability (can you set a breakpoint on each return
> in the source?) also matter

This is an interesting problem that I hadn't thought about, thank you!

> That is a JMP to the LDP instruction, not directly to the RET.

Yes, but in the Prog representation it is a JMP to a RET. I mentioned this
when I pointed out the problem with the increase in code size (a single RET
Prog translates to multiple machine instructions).

> There's some discussion here https://github.com/golang/go/issues/24936

I am grateful for the link to the discussion. In it you mentioned your
abandoned CL
<https://github.com/golang/go/issues/24936#issuecomment-383253003>, which
actually does the opposite of my optimization =).

> It would need benchmarks demonstrating it is worth it

Can you please suggest some benchmarks? I tried bent, but I would like to
test on some other benchmarks as well. Thank you in advance!

On Wednesday 14 August 2024 at 03:59:55 UTC+3 Keith Randall wrote:

> We generally don't do optimizations like that directly on assembly. In
> fact, we used to do some like that but they have been removed.
> We want the generated machine code to faithfully mirror the assembly
> input. People writing assembly have all kinds of reasons for laying out
> instructions in particular ways (better for various caches, etc.) that we
> don't want to disrupt.
>
> If the Go compiler is generating such a pattern, we can optimize that.
> There's some discussion here https://github.com/golang/go/issues/24936
> but nothing substantive came of it. It would need benchmarks demonstrating
> it is worth it, and concerns about debuggability (can you set a breakpoint
> on each return in the source?) also matter.
>
> > Ps: example of JMP to RET from runtime:
>
> That is a JMP to the LDP instruction, not directly to the RET.
>
> On Tuesday, August 13, 2024 at 10:10:58 AM UTC-7 Arseny Samoylov wrote:
>
>> Hello community, recently I found that gc generates a lot of JMP-to-RET
>> instructions and there is no optimization for that. Consider this example:
>>
>> ```
>> // asm_arm64.s
>> #include "textflag.h"
>>
>> TEXT ·jmp_to_ret(SB), NOSPLIT, $0-0
>> 	JMP ret
>> ret:
>> 	RET
>> ```
>>
>> This compiles to:
>>
>> ```
>> TEXT main.jmp_to_ret.abi0(SB) asm_arm64.s
>>   asm_arm64.s:4	0x77530	14000001	JMP 1(PC)
>>   asm_arm64.s:6	0x77534	d65f03c0	RET
>> ```
>>
>> Obviously, it can be optimized to just a RET instruction.
>>
>> So I made a patch that replaces a JMP to RET with a RET instruction (on
>> the Prog representation):
>>
>> ```
>> diff --git a/src/cmd/internal/obj/pass.go b/src/cmd/internal/obj/pass.go
>> index 066b779539..87f1121641 100644
>> --- a/src/cmd/internal/obj/pass.go
>> +++ b/src/cmd/internal/obj/pass.go
>> @@ -174,8 +174,16 @@ func linkpatch(ctxt *Link, sym *LSym, newprog ProgAlloc) {
>>  			continue
>>  		}
>>  		p.To.SetTarget(brloop(p.To.Target()))
>> -		if p.To.Target() != nil && p.To.Type == TYPE_BRANCH {
>> -			p.To.Offset = p.To.Target().Pc
>> +		if p.To.Target() != nil {
>> +			if p.As == AJMP && p.To.Target().As == ARET {
>> +				p.As = ARET
>> +				p.To = p.To.Target().To
>> +				continue
>> +			}
>> +
>> +			if p.To.Type == TYPE_BRANCH {
>> +				p.To.Offset = p.To.Target().Pc
>> +			}
>>  		}
>>  	}
>>  }
>> ```
>>
>> You can find this patch on my GH
>> <https://github.com/ArsenySamoylov/go/tree/obj-linkpatch-jmp-to-ret>.
>>
>> I encountered a few problems:
>>
>> * Increase in code size: a RET Prog can translate into multiple machine
>> instructions (LDP, ADD, and RET on arm64, for example). The .text section
>> of a simple Go program that calls the function above grows by 0x3D0
>> bytes; the .text section of the go binary itself grows by 0x2570 bytes
>> (almost 10KB). (These numbers are for arm64 binaries.)
>>
>> * Optimization on the Prog representation happens too late, so the
>> example above translates to:
>>
>> ```
>> TEXT main.jmp_to_ret.abi0(SB) asm_arm64.s
>>   asm_arm64.s:4	0x77900	d65f03c0	RET
>>   asm_arm64.s:6	0x77904	d65f03c0	RET
>> ```
>>
>> (no dead-code elimination was done =( )
>>
>> So I am looking for ideas. Maybe this optimization should be done on the
>> SSA form, with some heuristics to avoid the increase in code size.
>>
>> I would also like suggestions on where to benchmark my optimization. The
>> bent benchmark suite takes too long =(.
>>
>> Ps: example of a JMP to RET from the runtime:
>>
>> ```
>> TEXT runtime.strequal(SB) a/go/src/runtime/alg.go
>> …
>>   alg.go:378	0x12eac	14000004	JMP 4(PC)	// JMP to RET in Prog
>>   alg.go:378	0x12eb0	f9400000	MOVD (R0), R0
>>   alg.go:378	0x12eb4	f9400021	MOVD (R1), R1
>>   alg.go:378	0x12eb8	97fffc72	CALL runtime.memequal(SB)
>>   alg.go:378	0x12ebc	a97ffbfd	LDP -8(RSP), (R29, R30)
>>   alg.go:378	0x12ec0	9100c3ff	ADD $48, RSP, RSP
>>   alg.go:378	0x12ec4	d65f03c0	RET
>>   ...
>> ```