Hi everyone,

I am writing the internals documentation for the modulo scheduler, which I will submit together with my upcoming series of patches. I would like your opinion on the attached texinfo file. Also, are these UTF-8 box-drawing characters acceptable in our docs?

It is still a work in progress; I have focused on detailing what will always be true. See the attached file. To render it, use:

texi2any --html modulo-scheduler.texi --no-split -c HTML_MATH=mathjax -o modulo-scheduler.html

Benjamin
@c Copyright (C) 2019-2024 Free Software Foundation, Inc.
@c This is part of the GCC manual.
@c For copying conditions, see the file gcc.texi.
@c Contributed by Benjamin Priour <vultk...@gcc.gnu.org>.

@title Modulo Scheduler Internals

@node Modulo Scheduler
@chapter Modulo Scheduler
@cindex SMS
@cindex modulo scheduler

@menu
* Modulo Scheduler Overview::       Overview
* Pruned DDG::                      How to make SMS more likely to succeed
* SMS interoperability::            Interoperability with other passes
* Debugging the Modulo Scheduler::  Useful debugging tips
@end menu

@node Modulo Scheduler Overview
@section Modulo Scheduler Overview
@cindex modulo scheduler
@cindex SMS

@subsubheading Background

@enumerate
@item
J. Llosa, A. Gonzalez, E. Ayguade, M. Valero, and J. Eckhardt.
@cite{@url{https://ieeexplore.ieee.org/document/910814, Lifetime-sensitive modulo scheduling in a production environment}}.
@item
J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero.
@cite{@url{https://www.researchgate.net/publication/2941400_Swing_Modulo_Scheduling_A_Lifetime-Sensitive_Approach, Swing Modulo Scheduling: A Lifetime Sensitive Approach}}.
@item
M. Hagog and A. Zaks.
@cite{@url{https://www.researchgate.net/publication/228937160_Swing_modulo_scheduling_for_gcc, Swing modulo scheduling for gcc}}.
@end enumerate

@subsubheading Related Options

@itemize @bullet
@item @option{-fmodulo-sched} Turns on the modulo scheduling optimization.
@item @option{--param=sms-dfa-history=}
@item @option{--param=sms-loop-average-count-threshold=}
@item @option{--param=sms-max-ii-factor=}
@item @option{--param=sms-min-sc=}
@end itemize

@subsection Introduction

@example
The original loop takes 11 cycles, with many stalled cycles
(i.e. cycles in which no resource is used).
@t{Iteration N}
╓─────────────────╥─────────────────╥─────────────────╖
║       @b{LSU}       ║       @b{FPU}       ║      @b{Other}      ║
╠═════════════════╬═════════════════╬═════════════════╣
║ v1 = [x4*8 +x1] ║     @i{wasted}      ║     @i{wasted}      ║
╠═════════════════╬═════════════════╬═════════════════╣
║ v2 = [x4*8 +x2] ║     @i{wasted}      ║     @i{wasted}      ║
╠═════════════════╬═════════════════╬═════════════════╣
║                  @i{CYCLE} @i{IS} @i{STALLED}                   ║
╠═════════════════╬═════════════════╬═════════════════╣
║     @i{wasted}      ║ v1 = v1*v0 + v2 ║     @i{wasted}      ║
╠═════════════════╬═════════════════╬═════════════════╣
║                  @i{CYCLE} @i{IS} @i{STALLED}                   ║
╠═════════════════╬═════════════════╬═════════════════╣
║                  @i{CYCLE} @i{IS} @i{STALLED}                   ║
╠═════════════════╬═════════════════╬═════════════════╣
║                  @i{CYCLE} @i{IS} @i{STALLED}                   ║
╠═════════════════╬═════════════════╬═════════════════╣
║                  @i{CYCLE} @i{IS} @i{STALLED}                   ║
╠═════════════════╬═════════════════╬═════════════════╣
║ [x4*8 + x3] = v1║     @i{wasted}      ║   x4 = x4 + 1   ║
╠═════════════════╬═════════════════╬═════════════════╣
║     @i{wasted}      ║     @i{wasted}      ║ cc = cmp(x0, x4)║
╠═════════════════╬═════════════════╬═════════════════╣
║     @i{wasted}      ║     @i{wasted}      ║      jump       ║
╙─────────────────╨─────────────────╨─────────────────╜

@t{Iteration N}   Visualize that we slide successive iterations
╓────╥────╥────╖  each II cycles apart to build a new kernel.
║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ╠════╬════╬════╣ ║ i1 ║ ║ ║ ╠════╬════╬════╣ ║ i2 ║ ║ ║ @t{Iteration N+1} ╠════╬════╬════╣╓────╥────╥────╖ ║ ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ╠════╬════╬════╣╠════╬════╬════╣ ║ ║ i3 ║ ║║ i1 ║ ║ ║ ╠════╬════╬════╣╠════╬════╬════╣ ║ ║║ i2 ║ ║ ║ @t{Iteration N+2} ╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖ ║ ║║ ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ║ ║║ ║ i3 ║ ║║ i1 ║ ║ ║ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ║ ║║ ║║ i2 ║ ║ ║ @t{Iteration N+3} ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖ ║ i4 ║ ║ i5 ║║ ║║ ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ║ ║ ║ i6 ║║ ║║ ║ i3 ║ ║║ i1 ║ ║ ║ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ║ ║ ║jump║║ ║║ ║║ i2 ║ ║ ║ ┅┅╙┅┅┅┅╨┅┅┅┅╨┅┅┅┅╜╠┅┅┅┅╬┅┅┅┅╬┅┅┅┅╣╠┅┅┅┅╬┅┅┅┅╬┅┅┅┅╣╠┅┅┅┅╬┅┅┅┅╬┅┅┅┅╣┅┅ N ENDS ┅┅ ║ i4 ║ ║ i5 ║║ ║║ ║ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ║ ║ ║ i6 ║║ ║║ ║ i3 ║ ║ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ║ ║ ║jump║║ ║║ ║ ╙────╨────╨────╜╠════╬════╬════╣╠════╬════╬════╣ ║ i4 ║ ║ i5 ║║ ║ ╠════╬════╬════╣╠════╬════╬════╣ ║ ║ ║ i6 ║║ ║ ╠════╬════╬════╣╠════╬════╬════╣ ║ ║ ║jump║║ ║ ╙────╨────╨────╜╠════╬════╬════╣ Iteration N has ended, there is ║ i4 ║ ║ i5 ║ no point to stack yet another iteration. ╠════╬════╬════╣ ║ ║ ║ i6 ║ ╠════╬════╬════╣ ║ ║ ║jump║ ╙────╨────╨────╜ @t{Iteration N} ╓────╥────╥────╖ ║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ╠════╬════╬════╣━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ┓ ║ i1 ║ ║ ║ Actually we have found a new kernel ╠════╬════╬════╣ containing exactly once every instruction ║ i2 ║ ║ ║ @t{Iteration N+1} without any resource conflicts. 
╠════╬════╬════╣╓────╥────╥────╖ ┃ ║ ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ┃ ╠════╬════╬════╣╠════╬════╬════╣ ┃ ║ ║ i3 ║ ║║ i1 ║ ║ ║ ┃@b{PROLOG} ╠════╬════╬════╣╠════╬════╬════╣ ┃ ║ ║║ i2 ║ ║ ║ @t{Iteration N+2} ┃ ╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖ ┃ ║ ║║ ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ┃ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ┃ ║ ║║ ║ i3 ║ ║║ i1 ║ ║ ║ ┃ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣━ ━ ━ ━ ━ ━ ━ ━ ┛ ║ ║║ ║║ i2 ║ ║ ║ @t{Iteration N+3} ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖ ┏╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣┓ ┃║ i4 ║ ║ i5 ║║ ║║ ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║┃ ┃╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣┃ ┃║ ║ ║ i6 ║║ ║║ ║ i3 ║ ║║ i1 ║ ║ ║┃@b{KERNEL} ┃╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣┃ ┃║ ║ ║jump║║ ║║ ║║ i2 ║ ║ ║┃ ┗╙━━━━╨━━━━╨━━━━╜╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣┛ ┏━ ━ ━ ━ ━ ━ ━ ━ ║ i4 ║ ║ i5 ║║ ║║ ║ ┃ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ┃ @b{EPILOG} ║ ║ ║ i6 ║║ ║║ ║ i3 ║ ║ ┃ ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣ ┃ ║ ║ ║jump║║ ║║ ║ ┃ ╙────╨────╨────╜╠════╬════╬════╣╠════╬════╬════╣ ┃ ║ i4 ║ ║ i5 ║║ ║ ┃╓────╥────╥────╖ ╠════╬════╬════╣╠════╬════╬════╣ ┃║ @b{LSU}║ @b{FPU}║ @b{Oth}║ ║ ║ ║ i6 ║║ ║ ┃╠════╬════╬════╣ ╠════╬════╬════╣╠════╬════╬════╣ ┃┃ i4 ┃ ┃ i5 ┃ ║ ║ ║jump║║ ║ ┃┣━━━━╋━━━━╋━━━━┫ ╙────╨────╨────╜╠════╬════╬════╣ ┃┃ i1 ┃ i3 ┃ i6 ┃ II=3 success! ║ i4 ║ ║ i5 ║ ┃┣━━━━╋━━━━╋━━━━┫ ╠════╬════╬════╣ ┃┃ i2 ┃ ┃jump┃ ║ ║ ║ i6 ║ ┃┖────┸────┸────┚ ╠════╬════╬════╣ ┃ ║ ║ ║jump║ ┗━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ╙────╨────╨────╜ @end example Modulo scheduling is a software pipelining technique. Contrary to an acyclic scheduling algorithm such as list scheduling, a modulo scheduler considers inter-iterations dependencies to produce its schedule. 
More precisely, modulo scheduling is a @emph{global} (it goes beyond
basic block boundaries) and @emph{cyclic} (it processes back edges)
scheduling technique which aims to increase the throughput of
innermost loops.  Modulo scheduling attempts to combine multiple
successive iterations into a new, smaller kernel of length
@strong{@acronym{II, Initiation Interval}} cycles, such that the
hardware resources (ALU, LSU, @dots{}) are reused modulo II.  A modulo
schedule is often represented as a table, where each row represents a
cycle in @math{[0, II-1]} and each column represents an execution
unit.  A modulo scheduler works on a @ref{ddg-def,,Data Dependence
Graph}.  SMS's workflow is given below.

@smallexample
     Create DDG for loop.
             ┃
             v
  Compute MII = Max(RecMII, ResMII)
     and MaxII, set II = MII
             ┃
             ┃<━━━━━━━━━━━ II++ ━━━━━━━━━━━━┓
             v                              ┃
     Schedule nodes from                    ┃
     time 0 to infinity.                    ┃
             ┃                              ┃
             ┠━━━━ If failed in ━━━━━━━━━━━━┨
             ┃     scheduling all insns     ┃
             ┃                              ┃
        If success                          ┃
             ┃                              ┃
             v                              ┃
  Resolve live range overflows              ┃
     with regmoves or MVE.                  ┃
             ┃                              ┃
             ┠━━━ If failed to insert ━━━━━━┛
             ┃    register copies into
             ┃    schedule (MVE: if KUF
             ┃    too large)
             ┃
  If we succeeded in scheduling
      the loop in II cycles
             ┃
             v
  Generate prolog and epilog and decrease the counter of
  the loop.  If runtime counter, do loop versioning.
@end smallexample

@subsubsection Limitations on loops

SMS works with countable loops (1) whose control part can be easily
decoupled from the rest of the loop and (2) whose loop count can be
easily adjusted.  This is because we peel a constant number of
iterations into a prologue and epilogue, for which we want to avoid
emitting the control part, and a kernel which is to iterate that
constant number of iterations less than the original loop.  So the
control part should be a clearly identified set of insns with its own
induction variable, not otherwise used in the loop (at least for now),
which initializes a register before the loop to the number of
iterations.
Currently SMS relies on the do-loop pattern to recognize such loops,
where (1) the control part comprises all insns defining and/or using a
certain @code{count} register and (2) the loop count can be adjusted
by modifying this register prior to the loop.

@subsection The Swing algorithm

The modulo scheduler within GCC uses the Swing algorithm, a
near-optimal modulo scheduling algorithm (see [1,2]).  It was
implemented in 2004 (see [3]) and has received only sporadic attention
since then.  The algorithm is further described in a comment in
@file{modulo-sched.cc}.  These resources should suffice for the reader
to understand the Swing algorithm, so it is not repeated here.

@subsubsection @acronym{ResMII, Resource Minimum Initiation Interval}

@subsubsection @acronym{RecMII, Recurrence Minimum Initiation Interval}

@node Pruned DDG
@section How to make SMS more likely to succeed
@cindex modulo scheduler, pruned DDG
@cindex SMS, pruned DDG

@subsubheading Options

@itemize @bullet
@item @option{-fmodulo-sched-allow-regmoves} Allow SMS to use a pruned
DDG and insert register copies to fix overlaps.
@item @option{-fmodulo-sched-allow-mve} Allow SMS to use a pruned DDG
and unroll the kernel to fix overlaps.
@end itemize

@anchor{ddg-def}
A @acronym{DDG, Data-Dependence Graph} is a directed graph whose nodes
are operations and whose arcs are data dependencies.  A data
dependence may be one of three types: @acronym{RAW, Read-after-Write}
(aka TRUE), @acronym{WAR, Write-after-Read} (aka ANTI) or
@acronym{WAW, Write-after-Write} (aka OUTPUT).  The DDG of a loop is
always cyclic, because for each RAW arc @math{(x,y)} of cost
@code{latency(x)} there is an equivalent WAR arc @math{(y,x)} of cost
@code{latency(y)}: if at iteration N @math{x_N} comes before
@math{y_N}, then this @math{y_N} comes before @math{x_@{N+1@}}, which
inverts the dependency.
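As a toy illustration of this arc duality (an assumed representation,
not GCC's actual DDG structures), every intra-iteration RAW arc
induces a cross-iteration WAR arc at distance 1, and the pruned DDG
described in this section keeps only the RAW arcs:

```python
def add_raw_arc(arcs, x, y, latency_x, latency_y):
    """Record a RAW arc (x, y) together with the WAR arc (y, x) it
    implies: since x_N precedes y_N, y_N must in turn precede x_{N+1},
    i.e. the reverse arc carries a loop-carried distance of 1.
    'arcs' maps (src, dst) -> (kind, latency, distance)."""
    arcs[(x, y)] = ("RAW", latency_x, 0)  # same-iteration true dep
    arcs[(y, x)] = ("WAR", latency_y, 1)  # next-iteration anti dep
    return arcs

def prune(arcs):
    """Pruned DDG: keep only one arc (the RAW one) where there were
    two; live ranges that no longer fit in II cycles are fixed later
    by regmoves or MVE."""
    return {e: a for e, a in arcs.items() if a[0] == "RAW"}
```

The helper names and the tuple layout are hypothetical; they only make
the cycle, and what pruning removes, explicit.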
@quotation Register lifetime
The lifetime of a register is defined as the duration between the
first assignment into the register and its last use, which may occur
in a later iteration.  If the lifetime of a variable is @math{l}, and
an iteration is initiated every @acronym{II} cycles, then at least
@math{r_@{i,min@} = \lceil \frac@{l@}@{II@}\rceil} versions (values)
of that variable must be kept alive simultaneously at all times.
@end quotation

@smallexample
━>: Read-after-Write dep
┅>: Write-after-Read dep
┈>: Write-after-Write dep

      ╭────╮
      │ i1 │
      ╰────╯
       ┃^
       v┋
      ╭────╮
      │ i2 │
      ╰────╯
      ^┃  ┃^
  ┏┅┅┅┛┃  ┃┗┅┅┅┓
  ┋┏━━━┛  ┗━━━┓┋
  ┋┃          ┃┋
  ┋v          v┋
╭────╮      ╭────╮
│ i3 │      │ i4 │
╰────╯      ╰────╯
@end smallexample

Having either arc orders the two nodes relative to one another: it
indicates which one must come before the other.  The second arc also
constrains the distance between them.  This distance constraint
guarantees that, upon a successful schedule at II, every register live
range is shorter than II cycles.  However, this more constrained DDG
makes scheduling more difficult for SMS, and it fails more often.  The
solution therefore is to keep only one arc (the RAW one) where there
are two.  Doing so still guarantees a sound ordering, but some live
ranges may no longer fit into II cycles.  For these, we either insert
register copies (with regmoves) or unroll the kernel (with
@acronym{MVE, Modulo Variable Expansion}) to prevent a definition from
overwriting itself before reaching its use.

@smallexample
━>: Read-after-Write dep
┅>: Write-after-Read dep
┈>: Write-after-Write dep

      ╭────╮
      │ i1 │
      ╰────╯
        ┃
        v
      ╭────╮
      │ i2 │
      ╰────╯
       ┃  ┃
       ┃  ┃
   ┏━━━┛  ┗━━━┓
   ┃          ┃
   v          v
╭────╮      ╭────╮
│ i3 │      │ i4 │
╰────╯      ╰────╯
@end smallexample

@subsection Register to register moves

Register-to-register moves were part of the original 2004
implementation, as an approximation of Modulo Variable Expansion.
Enabled by @option{-fmodulo-sched-allow-regmoves}, this builds a
pruned DDG.
Then, once the DDG's nodes are scheduled, the pass counts the number
of register copies needed for each live range longer than II cycles
and inserts these moves into the schedule.

@subsubsection Limitations

@itemize @bullet
@item Failure to insert the moves into the schedule.
@item Poor synergy with long-latency moves.
@item Increased compile time in case of failure.
@end itemize

@subsection Modulo Variable Expansion
@cindex MVE

@acronym{MVE, Modulo Variable Expansion} is similar to the variable
expansion technique, but comparatively reduces the number of locations
allocated to a register by reusing the same location in
non-overlapping iterations.  Therefore the minimum number of extra
registers for each redefined variable can be lower than the unroll
factor.  It is an alternative to
@option{-fmodulo-sched-allow-regmoves} that performs better in the
presence of long-latency moves, or when the loop is small (in terms of
its number of instructions) and contains long-latency operations.  In
other cases, except for the performance bugs reported below, it
performs similarly to regmoves.  This option is turned on with
@option{-fmodulo-sched-allow-mve}.

@subsubheading Algorithm

@enumerate
@item Pretend that every iteration of the loop has a dedicated
register location for each variable, and remove all inter-iteration
precedence constraints.
@item Modulo schedule as usual.
@item Use the schedule thereby obtained to calculate the lifetime of
all variables using the method explained below.  This step also
determines the minimum @acronym{KUF, Kernel Unroll Factor}.
@item Increment @math{R_i}, the number of registers allocated to each
variable, from @math{r_@{i,min@}} to a factor of @acronym{KUF}.
@item Unroll the loop @acronym{KUF} times to obtain a new larger
kernel made of @acronym{KUF} successive copies of the original kernel.
@item Modulo rename each variable across the prologue, epilogue and
kernels.
@item Generate register copies after the epilogue to preserve
correctness on live-out variables.
@end enumerate

@subsubsection Kernel Unroll Factor

The minimum unrolling factor is simply
@math{\mathsf@{KUF@} = \max_i r_@{i,min@}}.
It can be achieved by setting the effective number of registers
@math{R_i} (original register plus extras) allocated to variable
@math{v_i} to be the smallest factor of @acronym{KUF} that is no
smaller than @math{r_@{i,min@}}, i.e.,
@displaymath
R_i = \min\left\{r_i \;\vert\; (r_i \geq r_{i,min}) \land (\mathsf{KUF} \bmod r_i = 0)\right\}
@end displaymath
In total there will be @math{\sum_i (R_i - 1)} extra registers, and by
construction @math{r_@{max@} = \max_i R_i = \mathsf@{KUF@}}.

GCC currently assumes that the loop consists of a single basic block
without any early exits or control structures (such as @t{if}
statements).  This characteristic allows for a straightforward loop
unrolling implementation, which duplicates all instructions in the
loop, except for the jump instruction, @acronym{KUF} times.  However,
care must be taken to accurately recreate the liveness metadata of the
duplicated instructions.  This is possible because none of the
instructions, other than the jump instruction which is not duplicated,
use or define the iteration counter, eliminating the need for
additional adjustments.

@subsubsection Register Requirements

@cartouche
@b{Stage Variant}

Let @math{\mathcal@{R@}_@{v_i@}} be the non-empty set of @math{r_i}
distinct registers used by variable @math{v_i}.  Let @math{r^j_i, 0
\leq j \leq r_i - 1} be the elements of that set, @math{r^0_i} being
the register found in the original non-renamed instruction.  When a
stage @math{g} is said to be in variant @math{k}, it means that for
every instruction @t{insn} of that stage, if @t{insn} redefines the
variable @math{v_i} with value @math{x}, then it stores this value
@math{x} into register @math{r^@{k \bmod r_i@}_i} of
@math{\mathcal@{R@}_@{v_i@}}.
@end cartouche

All instructions that @b{use} variable @math{v_i} between its last
definition in variant @math{k} and its next definition in variant
@math{(k + 1) \bmod KUF} will use register @math{r^@{k \bmod r_i@}_i}.
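The register counts above can be checked with a short sketch
(illustrative only; the divisor search follows the prose definition of
@math{R_i} as the smallest factor of @acronym{KUF} no smaller than
@math{r_@{i,min@}}, and the helper names are hypothetical):

```python
import math

def r_min(lifetime, ii):
    """r_{i,min} = ceil(l / II): simultaneously live versions of a
    variable of lifetime l when an iteration starts every II cycles."""
    return math.ceil(lifetime / ii)

def unroll_plan(lifetimes, ii):
    """KUF = max_i r_{i,min}; R_i = smallest divisor of KUF that is
    no smaller than r_{i,min}, so each variable's register set
    rotates evenly within the unrolled kernel."""
    mins = [r_min(l, ii) for l in lifetimes]
    kuf = max(mins)
    regs = [min(r for r in range(m, kuf + 1) if kuf % r == 0)
            for m in mins]
    return kuf, regs

def variant_reg(reg_set, k):
    """Register written by variable v_i in stage variant k:
    r_i^{k mod r_i} (reg_set[0] is the original register)."""
    return reg_set[k % len(reg_set)]
```

For instance, with @math{II=2} and lifetimes 5 and 4, the second
variable needs only @math{r_@{i,min@}=2} versions but is rounded up to
@math{R_i=3} registers so that it divides @acronym{KUF}.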
Furthermore, in a given row of the unrolled schedule, not all stages
are in the same variant, yet all stage variants increment by one from
one schedule row to the next.

@quotation Remark
From the above it follows that for instructions that both define and
use the same variable @math{v_i} (i.e. of the form @math{v_i =
f(v_i)}), the renaming in variant @math{k} becomes
@math{r^@{k \bmod r_i@}_i = f(r^@{(k - 1) \bmod r_i@}_i)}.
@end quotation

Some variables (re)defined within the original (non-pipelined) loop
may still be used after the loop.  It is therefore crucial to add
compensation register copies after the epilogue, so that the latest
updates on the live-out variables are not lost in one of their extra
registers, but rather reside in their original register.

The choice has been made to define the first appearance of each stage
in the modulo schedule (i.e. the first non-empty cell of each column
in the @ref{tab:mve-scheme,,MVE Scheme}) to be in variant 1.  This
allows for a natural filling of the pipeline without prior register
moves, since all uses-before-def of each variable @math{v_i} will use
their original register @math{r^0_i}.

The code generation scheme to apply @acronym{MVE, Modulo Variable
Expansion} is presented in @ref{tab:mve-scheme,,MVE Scheme}.

@anchor{tab:mve-scheme}
@example
With @code{SC=4} stages and @code{KUF=4} unroll factor
╓──────────╥───────────────────╖ ║ @b{Schedule} ║ @b{Stage Variant} ║ ║ @b{Step} ║ ║ ╠══════════╬═══════════════════╣ ║ ║ @b{STAGE} ║ ║ ║000 001 002 003 ║ ║ ╓─╨──────────╖ ║ ╓──────────╥───────────────────╖ ╠════════╣@t{VERSION LOOP}╠════════╣ ║ @b{Schedule} ║ @b{Stage Variant} ║ ║ ╙────────────╜ ║ ║ @b{Step} ║ ║ ║ Adjust loop counter by SC ║ ╠══════════╬═══════════════════╣ ║ or version if runtime ║ ║ ║ @b{STAGE} ║ ║ counter.
║ ║ ║000 001 002 003 ║ ║ ╓───────────────╖ ║ ║ ╓─╨──────────╖ ║ ╠═══════╣@t{PRECONDITIONING}╠══════╣ ╠════════╣@t{VERSION LOOP}╠════════╣ ║ ╙───────────────╜ ║ ║ ╙────────────╜ ║ ║ Adjust loop counter by KUF ║ ║ Adjust loop counter by SC ║ ║ or make Duff's device with ║ ║ or version if runtime ║ ║ non-pipelined loop body. ║ ║ counter. ║ ║ ╓──────╖ ║ ║ ╓──────╖ ║ ╠═══════════╣@t{PROLOG}╠═══════════╣ ╠═══════════╣@t{PROLOG}╠═══════════╣ ║ ╙──────╜ ║ ║ ╙──────╜ ║ ║001 1 ║ ║001 0 ║ ║002 2 1 ║ ║002 0 0 ║ ║003 3 2 1 ║ ║003 0 0 0 ║ ║ ╓──────╖ ║ ║ ╓──────╖ ║ ╠═══════════╣@t{KERNEL}╠═══════════╣ ╠═══════════╣@t{KERNEL}╠═══════════╣ ║ ╙──────╜ ║ ║ ╙──────╜ ║ ║004 0 3 2 1 ║ ║004 0 0 0 0 ║ ║005 1 0 3 2 ║ ║ ╓──────╖ ║ ║006 2 1 0 3 ║ ╠═══════════╣@t{EPILOG}╠═══════════╣ ║007 3 2 1 0 ║ ║ ╙──────╜ ║ ║ ╓──────╖ ║ ║005 0 0 0 ║ ╠═══════════╣@t{EPILOG}╠═══════════╣ ║006 0 0 ║ ║ ╙──────╜ ║ ║007 0 ║ ║008 3 2 1 ║ ╙──────────────────────────────╜ ║009 3 2 ║ Original SMS scheme ║010 3 ║ ║ ╓─────────────────╖ ║ ╠══════╣@t{RESTORE LIVES-OUT}╠═════╣ ║ ╙─────────────────╜ ║ ║ Add register copies @code{0=3} ║ ║ for every variable that ║ ║ lives out to preserve ║ ║ soundness. ║ ╙──────────────────────────────╜ MVE scheme @end example @subsubsection Known pitfalls and performance bugs MVE sometimes causes performance drops when combined with O3. This is most often due to the lack of synergy between the loop-unroller and MVE unroller, which causes the loop to be unrolled way too much and the register pressure to increase drastically. As a side-effect, the loop iteration counter may become the coldest register, per construction of the currently SMS-compatible loops, and therefore will be the first one spilled. @quotation Note A hardware loop is a unique type of loop that utilizes dedicated registers - @code{loop_counter}, @code{loop_begin}, and @code{loop_end} - along with instructions to minimize the overhead of iterating through a section of code. 
@end quotation

@quotation Remark
GCC may, for many different reasons, some target-specific, cancel the
upgrade of a doloop to a hardware loop.  For this reason, GCC sets a
hard register aside to act as a substitute iteration counter in case
the doloop degenerates to a compare-and-jump, although a hardware loop
would use its own micro-architectural register.  Yet this safeguard
effectively increases the register pressure by one during register
allocation and other pressure-sensitive passes, leading to suboptimal
decisions in case the hardware loop is later committed.
@end quotation

@node SMS interoperability
@section Interactions with other optimization passes
@cindex modulo scheduler, other passes
@cindex sms, other passes

No information on the modulo schedule survives the SMS pass.

@subsubheading Options

@itemize @bullet
@item @option{-freschedule-modulo-scheduled-loops} Run sched2 on
modulo scheduled loops.  Off by default.
@end itemize

@subsection Interaction with register allocator
@cindex modulo scheduler, ira
@cindex sms, ira

The register allocator may not have enough hardware registers to
satisfy the live range requirements.  That is, the register pressure
(the maximum number of overlapping live ranges) may exceed the number
of available hardware registers.  When this occurs, the register
allocator may either spill a register to memory or split a live range
using a register-to-register copy.  Modulo scheduling, even more so
with MVE enabled, increases the register requirements, and therefore
makes either of the above actions more likely.  Aside from the
performance impact of a spill (or split), these are new instructions
the modulo scheduler did not account for in its schedule.  Because the
modulo scheduler seeks the highest usage of the available resources,
it is likely these new instructions won't fit in the modulo schedule,
making it obsolete.
Worse, no information on the modulo schedule survives to later passes,
so not even a minimal effort is made to insert these disruptive
instructions within the schedule.  Three solutions then present
themselves: (1) move the modulo scheduler after register allocation,
(2) provide SMS with forecasting tools (e.g. a register pressure
heuristic) about what may occur during register allocation, or (3)
keep SMS as a pre-pass but count on an improved sched2 to repair the
schedule.

@subsection Interaction with sched2
@cindex modulo scheduler, sched2
@cindex sms, sched2

Sched2 uses an acyclic scheduler (called Haifa in the sources) based
on list scheduling, which is incompatible with the cyclic nature of
SMS.  Yet, because of the new instructions introduced by the various
@t{split} passes and @t{ira}, sched2 is mandatory for performance,
even more so since it calls target-specific hooks and performs
bundling on some VLIW targets.

There is another limitation to Haifa, a by-product of its acyclicity:
for each extended basic block (ebb) it schedules, Haifa starts from a
blank page of latencies; the algorithm assumes no latencies drip over
from the predecessors of an ebb into it, and therefore schedules the
ebb's instructions from latency 0 (i.e. without any initial offset).
The issue is that the schedule obtained that way may not be optimal,
since the first scheduled instruction might have to stall waiting for
previously unaccounted latency, whereas another schedule with correct
latency dangles could have led to a different, non-stalling first
instruction.  This observation holds even more true when scheduling a
loop, where the set of ``latency predecessors'' is much larger
(previous iterations and pre-headers).  Accounting for these dangles
would reduce inter-iteration stalls.

@quotation Remark
On non-interlocked architectures this imprecision leads to incorrect
schedules; schedulers then conservatively require all operations to
complete (e.g. by inserting no-ops) before any branch is taken, rather
than completing them in a timely manner, producing even less efficient
schedules.
@end quotation

Therefore a refinement to the non-intrusive instruction bundling
extension is proposed: take inspiration from the
@url{https://dl.acm.org/doi/10.5555/243846.243903,meld scheduling}
algorithm and apply it to modulo scheduled loops.
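To make the dangle idea concrete, here is a small sketch (an assumed
representation, unrelated to Haifa's or meld scheduling's actual data
structures) of the earliest-start offsets such a scheduler would
propagate into a block, instead of assuming every incoming value is
ready at cycle 0:

```python
def entry_dangles(producers, used_values):
    """Earliest-start offset ("dangle") of each value consumed by a
    block: the part of the producer's latency that did not complete
    before the producing block ended.  'producers' maps
    value -> (issue_cycle, latency, length_of_producing_block)."""
    dangles = {}
    for val in used_values:
        issue, latency, block_len = producers[val]
        # Latency left over at the block boundary, clamped at 0.
        dangles[val] = max(0, issue + latency - block_len)
    return dangles
```

For example, a value issued at cycle 9 with latency 4 in an 11-cycle
block carries a dangle of 2 cycles into the successor, so its first
use there should not be scheduled before cycle 2.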