Hi everyone,

I am writing internals documentation for the modulo scheduler,
which I will submit together with my upcoming series of patches.
I would like your opinion on the attached texinfo file, and on whether
these UTF-8 box-drawing characters are acceptable in our docs.

It is still a work in progress. I have focused on detailing
what will always be true. See attached file.

Use texi2any --html modulo-scheduler.texi --no-split -c HTML_MATH=mathjax
-o modulo-scheduler.html

Benjamin
@c Copyright (C) 2019-2024 Free Software Foundation, Inc.
@c This is part of the GCC manual.
@c For copying conditions, see the file gcc.texi.
@c Contributed by Benjamin Priour <vultk...@gcc.gnu.org>.

@title Modulo Scheduler Internals

@node Modulo Scheduler
@chapter Modulo Scheduler
@cindex SMS
@cindex modulo scheduler

@menu
* Modulo Scheduler Overview::  Overview
* Pruned DDG:: How to make SMS more likely to succeed
* SMS interoperability:: Interoperability with other passes
* Debugging the Modulo Scheduler::   Useful debugging tips
@end menu

@node Modulo Scheduler Overview
@section Modulo Scheduler Overview
@cindex modulo scheduler
@cindex SMS

@subsubheading Background

@enumerate
@item J. Llosa, A. Gonzalez, E. Ayguade, M. Valero., and J. Eckhardt.
      @cite{@url{https://ieeexplore.ieee.org/document/910814,
        Lifetime-sensitive modulo scheduling in a production environment}}.
@item J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero.
      @cite{@url{https://www.researchgate.net/publication/2941400_Swing_Modulo_Scheduling_A_Lifetime-Sensitive_Approach,
      Swing Modulo Scheduling: A Lifetime Sensitive Approach}}.
@item M. Hagog and A. Zaks.
      @cite{@url{https://www.researchgate.net/publication/228937160_Swing_modulo_scheduling_for_gcc,
      Swing modulo scheduling for gcc}}.
@end enumerate

@subsubheading Related Options

@itemize @bullet

@item @option{-fmodulo-sched}

Turns on the modulo scheduling optimization.
@item @option{--param=sms-dfa-history=}

The number of cycles the swing modulo scheduler considers when checking
conflicts using DFA.
@item @option{--param=sms-loop-average-count-threshold=}

A threshold on the average loop count considered by the swing modulo
scheduler.
@item @option{--param=sms-max-ii-factor=}

A factor for tuning the upper bound the swing modulo scheduler uses when
scheduling a loop.
@item @option{--param=sms-min-sc=}

The minimum stage count that the swing modulo scheduler will generate.

@end itemize

@subsection Introduction


@example

The original loop is 11 cycles long, with many stalled cycles
(i.e.@: cycles during which no resource is used).

                      @t{Iteration N}
╓─────────────────╥─────────────────╥─────────────────╖
║       @b{LSU}       ║       @b{FPU}       ║      @b{Other}      ║
╠═════════════════╬═════════════════╬═════════════════╣
║ v1 = [x4*8 +x1] ║      @i{wasted}     ║      @i{wasted}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║ v2 = [x4*8 +x2] ║      @i{wasted}     ║      @i{wasted}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{CYCLE}              @i{IS}              @i{STALLED}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{wasted}     ║ v1 = v1*v0 + v2 ║      @i{wasted}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{CYCLE}              @i{IS}              @i{STALLED}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{CYCLE}              @i{IS}              @i{STALLED}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{CYCLE}              @i{IS}              @i{STALLED}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{CYCLE}              @i{IS}              @i{STALLED}     ║
╠═════════════════╬═════════════════╬═════════════════╣
║ [x4*8 + x3] = v1║      @i{wasted}     ║   x4 = x4 + 1   ║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{wasted}     ║      @i{wasted}     ║ cc = cmp(x0, x4)║
╠═════════════════╬═════════════════╬═════════════════╣
║      @i{wasted}     ║      @i{wasted}     ║       jump      ║
╙─────────────────╨─────────────────╨─────────────────╜


   @t{Iteration N}             Visualize that we slide successive iterations
  ╓────╥────╥────╖           II cycles apart to build a new kernel.
  ║ @b{LSU}║ @b{FPU}║ @b{Oth}║
  ╠════╬════╬════╣
  ║ i1 ║    ║    ║
  ╠════╬════╬════╣
  ║ i2 ║    ║    ║ @t{Iteration N+1}
  ╠════╬════╬════╣╓────╥────╥────╖
  ║              ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║
  ╠════╬════╬════╣╠════╬════╬════╣
  ║    ║ i3 ║    ║║ i1 ║    ║    ║
  ╠════╬════╬════╣╠════╬════╬════╣
  ║              ║║ i2 ║    ║    ║ @t{Iteration N+2}
  ╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖
  ║              ║║              ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
  ║              ║║    ║ i3 ║    ║║ i1 ║    ║    ║
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
  ║              ║║              ║║ i2 ║    ║    ║ @t{Iteration N+3}
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖
  ║ i4 ║    ║ i5 ║║              ║║              ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
  ║    ║    ║ i6 ║║              ║║    ║ i3 ║    ║║ i1 ║    ║    ║
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
  ║    ║    ║jump║║              ║║              ║║ i2 ║    ║    ║
┅┅╙┅┅┅┅╨┅┅┅┅╨┅┅┅┅╜╠┅┅┅┅╬┅┅┅┅╬┅┅┅┅╣╠┅┅┅┅╬┅┅┅┅╬┅┅┅┅╣╠┅┅┅┅╬┅┅┅┅╬┅┅┅┅╣┅┅ N ENDS ┅┅
                  ║ i4 ║    ║ i5 ║║              ║║              ║
                  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
                  ║    ║    ║ i6 ║║              ║║    ║ i3 ║    ║
                  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
                  ║    ║    ║jump║║              ║║              ║
                  ╙────╨────╨────╜╠════╬════╬════╣╠════╬════╬════╣
                                  ║ i4 ║    ║ i5 ║║              ║
                                  ╠════╬════╬════╣╠════╬════╬════╣
                                  ║    ║    ║ i6 ║║              ║
                                  ╠════╬════╬════╣╠════╬════╬════╣
                                  ║    ║    ║jump║║              ║
                                  ╙────╨────╨────╜╠════╬════╬════╣
  Iteration N has ended, there is                 ║ i4 ║    ║ i5 ║
  no point in stacking yet another iteration.      ╠════╬════╬════╣
                                                  ║    ║    ║ i6 ║
                                                  ╠════╬════╬════╣
                                                  ║    ║    ║jump║
                                                  ╙────╨────╨────╜
   @t{Iteration N}
  ╓────╥────╥────╖
  ║ @b{LSU}║ @b{FPU}║ @b{Oth}║
  ╠════╬════╬════╣━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ┓
  ║ i1 ║    ║    ║                  Actually we have found a new kernel
  ╠════╬════╬════╣                    containing every instruction exactly once
  ║ i2 ║    ║    ║ @t{Iteration N+1}        without any resource conflicts.
  ╠════╬════╬════╣╓────╥────╥────╖                                ┃
  ║              ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║                                ┃
  ╠════╬════╬════╣╠════╬════╬════╣                                ┃
  ║    ║ i3 ║    ║║ i1 ║    ║    ║                                ┃@b{PROLOG}
  ╠════╬════╬════╣╠════╬════╬════╣                                ┃
  ║              ║║ i2 ║    ║    ║ @t{Iteration N+2}                  ┃
  ╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖                ┃
  ║              ║║              ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║                ┃
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣                ┃
  ║              ║║    ║ i3 ║    ║║ i1 ║    ║    ║                ┃
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣━ ━ ━ ━ ━ ━ ━ ━ ┛
  ║              ║║              ║║ i2 ║    ║    ║ @t{Iteration N+3}
  ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╓────╥────╥────╖
 ┏╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣┓
 ┃║ i4 ║    ║ i5 ║║              ║║              ║║ @b{LSU}║ @b{FPU}║ @b{Oth}║┃
 ┃╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣┃
 ┃║    ║    ║ i6 ║║              ║║    ║ i3 ║    ║║ i1 ║    ║    ║┃@b{KERNEL}
 ┃╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣┃
 ┃║    ║    ║jump║║              ║║              ║║ i2 ║    ║    ║┃
 ┗╙━━━━╨━━━━╨━━━━╜╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣╠━━━━╬━━━━╬━━━━╣┛
 ┏━ ━ ━ ━ ━ ━ ━ ━ ║ i4 ║    ║ i5 ║║              ║║              ║
 ┃                ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
 ┃  @b{EPILOG}        ║    ║    ║ i6 ║║              ║║    ║ i3 ║    ║
 ┃                ╠════╬════╬════╣╠════╬════╬════╣╠════╬════╬════╣
 ┃                ║    ║    ║jump║║              ║║              ║
 ┃                ╙────╨────╨────╜╠════╬════╬════╣╠════╬════╬════╣
 ┃                                ║ i4 ║    ║ i5 ║║              ║
 ┃╓────╥────╥────╖                ╠════╬════╬════╣╠════╬════╬════╣
 ┃║ @b{LSU}║ @b{FPU}║ @b{Oth}║                ║    ║    ║ i6 ║║              ║
 ┃╠════╬════╬════╣                ╠════╬════╬════╣╠════╬════╬════╣
 ┃┃ i4 ┃    ┃ i5 ┃                ║    ║    ║jump║║              ║
 ┃┣━━━━╋━━━━╋━━━━┫                ╙────╨────╨────╜╠════╬════╬════╣
 ┃┃ i1 ┃ i3 ┃ i6 ┃ II=3 success!                  ║ i4 ║    ║ i5 ║
 ┃┣━━━━╋━━━━╋━━━━┫                                ╠════╬════╬════╣
 ┃┃ i2 ┃    ┃jump┃                                ║    ║    ║ i6 ║
 ┃┖────┸────┸────┚                                ╠════╬════╬════╣
 ┃                                                ║    ║    ║jump║
 ┗━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ╙────╨────╨────╜

@end example

Modulo scheduling is a software pipelining technique. Contrary to acyclic
scheduling algorithms such as list scheduling, a modulo scheduler considers
inter-iteration dependencies when producing its schedule: it is a
@emph{global} (goes beyond basic block boundaries) @emph{cyclic}
(processes back edges) scheduling technique which aims to increase the
throughput of the innermost loops.

Modulo scheduling attempts to combine multiple successive iterations
into a new smaller kernel of length @strong{@acronym{II, Initiation Interval}}
cycles, such that the hardware resources (ALU, LSU, @dots{}) are reused
modulo II.

A modulo schedule is often represented as a table, where each row represents
a cycle in @math{[0, II-1]}, and each column represents an execution unit.

A modulo scheduler works on a @ref{ddg-def,,Data Dependence Graph}.

SMS's workflow is given below.

@smallexample

  Create DDG for loop.
        ┃
        v
  Compute MII = Max(RecMII, ResMII)
  and MaxII, set II = MII
        ┃
        ┃<━━━━━━━━━━━ II++ ━━━━━━━━━━━━┓
        v                              ┃
  Schedule nodes from                  ┃
  time 0 to infinity.                  ┃
        ┃                              ┃
        ┠━━━━ If failed in ━━━━━━━━━━━━┨
        ┃     scheduling all insns     ┃
        ┃                              ┃
    If success                         ┃
        ┃                              ┃
        ┃                              ┃
        v                              ┃
  Resolve live ranges overflows        ┃
  with regmoves or MVE.                ┃
        ┃                              ┃
        ┃                              ┃
        ┠━━━ If failed to insert ━━━━━━┛
        ┃    register copies into
        ┃    schedule (MVE: if KUF
        ┃    too large)
        ┃
        ┃
  If we succeeded in scheduling
  the loop in II cycles
        ┃
        v
  Generate prolog and epilog and decrease the
  counter of the loop. If runtime counter,
  do loop versioning.

@end smallexample
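The II search in the workflow above can be sketched as follows. This is only
a sketch: @code{try_schedule} and its @code{achievable_ii} input are
hypothetical stand-ins for the real DDG-based scheduling attempt, not GCC's
actual interfaces.

```c
/* Sketch of the SMS driver loop: try successive candidate IIs,
   starting at MII = max(RecMII, ResMII), up to a MaxII bound.
   try_schedule is a hypothetical stand-in for the real scheduling
   attempt; here it pretends scheduling succeeds once II reaches
   some achievable value.  */

static int
try_schedule (int ii, int achievable_ii)
{
  return ii >= achievable_ii;
}

/* Return the first II in [MII, MaxII] at which scheduling succeeds,
   or -1 if even MaxII fails.  */
static int
find_ii (int rec_mii, int res_mii, int max_ii, int achievable_ii)
{
  int mii = rec_mii > res_mii ? rec_mii : res_mii;
  for (int ii = mii; ii <= max_ii; ii++)
    if (try_schedule (ii, achievable_ii))
      return ii;
  return -1;
}
```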

@subsubsection Limitations on loops

SMS works with countable loops (1) whose control part can be easily
decoupled from the rest of the loop and (2) whose loop count can
be easily adjusted.  This is because we peel a constant number of
iterations into a prologue and epilogue for which we want to avoid
emitting the control part, and a kernel which is to iterate that
constant number of iterations less than the original loop.  So the
control part should be a clearly identified set of insns, having
its own iv that is not otherwise used in the loop (at least for now),
which initializes a register before the loop to the number of iterations.

Currently SMS relies on the do-loop pattern to recognize such loops,
where (1) the control part comprises all insns defining and/or
using a certain 'count' register and (2) the loop count can be
adjusted by modifying this register prior to the loop.
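As a source-level illustration (SMS itself runs on RTL), the introduction's
schedule corresponds to a loop of the following shape; the names are made up
for the example. The loop control is trivially decoupled from the body, and a
doloop pattern can map it to a dedicated count register.

```c
/* Source-level illustration of an SMS-friendly countable loop,
   matching the introduction's schedule: two loads, a multiply-add,
   a store, and a control part (i, n) cleanly decoupled from the
   body.  The body never defines or uses the count register.  */
void
fma_loop (double *restrict d, const double *restrict a,
          const double *restrict b, double c, long n)
{
  for (long i = 0; i < n; i++)
    d[i] = a[i] * c + b[i];
}
```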

@subsection The Swing algorithm

The modulo scheduler within GCC uses the Swing algorithm, a near-optimal
modulo scheduling algorithm (see [1,2]). It was implemented in 2004
(see [3]) and has received only sporadic attention since then.

The algorithm is further described in a comment in
@file{modulo-sched.cc}. All these resources should suffice for the reader
to understand the Swing algorithm, so it is not repeated here.


@subsubsection @acronym{ResMII, Resource Minimum Initiation Interval}


@subsubsection @acronym{RecMII, Recurrence Minimum Initiation Interval}



@node Pruned DDG
@section How to make SMS more likely to succeed
@cindex modulo scheduler, pruned DDG
@cindex SMS, pruned DDG

@subsubheading Options

@itemize @bullet

@item @option{-fmodulo-sched-allow-regmoves}

Allow SMS to use a pruned DDG and insert register copies to fix overlaps.

@item @option{-fmodulo-sched-allow-mve}

Allow SMS to use a pruned DDG and unroll the kernel to fix overlaps.

@end itemize

@anchor{ddg-def}
A @acronym{DDG, Data-Dependence Graph} is a directed graph whose nodes are
operations and whose arcs are data dependencies. A data dependence may be one
of three types: @acronym{RAW, Read-after-Write} aka TRUE,
@acronym{WAR, Write-after-Read} aka ANTI, or @acronym{WAW, Write-after-Write}
aka OUTPUT.

The DDG of a loop is always cyclic, because for each RAW arc @math{(x,y)} of
cost @code{latency(x)} there is an equivalent WAR arc @math{(y,x)} of cost
@code{latency(y)}: if at iteration N @math{x_N} comes before @math{y_N},
then @math{y_N} comes before @math{x_{N+1}}, which reverses the dependence.


@quotation Register lifetime

The lifetime of a register is defined as the duration between the first
assignment into the register and its last use, which may be in a later
iteration. If the lifetime of a variable is @math{l}, and an iteration is
initiated every @acronym{II} cycles, then at least
@math{r_@{i,min@} = \lceil \frac@{l@}@{II@}\rceil}
versions (values) of that variable must be kept alive simultaneously at
all times.

@end quotation
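The bound above is a plain ceiling division; as a sketch:

```c
/* Minimum number of values of a variable that must be live at once:
   r_min = ceil(l / II), for a lifetime of l cycles when a new
   iteration is initiated every II cycles.  */
static int
min_live_values (int lifetime, int ii)
{
  return (lifetime + ii - 1) / ii;
}
```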

@smallexample
━>: Read-after-Write dep
┅>: Write-after-Read dep
┈>: Write-after-Write dep

      ╭────╮
      │ i1 │
      ╰────╯
        ┃^
        v┋
      ╭────╮
      │ i2 │
      ╰────╯
      ^┃  ┃^
  ┏┅┅┅┛┃  ┃┗┅┅┅┓
  ┋┏━━━┛  ┗━━━┓┋
  ┋┃          ┃┋
  ┋v          v┋
 ╭────╮     ╭────╮
 │ i3 │     │ i4 │
 ╰────╯     ╰────╯

@end smallexample

Having either arc orders the two nodes relative to one another: it indicates
which one must come before the other. Keeping both arcs additionally
constrains the distance between them. This distance constraint guarantees
that, upon a successful schedule at II, every register live range is shorter
than II cycles. However, this more constrained DDG makes scheduling more
difficult for SMS, and it fails more often. The solution therefore is to keep
only the RAW arc wherever there are two. Doing so still guarantees a sound
ordering, but some live ranges may no longer fit into II cycles. For these,
SMS either inserts register copies (with regmoves) or unrolls the kernel
(with @acronym{MVE, Modulo Variable Expansion}) to prevent a definition from
overwriting itself before it reaches its use.

@smallexample
━>: Read-after-Write dep
┅>: Write-after-Read dep
┈>: Write-after-Write dep

      ╭────╮
      │ i1 │
      ╰────╯
        ┃
        v
      ╭────╮
      │ i2 │
      ╰────╯
       ┃  ┃
       ┃  ┃
   ┏━━━┛  ┗━━━┓
   ┃          ┃
   v          v
 ╭────╮     ╭────╮
 │ i3 │     │ i4 │
 ╰────╯     ╰────╯

@end smallexample


@subsection Register to register moves

Register-to-register moves were part of the original 2004 implementation, as
an approximation to Modulo Variable Expansion. Enabled by
@option{-fmodulo-sched-allow-regmoves}, this builds a pruned DDG. Then, once
the DDG's nodes are scheduled, SMS counts the number of register copies
needed to carry each overlong live range across iterations, and inserts
these copies into the schedule.

@subsubsection Limitations

@itemize @bullet
@item The register copies may fail to fit into the schedule.
@item Poor interaction with long-latency moves.
@item Increased compile time in case of failure.
@end itemize

@subsection Modulo Variable Expansion
@cindex MVE

@acronym{MVE, Modulo Variable Expansion} is similar to the variable expansion
technique, but reduces the number of locations allocated to a variable by
reusing the same location in non-overlapping iterations. Therefore the
minimum number of extra registers for each redefined variable can be lower
than the unroll factor.

It is an alternative to @option{-fmodulo-sched-allow-regmoves} that performs
better in the presence of long-latency moves, or when the loop is small (in
terms of its number of instructions) and contains long-latency operations.
In all other cases, except for the performance bugs reported below, it
performs similarly to 'regmoves'. This option is turned on with
@option{-fmodulo-sched-allow-mve}.

@subsubheading Algorithm

@enumerate
@item Pretend that every iteration of the loop has a dedicated register
location for each variable, and remove all inter-iteration precedence
constraints.

@item Modulo schedule as usual.

@item Use the schedule thus obtained to calculate the lifetime of all
variables using the method explained below. This step also determines the
minimum @acronym{KUF, Kernel Unroll Factor}.

@item Increase @math{R_i}, the number of registers allocated to each
variable, from @math{r_{i,min}} to a factor of @acronym{KUF}.

@item Unroll the loop @acronym{KUF}
times to obtain a new larger kernel made of @acronym{KUF}
successive copies of the original kernel.

@item Modulo rename each variable across the prologue, epilogue and kernels.

@item Generate register copies after the epilogue to preserve correctness on live-out variables.
@end enumerate

@subsubsection Kernel Unroll Factor

The minimum unrolling factor is simply @math{\mathsf@{KUF@} = \max_i r_@{i,min@}}.
It can be achieved by setting the effective number of registers @math{R_i}
(original register plus extras) allocated to variable @math{v_i} to be the
smallest factor of @acronym{KUF} that is no smaller than @math{r_@{i,min@}},
i.e.,

@displaymath
R_i = \min\left\{r_i \;\vert\; (r_i \geq r_{i,min}) \land (\mathsf{KUF} \bmod r_i = 0)\right\}
@end displaymath

In total there will be @math{\sum_i (R_i - 1)} extra registers, and by
construction @math{r_@{max@} = \max_i R_i = \mathsf@{KUF@}}. GCC currently
assumes that the loop consists of a single basic block without any early
exits or control structures (such as @t{if} statements). This allows for a
straightforward loop unrolling algorithm: duplicate all instructions in the
loop, except for the jump instruction, @acronym{KUF} times. However, care
must be taken to accurately recreate the liveness metadata of the duplicated
instructions. This is possible because none of the instructions, other than
the jump instruction which is not duplicated, use or define the iteration
counter, eliminating the need for additional adjustments.
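The choice of @math{R_i} as the smallest factor (divisor) of @acronym{KUF}
no smaller than @math{r_@{i,min@}} can be sketched as:

```c
/* R_i: the smallest divisor of kuf that is no smaller than r_min.
   kuf itself always qualifies, so the loop terminates; by
   construction of KUF, r_min <= kuf on entry.  */
static int
regs_for_variable (int r_min, int kuf)
{
  for (int r = r_min; r < kuf; r++)
    if (kuf % r == 0)
      return r;
  return kuf;
}
```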

@subsubsection Register Requirements

@cartouche
@b{Stage Variant}

Let @math{\mathcal@{R@}_@{v_i@}} be the non-empty set of @math{r_i} distinct
registers used by variable @math{v_i}. Let @math{r^j_i, 0 \leq j < r_i} be
the elements of that set, where @math{r^0_i} is the register found in the
original non-renamed instruction. When a stage @math{g} is said to be in
variant @math{k}, it means that for every instruction @t{insn} of that stage,
if @t{insn} redefines the variable @math{v_i} with value @math{x}, then it
stores this value @math{x} into register @math{r^@{k \bmod r_i@}_i} of
@math{\mathcal@{R@}_@{v_i@}}.
@end cartouche

All instructions that @b{use} variable @math{v_i} between its last definition
in variant @math{k} and its next definition in variant
@math{(k + 1) \bmod KUF} will use register @math{r^@{k \bmod r_i@}_i}.
Furthermore, in a given row of the unrolled schedule not all stages are in
the same variant, yet all stage variants increment by one from one schedule
row to the next.

@quotation Remark
From the above it follows that for instructions that both define and use the
same variable @math{v_i} (i.e.@: of the form @math{v_i = f(v_i)}), the
renaming to variant @math{k} becomes
@math{r^@{k \bmod r_i@}_i = f(r^@{(k - 1) \bmod r_i@}_i)}.
@end quotation

Some variables (re)defined within the loop may still be used after the loop.
It is therefore crucial to add compensation register copies after the
epilogue, so that the latest updates to the live-out variables are not lost
in one of their extra registers, but instead end up in their original
register.

The choice has been made to define the first appearance of each stage in the
modulo schedule (i.e. the first non-empty cell of each column in the
@ref{tab:mve-scheme,,MVE Scheme}) to be in variant 1. This allows for a natural
filling of the pipeline without prior register moves, since all uses-before-def
of each variable @math{v_i} will use their original register @math{r^0_i}.
The code generation schema to apply @acronym{MVE, Modulo Variable Expansion} is
presented in @ref{tab:mve-scheme,,MVE Scheme}.

@anchor{tab:mve-scheme}
@example
With @code{SC=4} stages and @code{KUF=4} unroll factor

                                           ╓──────────╥───────────────────╖
                                           ║ @b{Schedule} ║   @b{Stage Variant}   ║
                                           ║   @b{Step}   ║                   ║
                                           ╠══════════╬═══════════════════╣
                                           ║          ║       @b{STAGE}       ║
                                           ║          ║000  001  002  003 ║
                                           ║        ╓─╨──────────╖        ║
╓──────────╥───────────────────╖           ╠════════╣@t{VERSION LOOP}╠════════╣
║ @b{Schedule} ║   @b{Stage Variant}   ║           ║        ╙────────────╜        ║
║   @b{Step}   ║                   ║           ║ Adjust loop counter by SC    ║
╠══════════╬═══════════════════╣           ║ or version if runtime        ║
║          ║       @b{STAGE}       ║           ║ counter.                     ║
║          ║000  001  002  003 ║           ║       ╓───────────────╖      ║
║        ╓─╨──────────╖        ║           ╠═══════╣@t{PRECONDITIONING}╠══════╣
╠════════╣@t{VERSION LOOP}╠════════╣           ║       ╙───────────────╜      ║
║        ╙────────────╜        ║           ║ Adjust loop counter by KUF   ║
║ Adjust loop counter by SC    ║           ║ or make Duff's device with   ║
║ or version if runtime        ║           ║ non-pipelined loop body.     ║
║ counter.                     ║           ║           ╓──────╖           ║
║           ╓──────╖           ║           ╠═══════════╣@t{PROLOG}╠═══════════╣
╠═══════════╣@t{PROLOG}╠═══════════╣           ║           ╙──────╜           ║
║           ╙──────╜           ║           ║001          1                ║
║001          0                ║           ║002          2    1           ║
║002          0    0           ║           ║003          3    2    1      ║
║003          0    0    0      ║           ║           ╓──────╖           ║
║           ╓──────╖           ║           ╠═══════════╣@t{KERNEL}╠═══════════╣
╠═══════════╣@t{KERNEL}╠═══════════╣           ║           ╙──────╜           ║
║           ╙──────╜           ║           ║004          0    3    2    1 ║
║004          0    0    0    0 ║           ║005          1    0    3    2 ║
║           ╓──────╖           ║           ║006          2    1    0    3 ║
╠═══════════╣@t{EPILOG}╠═══════════╣           ║007          3    2    1    0 ║
║           ╙──────╜           ║           ║           ╓──────╖           ║
║005               0    0    0 ║           ╠═══════════╣@t{EPILOG}╠═══════════╣
║006                    0    0 ║           ║           ╙──────╜           ║
║007                         0 ║           ║008               3    2    1 ║
╙──────────────────────────────╜           ║009                    3    2 ║
      Original SMS scheme                  ║010                         3 ║
                                           ║      ╓─────────────────╖     ║
                                           ╠══════╣@t{RESTORE LIVES-OUT}╠═════╣
                                           ║      ╙─────────────────╜     ║
                                           ║  Add register copies @code{0=3}     ║
                                           ║  for every variable that     ║
                                           ║  lives out to preserve       ║
                                           ║  soundness.                  ║
                                           ╙──────────────────────────────╜
                                                      MVE scheme

@end example

@subsubsection Known pitfalls and performance bugs

MVE sometimes causes performance drops when combined with @option{-O3}.
This is most often due to the lack of synergy between the loop unroller
and the MVE unroller, which causes the loop to be unrolled far too much and
the register pressure to increase drastically.

As a side effect, the loop iteration counter may become the coldest register,
by construction of the currently SMS-compatible loops, and therefore will be
the first one spilled.

@quotation Note
A hardware loop is a special kind of loop that uses dedicated registers
(@code{loop_counter}, @code{loop_begin}, and @code{loop_end}) along with
dedicated instructions to minimize the overhead of iterating through a
section of code.
@end quotation


@quotation Remark
GCC may, for many different reasons, some target-specific, cancel the upgrade
of a doloop to a hardware loop. For this reason, GCC sets a hard register
aside to act as a substitute iteration counter in case the doloop degenerates
to a compare-and-jump, even though a hardware loop would use its own
micro-architectural register. Yet this safeguard effectively increases the
register pressure by one during register allocation and other
pressure-sensitive passes, leading to suboptimal decisions in case the
hardware loop is later committed.
@end quotation


@node SMS interoperability
@section Interactions with other optimization passes
@cindex modulo scheduler, other passes
@cindex sms, other passes

No information on the modulo schedule survives the SMS pass.

@subsubheading Options

@itemize @bullet

@item @option{-freschedule-modulo-scheduled-loops}

Run sched2 on modulo scheduled loops. Off by default.

@end itemize


@subsection Interaction with register allocator
@cindex modulo scheduler, ira
@cindex sms, ira

The register allocator may not have enough hardware registers to satisfy
the live range requirements. That is, the register pressure (maximum number
of overlapping live ranges) may exceed the available number of hardware
registers. When this occurs, the register allocator may either spill a
register to memory or split a live range using a register-to-register copy.

Modulo scheduling, even more so with MVE enabled, increases the register
requirements, and therefore makes either of the above actions more likely.
Aside from the performance impact of a spill (or split), these are new
instructions the modulo scheduler did not account for in its schedule.
Because the modulo scheduler seeks the highest usage of available resources,
it is likely these new instructions won't fit in the modulo schedule,
making it obsolete.

Worse, no information on the modulo schedule survives to later passes,
so these disruptive instructions cannot even be inserted where they would
damage the schedule the least.

Three solutions then present themselves: (1) move the modulo scheduler
after register allocation, (2) provide SMS with forecasting tools (e.g.@: a
register pressure heuristic) for what may occur during register allocation,
and (3) keep SMS as a pre-pass but count on an improved sched2 to repair the
schedule.

@subsection Interaction with sched2
@cindex modulo scheduler, sched2
@cindex sms, sched2

Sched2 uses an acyclic scheduler (called Haifa in the sources) based on list
scheduling, which is incompatible with the cyclic nature of SMS.
Yet, because of the new instructions introduced by the various @t{split}
passes and @t{ira}, sched2 is mandatory for performance; even more so since
it calls target-specific hooks and performs bundling on some VLIW targets.


There is another limitation to Haifa, a by-product of its acyclicity:
for each extended basic block (ebb) it schedules, Haifa starts from a blank
slate of latencies. The algorithm assumes no latencies carry over from the
predecessors of an ebb into it, and therefore schedules the ebb's
instructions from latency 0 (i.e.@: without any initializing offset). The
schedule obtained that way may not be optimal, since the first scheduled
instruction might have to stall waiting for previously unaccounted latency,
whereas another schedule with correct latency dangles could have led to a
different, non-stalling first instruction.

This observation holds all the more when scheduling a loop, where the set of
'latency predecessors' is much larger (previous iterations and pre-headers).
Accounting for these dangles would reduce inter-iteration stalls.

@quotation Remark
On non-interlocked architectures this imprecision leads to incorrect
schedules; schedulers must then conservatively require all operations
to complete (e.g.@: by inserting no-ops) before any branch is taken, rather
than completing them in a timely manner, producing even less efficient
schedules.
@end quotation

Therefore, a refinement to the non-intrusive instruction bundling
extension is proposed: take inspiration from the
@url{https://dl.acm.org/doi/10.5555/243846.243903,meld scheduling} algorithm
and apply it to modulo scheduled loops.

