Re: Scheduling x86 dispatch windows

H.J. Lu Thu, 10 Jun 2010 16:21:45 -0700

On Thu, Jun 10, 2010 at 3:09 PM, Quentin Neill
<quentin.neill....@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <hjl.to...@gmail.com> wrote:
>> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
>> <quentin.neill....@gmail.com> wrote:
>>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <l...@redhat.com> wrote:
>>>> On 06/10/10 13:52, H.J. Lu wrote:
>>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>>>> <quentin.neill....@gmail.com>  wrote:
>>>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>>>> is relevant - see the last few paragraphs regarding how to
>>>>>> "solve the alignment problem".
>>>>>>
>>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani <yazdani_r...@yahoo.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>>>> of a new hardware feature in the latest AMD microprocessor. This
>>>>>>> feature requires a certain mix, ordering and alignments in
>>>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>>>
>>>>>>> I am asking the community to review this high level implementation
>>>>>>> design and give me direction or advice.
>>>>>>>
>>>>>>> The new hardware issues two windows of size N bytes of
>>>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>>>> windows have the right combination of instructions or alignments. Our
>>>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>>>> alignments.
>>>>>>>
>>>>>>> Here is a summary of the most important requirements:
>>>>>>>
>>>>>>> a) Maximum of N instructions per window.
>>>>>>> b) An instruction may cross the first window.
>>>>>>> c) Each window can have a maximum of x memory loads and y memory
>>>>>>>    stores.
>>>>>>> d) The total number of immediate constants in the instructions
>>>>>>>    of a window should not exceed k.
>>>>>>> e) The first window must be aligned on a 16-byte boundary.
>>>>>>> f) A window set terminates when a branch exists in a window.
>>>>>>> g) The number of allowed prefixes varies for instructions.
>>>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>>>    or terminated by nops to ensure adherence to the rules.
>>>>>>>
>>>>>>> We have the following implementation plan for GCC:
>>>>>>>
>>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>>>    instructions for the two dispatch windows. The scheduler is called
>>>>>>>    once before and once after register allocation as usual. In both
>>>>>>>    cases it performs dispatch scheduling along with its normal job of
>>>>>>>    instruction scheduling.
>>>>>>>
>>>>>>> The advantage of doing it before register allocation is avoiding
>>>>>>> extra dependencies caused by register allocation which may become
>>>>>>> an obstacle to movement of instructions.  The advantage of doing
>>>>>>> it after register allocation is that it accounts for spill code
>>>>>>> which may be generated by the register allocator.
>>>>>>>
>>>>>>> The algorithm we use is:
>>>>>>>
>>>>>>> a) Considering the current dispatch window set, choose the first
>>>>>>>    instruction from the ready queue that does not violate dispatch rules.
>>>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>>>    dispatcher code about the instruction. This step keeps track of the
>>>>>>>    instruction content of windows for future evaluation. It also manages
>>>>>>>    the window set by closing and opening new virtual dispatch windows.
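
The selection loop in a) and b) above can be sketched roughly as follows.
This is only an illustration in Python, not the proposed GCC code: the
Insn and Window classes, the schedule() function and every numeric limit
are hypothetical placeholders, instruction sizes are assumed to be known,
the ready list is static (no dependencies), and a single window stands in
for the two-window set.

```python
from dataclasses import dataclass, field


@dataclass
class Insn:
    name: str
    size: int                # encoded length in bytes, assumed known here
    loads: int = 0
    stores: int = 0
    imms: int = 0
    is_branch: bool = False


@dataclass
class Window:
    # All limits are placeholders, not the real hardware numbers.
    max_bytes: int = 16
    max_insns: int = 4
    max_loads: int = 2
    max_stores: int = 1
    max_imms: int = 2
    insns: list = field(default_factory=list)

    def fits(self, insn):
        """True if adding insn keeps the window within every limit."""
        if self.insns and self.insns[-1].is_branch:
            return False     # rule f, simplified: a branch ends the window
        group = self.insns + [insn]
        return (len(group) <= self.max_insns
                and sum(i.size for i in group) <= self.max_bytes
                and sum(i.loads for i in group) <= self.max_loads
                and sum(i.stores for i in group) <= self.max_stores
                and sum(i.imms for i in group) <= self.max_imms)


def schedule(ready):
    """Greedy first-fit over a static ready list; returns the windows."""
    ready = list(ready)
    windows = [Window()]
    while ready:
        current = windows[-1]
        pick = next((i for i in ready if current.fits(i)), None)
        if pick is None:
            if not current.insns:
                raise ValueError("instruction cannot fit in an empty window")
            windows.append(Window())   # close this window, open a new one
            continue
        ready.remove(pick)             # step a: first candidate that fits
        current.insns.append(pick)     # step b: record it in the window state
    return windows
```

With these placeholder limits, a third load in the ready list is skipped
in favour of a later non-load instruction and only lands in the next
window, which is the kind of reordering step a) describes.
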
>>>>>>>
>>>>>>> 2) Insertion of alignment code.
>>>>>>>
>>>>>>> In x86, alignment is done by inserting prefixes or by generating
>>>>>>> nops. Since GCC relies on the assembler to generate object code, some
>>>>>>> information such as the sizes of branches is unknown until assembly
>>>>>>> or link time. To do alignments related to dispatch correctly in GCC,
>>>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>>>> they converge. This pass currently does not exist in GCC, but it
>>>>>>> exists in the assembler.
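
To illustrate why this has to iterate, here is a minimal relaxation
sketch in Python. It is not the assembler's actual code: relax() and the
item tuples are hypothetical, and only the two jmp encodings (2-byte rel8,
5-byte rel32) are modelled. Branches start short and only ever grow, so
the loop reaches a fixed point; each growth step can change the padding
that a later alignment request needs.

```python
SHORT, LONG = 2, 5  # sizes of x86 "jmp rel8" and "jmp rel32"


def relax(items):
    """items: list of ("insn", size), ("jmp", target_index) or
    ("align", boundary).  Returns (sizes, offsets) at the fixed point."""
    sizes = [SHORT if kind == "jmp" else 0 if kind == "align" else arg
             for kind, arg in items]
    while True:
        # Lay everything out using the current size guesses.
        offsets, pos = [], 0
        for idx, (kind, arg) in enumerate(items):
            if kind == "align":
                sizes[idx] = (-pos) % arg   # padding up to the boundary
            offsets.append(pos)
            pos += sizes[idx]
        offsets.append(pos)                 # offset just past the last item

        # Grow any short jump whose displacement no longer fits in 8 bits.
        changed = False
        for idx, (kind, arg) in enumerate(items):
            if kind == "jmp" and sizes[idx] == SHORT:
                disp = offsets[arg] - (offsets[idx] + SHORT)
                if not -128 <= disp <= 127:
                    sizes[idx] = LONG
                    changed = True
        if not changed:
            return sizes, offsets
```

For example, with a jump over 130 bytes of code to a 16-byte-aligned
target, the first pass finds the displacement too large, the jump grows
from 2 to 5 bytes, and the padding in front of the target shrinks
accordingly on the second pass.
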
>>>>>>>
>>>>>>> There are two possible approaches to solve the alignment problem.
>>>>>>>
>>>>>>> a)  Let the assembler perform the alignments and padding needed
>>>>>>>     to adhere to the new machine dispatching rules and avoid an extra
>>>>>>>     pass in GCC.
>>>>>>> b)  Add a new pass to GCC that mimics what the assembler does, before
>>>>>>>     generating the assembly listing, and inserts the required alignments.
>>>>>>>
>>>>>>> I appreciate your comments on the proposed implementation procedure
>>>>>>> and the choices a or b above.
>>>>>>>
>>>>>
>>>>> I don't think this should be done in the assembler. The assembler should
>>>>> just assemble the assembly input.
>>>>
>>>> That adds quite a bit of complication to the compiler though -- getting the
>>>> instruction lengths right (and thus proper packing & alignment) can be
>>>> extremely difficult.  I did some experiments with this on a target with
>>>> *fixed* instruction lengths a while back and even though the port tried
>>>> hard to get lengths right, it would routinely miss something.  Ultimately
>>>> I decided that forcing the compiler to know instruction lengths with a
>>>> very high degree of accuracy wasn't a sane thing to do.  Dealing with
>>>> variable instruction lengths just adds yet another complexity to the
>>>> situation.  Then add the complication of needing to add specific prefixes
>>>> or nops and it just gets downright ugly.
>>>>
>>>> I'd probably approach this by having the compiler emit a directive which
>>>> states what the desired alignment at a particular point should be, then
>>>> allow the assembler to select the best method to get the desired alignment.
>>>
>>> Jeff,
>>>
>>> This is exactly part of our binutils side of the proposal, which I'll
>>> outline now:
>>>
>>> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
>>> a) multiple prefixes are benign in certain modes and are thus chosen for 
>>> padding
>>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
>>> more explicit (and they don't trigger a call to
>>> md_flush_pending_output)
>>>
>>> 2. Add a new pseudo-op to delineate alignment boundaries.  This is
>>> needed to signal any dispatch engine (below) to pad.  Here are my top
>>> two candidates, any feedback is appreciated:
>>> a) ".flush" new pseudo-op plumbed directly to "md_flush_pending_output()"
>>> b) ".padalign" which calls a new "md_pad_align()"
>>>
>>> 3. Add dispatch optimization infrastructure which
>>> a) is guarded by -mtune flag (and possibly other -f style flags)
>>> b) tracks assembled instruction attributes and their fragments
>>> c) can pad (insert benign prefixes) into previously assembled fragments
>>> d) maintains dispatch engine state (according to some subset of Reza's 
>>> rules)
>>>
>>> Discussion:
>>>
>>> The flags in 3a) should guard against these changes affecting current 
>>> behavior.
>>>
>>> The assembly tracking in 3b) is for bookkeeping only; the padding in
>>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
>>> the dispatch engine in 3d) signals.
>>>
>>> For compilers that know exactly how to pad for the new processor, the
>>> ability to pad explicitly using 1), 2), and .align/.balign/.p2align
>>> should be enough.
>>>
>>> For assembly programs and/or compilers that don't choose to do any
>>> dispatch optimization, it's anticipated that the engine in 3d) would
>>> be useful for optimizing for -mtune=bdver1.
>>>
>>> I'll post patches for these soon.
>>
>> Can you do it with directives only?
>
> In theory, if the compiler knows all sizes and offsets, yes (given
> some way to add multiple prefixes).
>
> However in practice, no.
>
> To get GCC to know all that would require replicating most assembler
> functionality in GCC, including parsing, assembling, and sizing
> (parts of output_insn() and its child output_*() functions).  We
> considered exposing one-line assembly as a library, but you have to
> provide (or reuse) the segment/frchain/fragment context, and I don't
> think a GCC->binutils dependency (other than at runtime) would be
> easy to introduce into the community.
>
> This wouldn't cover the assembly language case either.
>
> And remember, even if you have all the directives (and the
> programmer/compiler knows all), the assembler must remember potential
> padding locations until the decision (and knowledge about how) to pad
> arrives.
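
The "remember potential padding locations" point can be sketched as
below. This is a hypothetical illustration in Python, not gas code:
distribute_padding() and the idea of a per-instruction spare prefix
capacity are assumptions made for the example. Each already-assembled
instruction is remembered with the number of extra benign prefixes it
could still take; when a pad request finally arrives, the bytes are
spread over those slots first and whatever is left falls back to nops.

```python
def distribute_padding(slots, pad_bytes):
    """slots: spare prefix capacity of each remembered instruction,
    in assembly order.  Returns (prefixes_per_insn, nop_bytes)."""
    added = []
    remaining = pad_bytes
    for capacity in slots:
        take = min(capacity, remaining)  # never exceed the insn's limit
        added.append(take)
        remaining -= take
    return added, remaining              # leftover bytes become nops
```

For instance, with slots [3, 3] and a request for 4 bytes, the first
instruction takes three prefixes and the second takes one, so no nops
are needed.
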
>


The x86 assembler isn't an optimizing assembler. -mtune only does
instruction selection.  What you are proposing sounds like an optimizing
assembler to me. Are we going to support scheduling, macro, ...?


-- 
H.J.
