On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill <quentin.neill....@gmail.com> wrote: > On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <l...@redhat.com> wrote: >> On 06/10/10 13:52, H.J. Lu wrote: >>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill >>> <quentin.neill....@gmail.com> wrote: >>>> Cross-posting Reza's call for feedback to the binutils list since it >>>> is relevant - s ee the last few paragraphs regarding how to >>>> "solve the alignment problem". >>>> >>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402 >>>> >>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_r...@yahoo.com> >>>> wrote: >>>>> Hi, >>>>> >>>>> We are in the process of adding a feature to GCC to take advantage >>>>> of a new hardware feature in the latest AMD micro processor. This >>>>> feature requires a certain mix, ordering and alignments in >>>>> instruction sequences to obtain the expected hardware performance. >>>>> >>>>> I am asking the community to review this high level implementation >>>>> design and give me direction or advice. >>>>> >>>>> The new hardware issues two windows of the size N bytes of >>>>> instructions in every cycle. It goes into accelerate mode if the >>>>> windows have the right combination of instructions or alignments. Our >>>>> goal is to maximize the IPC by proper instruction scheduling and >>>>> alignments. >>>>> >>>>> Here is a summary of the most important requirements: >>>>> >>>>> a) Maximum of N instructions per window. >>>>> b) An instruction may cross the first window. >>>>> c) Each window can have maximum of x memory loads and y memory >>>>> stores . >>>>> d) The total number of immediate constants in the instructions >>>>> of a window should not exceed k. >>>>> e) The first window must be aligned on 16 byte boundary. >>>>> f) A Window set terminates when a branch exists in a window. >>>>> g) The number of allowed prefixes varies for instructions. >>>>> h) A window set needs to be padded by prefixes in instructions >>>>> or terminated by nops to ensure adherence to the rules. >>>>> >>>>> We have the following implementation plan for GCC: >>>>> >>>>> 1) Modify the Haifa scheduler to make the desired arrangement of >>>>> instructions for the two dispatch windows. The scheduler is called >>>>> once before and once after register allocation as usual. In both >>>>> cases it performs dispatch scheduling along with its normal job of >>>>> instruction scheduling. >>>>> >>>>> The advantage of doing it before register allocation is avoiding >>>>> extra dependencies caused by register allocation which may become >>>>> an obstacle to movement of instructions. The advantage of doing >>>>> it after register allocation is a consideration for spilling code >>>>> which may be generated by the register allocator. >>>>> >>>>> The algorithm we use is: >>>>> >>>>> a) Considering the current dispatch window set, choose the first >>>>> instruction from ready queue that does not violate dispatch rules. >>>>> b) When an instruction is selected and scheduled, inform the >>>>> dispatcher code about the instruction. This step keeps track of the >>>>> instruction content of windows for future evaluation. It also manages >>>>> the window set by closing and opening new virtual dispatch windows. >>>>> >>>>> 2) Insertion of alignment code. >>>>> >>>>> In x86 alignment is done by inserting prefixes or by generating >>>>> nops. As the object code is generated by the assembler in GCC, some >>>>> information such as sizes of branches are unknown until assembly or >>>>> link time. To do alignments related to dispatch correctly in GCC, >>>>> we need to iteratively compute prefixes and branch sizes until >>>>> its convergence. This pass currently does not exist in GCC, but it >>>>> exists in the assembler. >>>>> >>>>> There are two possible approaches to solve alignment problem. >>>>> >>>>> a) Let the assembler performs the alignments and padding needed >>>>> to adhere with the new machine dispatching rules and avoid an extra >>>>> pass in GCC. >>>>> b) Add a new pass to mimic what assembler does before generating >>>>> the assembly listing in GCC and insert the required alignments. >>>>> >>>>> I appreciate your comments on the proposed implementation procedure >>>>> and the choices a or b above. >>>>> >>> >>> I don't this should be done in assembler. Assembler should just assemble >>> the assembly input. >> >> That adds quite a bit of complication to the compiler though -- getting the >> instruction lengths right (and thus proper packing & alignment) can be >> extremely difficult. I did some experiments with this on a target with >> *fixed* instruction lengths a while back and even though the port tried hard >> to get lengths right, it would routinely miss something. Ultimately I >> decided that it forcing the compiler to know instruction lengths with a very >> high degree of accuracy wasn't a sane thing to do. Dealing with variable >> instruction lengths just adds yet another complexity to the situation. Then >> add the complication of needing to add specific prefixes or nops and it just >> gets downright ugly. >> >> I'd probably approach this by having the compiler emit a directive which >> states what the desired alignment at a particular point should be, then >> allow the assembler to select the best method to get the desired alignment. > > Jeff, > > This is exactly part of our binutils side of the proposal, which I'll > outline now > > 1. Allow multiple prefixes for ADDR and DS (and possibly others) > a) multiple prefixes are benign in certain modes and are thus chosen for > padding > b) although ".byte" works, the "ds" and "addr" prefix mnemonics are > more explicit (and they don't trigger a call to > md_flush_pending_output) > > 2. Add new pseudo-op to delineate alignment boundaries. This is > needed to signal any dispatch engine (below) to pad. Here are my top > two candidates, any feedback is appreciated: > a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()" > b) ".padalign" which calla a new "md_pad_align()" > > 3. Add dispatch optimization infrastructure which > a) is guarded by -mtune flag (and possibly other -f style flags) > b) tracks assembled instruction attributes and their fragments > c) can pad (insert benign prefixes) into previously assembled fragments > d) maintains dispatch engine state (according to some subset of Reza's rules) > > Discussion: > > The flags in 3a) should guard against these changes affecting current > behavior. > > The assembly tracking in 3b) is for bookkeeping only; the padding in > 3c) would only occur when a compiler uses the pseudo-op in 2) or when > the dispatch engine in 3d) signals. > > For compilers that know exactly how to pad for the new processor, the > ability to > pad explicitly using 1), 2), and .align/.balign/.p2align should be enough. > > For assembly programs and/or compilers that don't choose to do any > dispatch optimization, it's anticipated that the engine in 3d) would > be useful for optimizing for -mtune=bdver1 > > I'll post patches for these soon.
Can you do it with directives only? -- H.J.