On Thu, Jun 10, 2010 at 3:09 PM, Quentin Neill <quentin.neill....@gmail.com> wrote: > On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <hjl.to...@gmail.com> wrote: >> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill >> <quentin.neill....@gmail.com> wrote: >>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <l...@redhat.com> wrote: >>>> On 06/10/10 13:52, H.J. Lu wrote: >>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill >>>>> <quentin.neill....@gmail.com> wrote: >>>>>> Cross-posting Reza's call for feedback to the binutils list since it >>>>>> is relevant - s ee the last few paragraphs regarding how to >>>>>> "solve the alignment problem". >>>>>> >>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402 >>>>>> >>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_r...@yahoo.com> >>>>>> wrote: >>>>>>> Hi, >>>>>>> >>>>>>> We are in the process of adding a feature to GCC to take advantage >>>>>>> of a new hardware feature in the latest AMD micro processor. This >>>>>>> feature requires a certain mix, ordering and alignments in >>>>>>> instruction sequences to obtain the expected hardware performance. >>>>>>> >>>>>>> I am asking the community to review this high level implementation >>>>>>> design and give me direction or advice. >>>>>>> >>>>>>> The new hardware issues two windows of the size N bytes of >>>>>>> instructions in every cycle. It goes into accelerate mode if the >>>>>>> windows have the right combination of instructions or alignments. Our >>>>>>> goal is to maximize the IPC by proper instruction scheduling and >>>>>>> alignments. >>>>>>> >>>>>>> Here is a summary of the most important requirements: >>>>>>> >>>>>>> a) Maximum of N instructions per window. >>>>>>> b) An instruction may cross the first window. >>>>>>> c) Each window can have maximum of x memory loads and y memory >>>>>>> stores . >>>>>>> d) The total number of immediate constants in the instructions >>>>>>> of a window should not exceed k. >>>>>>> e) The first window must be aligned on 16 byte boundary. >>>>>>> f) A Window set terminates when a branch exists in a window. >>>>>>> g) The number of allowed prefixes varies for instructions. >>>>>>> h) A window set needs to be padded by prefixes in instructions >>>>>>> or terminated by nops to ensure adherence to the rules. >>>>>>> >>>>>>> We have the following implementation plan for GCC: >>>>>>> >>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of >>>>>>> instructions for the two dispatch windows. The scheduler is called >>>>>>> once before and once after register allocation as usual. In both >>>>>>> cases it performs dispatch scheduling along with its normal job of >>>>>>> instruction scheduling. >>>>>>> >>>>>>> The advantage of doing it before register allocation is avoiding >>>>>>> extra dependencies caused by register allocation which may become >>>>>>> an obstacle to movement of instructions. The advantage of doing >>>>>>> it after register allocation is a consideration for spilling code >>>>>>> which may be generated by the register allocator. >>>>>>> >>>>>>> The algorithm we use is: >>>>>>> >>>>>>> a) Considering the current dispatch window set, choose the first >>>>>>> instruction from ready queue that does not violate dispatch rules. >>>>>>> b) When an instruction is selected and scheduled, inform the >>>>>>> dispatcher code about the instruction. This step keeps track of the >>>>>>> instruction content of windows for future evaluation. It also manages >>>>>>> the window set by closing and opening new virtual dispatch windows. >>>>>>> >>>>>>> 2) Insertion of alignment code. >>>>>>> >>>>>>> In x86 alignment is done by inserting prefixes or by generating >>>>>>> nops. As the object code is generated by the assembler in GCC, some >>>>>>> information such as sizes of branches are unknown until assembly or >>>>>>> link time. To do alignments related to dispatch correctly in GCC, >>>>>>> we need to iteratively compute prefixes and branch sizes until >>>>>>> its convergence. This pass currently does not exist in GCC, but it >>>>>>> exists in the assembler. >>>>>>> >>>>>>> There are two possible approaches to solve alignment problem. >>>>>>> >>>>>>> a) Let the assembler performs the alignments and padding needed >>>>>>> to adhere with the new machine dispatching rules and avoid an extra >>>>>>> pass in GCC. >>>>>>> b) Add a new pass to mimic what assembler does before generating >>>>>>> the assembly listing in GCC and insert the required alignments. >>>>>>> >>>>>>> I appreciate your comments on the proposed implementation procedure >>>>>>> and the choices a or b above. >>>>>>> >>>>> >>>>> I don't this should be done in assembler. Assembler should just assemble >>>>> the assembly input. >>>> >>>> That adds quite a bit of complication to the compiler though -- getting the >>>> instruction lengths right (and thus proper packing & alignment) can be >>>> extremely difficult. I did some experiments with this on a target with >>>> *fixed* instruction lengths a while back and even though the port tried >>>> hard >>>> to get lengths right, it would routinely miss something. Ultimately I >>>> decided that it forcing the compiler to know instruction lengths with a >>>> very >>>> high degree of accuracy wasn't a sane thing to do. Dealing with variable >>>> instruction lengths just adds yet another complexity to the situation. >>>> Then >>>> add the complication of needing to add specific prefixes or nops and it >>>> just >>>> gets downright ugly. >>>> >>>> I'd probably approach this by having the compiler emit a directive which >>>> states what the desired alignment at a particular point should be, then >>>> allow the assembler to select the best method to get the desired alignment. >>> >>> Jeff, >>> >>> This is exactly part of our binutils side of the proposal, which I'll >>> outline now >>> >>> 1. Allow multiple prefixes for ADDR and DS (and possibly others) >>> a) multiple prefixes are benign in certain modes and are thus chosen for >>> padding >>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are >>> more explicit (and they don't trigger a call to >>> md_flush_pending_output) >>> >>> 2. Add new pseudo-op to delineate alignment boundaries. This is >>> needed to signal any dispatch engine (below) to pad. Here are my top >>> two candidates, any feedback is appreciated: >>> a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()" >>> b) ".padalign" which calla a new "md_pad_align()" >>> >>> 3. Add dispatch optimization infrastructure which >>> a) is guarded by -mtune flag (and possibly other -f style flags) >>> b) tracks assembled instruction attributes and their fragments >>> c) can pad (insert benign prefixes) into previously assembled fragments >>> d) maintains dispatch engine state (according to some subset of Reza's >>> rules) >>> >>> Discussion: >>> >>> The flags in 3a) should guard against these changes affecting current >>> behavior. >>> >>> The assembly tracking in 3b) is for bookkeeping only; the padding in >>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when >>> the dispatch engine in 3d) signals. >>> >>> For compilers that know exactly how to pad for the new processor, the >>> ability to >>> pad explicitly using 1), 2), and .align/.balign/.p2align should be enough. >>> >>> For assembly programs and/or compilers that don't choose to do any >>> dispatch optimization, it's anticipated that the engine in 3d) would >>> be useful for optimizing for -mtune=bdver1 >>> >>> I'll post patches for these soon. >> >> Can you do it with directives only? > > In theory, if the compiler knows all sizes and offsets, yes (given > some way to add multiple prefixes). > > However in practice, no. > > To get GCC to know all would require replicating most assembler > functionality in GCC, including parsing, assembling, and sizing > (parts of output_insn() and its child output_*() functions). We > considered exposing one-line assembly as a library but you have to > provide (or reuse) the segment/frchain/fragment context, and I don't > think introducing a GCC->binutils dependency (other than runtime) > would be easy to introduce into the community. > > This wouldn't cover the assembly language case either. > > And remember, even if you have all the directives (and the > programmer/compiler knows all), the assembler must remember potential > padding locations until the decision (and knowledge about how) to pad > arrives. >
x86 assembler isn't an optimizing assembler. -mtune only does instruction selection. What you are proposing sounds like an optimizing assembler to me. Are we going to support scheduling, macro, ...? -- H.J.