Hi all,


I am running a DerivO3CPU-based SE-mode simulation with the x86 ISA. The
microbenchmark that I am running contains a loop of independent multiply
instructions. An excerpt from the disassembly of the benchmark loop looks
like this:



  400c07:             48 0f af d2                         imul   %rdx,%rdx

  400c0b:             48 0f af db                         imul   %rbx,%rbx

…
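For readers who want to reproduce a loop of this shape, here is a minimal sketch. It is not the actual benchmark source: the function name, arguments, and iteration count are placeholders, and the compiler is free to choose different registers or instruction forms than the ones in the disassembly above.

```cpp
#include <cstdint>

// Hypothetical kernel (not the original benchmark): two squarings per
// iteration. The 'a' chain and the 'b' chain never read each other's
// values, so the two multiplies in one iteration are independent.
std::uint64_t independent_squares(std::uint64_t a, std::uint64_t b, int iters) {
    for (int i = 0; i < iters; ++i) {
        a *= a;  // may compile to something like: imul %rdx,%rdx
        b *= b;  // may compile to something like: imul %rbx,%rbx
    }
    // Combine both results so neither chain is optimized away.
    return a ^ b;
}
```

Unsigned 64-bit arithmetic is used so that overflow wraps instead of being undefined behavior.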



When I look at the O3PipeView output, I see that all the independent multiply
instructions are issued sequentially, even though there are 2 multiply
functional units and each of them is pipelined:



[................f....dn.pi..c.r.................................................]-( 16664000.0) 0x00400c07.0 IMUL_R_R [ 34983]
[................f....dn.p...ic.r................................................]-( 16664000.0) 0x00400c07.1 IMUL_R_R [ 34984]
[................f....dn.p...ic.r................................................]-( 16664000.0) 0x00400c07.2 IMUL_R_R [ 34985]
[................f....dn.p...i..c.r..............................................]-( 16664000.0) 0x00400c0b.0 IMUL_R_R [ 34986]
[................f....dn.p......ic.r.............................................]-( 16664000.0) 0x00400c0b.1 IMUL_R_R [ 34987]
[................f....dn.p......ic.r.............................................]-( 16664000.0) 0x00400c0b.2 IMUL_R_R [ 34988]

…



Digging into it further, I found that each of the IMUL_R_R instructions has
implicit registers 0 and 1 (ProdHi and ProdLow) added as both a source and a
destination in the generated code. The following excerpt is from
decoder-ns.cc.inc:



Mul1sFlags::Mul1sFlags(…)
    {
        …
        setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_FOLDED(src1, foldOBit)));
        setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_FOLDED(src2, foldOBit)));
        setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_IMPLICIT(0)));
        setDestRegIdx(_numDestRegs++, RegId(IntRegClass, INTREG_IMPLICIT(0)));
        _numIntDestRegs++;
        setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_IMPLICIT(1)));
        setDestRegIdx(_numDestRegs++, RegId(IntRegClass, INTREG_IMPLICIT(1)));
        …
    }



As a result, all the independent multiply instructions execute
sequentially, and the multiply throughput is 1/3.
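To illustrate why a shared implicit register would serialize otherwise independent multiplies, here is a toy dependency model. It is a sketch only, not gem5 code: the `Op` struct, the `completion_cycles` function, the register names, and the fixed latency are all assumptions made for illustration. It tracks only read-after-write dependencies and assumes unlimited functional units.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Toy model (hypothetical, not gem5 code): each op lists the registers
// it reads and the registers it writes.
struct Op {
    std::vector<std::string> srcs;
    std::vector<std::string> dests;
};

// Returns the cycle at which each op completes, assuming an op can start
// only once every source register written by an earlier op is available.
// Functional units are unlimited; every op takes 'latency' cycles.
std::vector<int> completion_cycles(const std::vector<Op>& ops, int latency) {
    std::map<std::string, int> ready;  // register -> cycle its value is ready
    std::vector<int> done;
    for (const Op& op : ops) {
        int start = 0;
        for (const std::string& r : op.srcs) {
            auto it = ready.find(r);
            if (it != ready.end()) start = std::max(start, it->second);
        }
        const int finish = start + latency;
        for (const std::string& r : op.dests) ready[r] = finish;
        done.push_back(finish);
    }
    return done;
}
```

With a latency of 3, two ops that touch only `rdx` and only `rbx` both complete at cycle 3 in this model. If each op additionally reads and writes the same implicit register, the second op has to wait for the first and completes at cycle 6, no matter how many functional units are available.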

If we have multiple functional units, should these implicit registers
(ProdHi and ProdLow) be replicated for each of them? And if so, why add them
as source and destination at all?

Any clarifications or workaround for this?



Thanks,

Mohit