On Tue, 25 Sep 2012, Richard Sandiford wrote:

> >> According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
> >> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles;
> >> obviously the resulting larger code may adversely affect cache performance
> >> in some scenarios.
> >
> > That's not how the 4650 DFA models it though.
> >
> > (define_insn_reservation "generic_hilo" 1
> >   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> >   "imuldiv*3")
> >
> > (define_insn_reservation "r4650_imul" 4
> >   (and (eq_attr "cpu" "r4650")
> >        (eq_attr "type" "imul,imul3,imadd"))
> >   "imuldiv*4")
> >
> > So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> > rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.
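
For reference, the occupancy arithmetic in the quoted DFA can be spelled
out as below.  This is only a sketch: unit_occupancy and mult_favoured_p
are made-up names, not GCC functions, and the comparison looks at how long
the imuldiv unit stays reserved and nothing else (result latency and code
size are ignored).

```c
/* Sketch only: these helpers are hypothetical and not part of GCC.  */

/* Cycles a unit stays reserved when N_INSNS insns each hold it for
   CYCLES_EACH cycles back to back; "imuldiv*3" means 3 cycles each.  */
static int
unit_occupancy (int n_insns, int cycles_each)
{
  return n_insns * cycles_each;
}

/* Nonzero if, going by unit occupancy alone, a single MULT is no worse
   than an MTLO + MTHI pair.  */
static int
mult_favoured_p (int mult_cycles, int hilo_cycles)
{
  return unit_occupancy (1, mult_cycles) <= unit_occupancy (2, hilo_cycles);
}
```

With the quoted reservations, unit_occupancy (2, 3) is 6 for MTLO + MTHI
and unit_occupancy (1, 4) is 4 for MULT, which is where the "6 rather than
4 cycles" comes from.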

I can't track down a reference on R4650 MTHI/MTLO latency; I'd be happy
to learn of one, or otherwise I wonder where the delay is coming from.

Also a small update: apparently MULT takes only 3 clocks on the R4650
where the operands are 16 bits (unsure if it is enough if only one is; for
a zero by zero multiplication it surely does not matter though).  So I
think using a MULT here is at least reasonable.

> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
> MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
> required for this optimisation is getting a bit ridiculous.

I have double-checked some documentation, and in fact many MIPS cores,
including the current ones, have a configuration option to include either
a high-performance or an area-efficient MD unit.  Take the M14Kc for
example -- its high-performance unit has a one-cycle latency/issue rate
for 16-bit multiplication (two-cycle for the full 32 bits; here the width
of rt is explicitly named) and the area-efficient one has a 32-cycle
latency/issue rate regardless of the operand size (obviously iterating
over addition one bit at a time).  The latency of MTHI/MTLO is 1 across
both units.

So I think this can't really be selected automatically for all cores;
some human-supplied knowledge about the MD unit used is required.  That
obviously affects other operations too, e.g. some multiplications
involving a constant may be cheaper to do either directly or with a
sequence of additions, depending on the MD unit present (unless optimising
for size, of course).

  Maciej