[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

peter at cordes dot ca Wed, 03 Feb 2016 22:50:19 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557


Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
Besides code-size, uop-cache size is a factor for Intel CPUs.  imul is only a
single uop, while neg/and is 2 uops.  Total number of instructions is a factor
for other CPUs, too, but only locally.  (Saving uop-cache space can mean
speedups for *other* code that doesn't get evicted).
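
For reference, a minimal C sketch of the two forms being compared.  The function names are made up for illustration and are not taken from the bug's testcase.  Both return x when b is true and 0 otherwise: the first is the source-level multiply that can become a single imul, the second spells out the neg/and sequence by hand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Multiply form: the bool converts to 0 or 1, so this is x*0 or x*1. */
uint32_t select_mul(uint32_t x, bool b)
{
    return x * (uint32_t)b;
}

/* Mask form: negating 0/1 gives 0 or all-ones, which is then ANDed in. */
uint32_t select_mask(uint32_t x, bool b)
{
    uint32_t mask = -(uint32_t)b;   /* 0x00000000 or 0xFFFFFFFF */
    return x & mask;
}
```

Which instruction sequence gcc actually emits for either function depends on the gcc version and the -mtune setting, so compare the output of gcc -O2 -S rather than assuming a one-to-one mapping.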


If the operation isn't part of a long dependency chain, imul is a better choice
on almost all CPUs.  Let OOO execution sort it out.

When latency does matter somewhat, we have to weigh the code-size and extra
insns/uops of neg/and against the slightly (or much) higher latency of imul.

Agner Fog's instruction tables indicate that 32bit imul is probably ok for
tune=generic, but 64bit imul should maybe only be used with -mtune=intel, and
absolutely not with tune=atom.  Maybe not with tune=silvermont either, although
it does have modest OOO capabilities to hide the latency.  (It's not as wide,
so saving insns maybe matters more?)  I'm not sure if tune=intel is supposed to
put much weight on pre-Silvermont Atom.

From Agner Fog's spreadsheet, updated 2016-Jan09:

           uops/m-ops   latency   recip-throughput   execution pipe/port
Intel:SnB-family(Sandybridge through Skylake)
imul r32,r32:  1          3            1          p1
imul r64,r64:  1          3            1          p1

AMD:bdver1-3
imul r32,r32:  1          4            2          EX1
imul r64,r64:  1          6            4          EX1


Intel:Silvermont
imul r32,r32:  1          3            1          IP0
imul r64,r64:  1          5            2          IP0

AMD:bobcat/jaguar
imul r32,r32:  1          3            1          I0
imul r64,r64:  1          6            4          I0




Older hardware:

Intel:Nehalem
imul r32,r32:  1          3            1          p1
imul r64,r64:  1          3            1          p0

Intel:Merom/Penryn(Core2)
imul r32,r32:  1          3            1          p1
imul r64,r64:  1          5            2          p0
  (p0: same as FP mul, maybe borrows its wider multiplier?)

Intel:Atom
imul r32,r32:  1          5            2          Alu0,Mul
imul r64,r64:  6         13           11          Alu0,Mul

AMD:K8/K10
imul r32,r32:  1          3            1          ALU0
imul r64,r64:  1          4            2          ALU0_1 (uses units 0 and 1)

VIA:Nano3000
imul r32,r32:  1          2            1          I2
imul r64,r64:  1          5            2          MA


If gcc keeps track of execution port pressure at all, it should also avoid imul
when surrounding code is multiply-heavy (or doing other stuff that also
contends for the same resources as imul).  I didn't check on neg/and, but I
assume every microarchitecture can run them on any port with one cycle latency
each.
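
To make the tradeoff above concrete, here is a rough C sketch of a per-microarchitecture decision.  It is only an illustration: prefer_imul, the struct, and the uarch labels are hypothetical names, not gcc's cost model or real -mtune values.  The imul numbers are copied from the table above, and the neg/and cost (2 uops, 2 cycles) is my unchecked assumption of one cycle each.  Port pressure is not modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* imul uop counts and latencies, copied from the table above. */
struct imul_cost {
    const char *uarch;      /* hypothetical label, not a -mtune name */
    int uops32, lat32;
    int uops64, lat64;
};

static const struct imul_cost imul_table[] = {
    { "snb-family", 1, 3, 1,  3 },
    { "bdver",      1, 4, 1,  6 },
    { "silvermont", 1, 3, 1,  5 },
    { "jaguar",     1, 3, 1,  6 },
    { "atom",       1, 5, 6, 13 },
};

/* Assumed cost of the neg/and sequence: 2 uops, 1 cycle each (unchecked). */
enum { NEG_AND_UOPS = 2, NEG_AND_LATENCY = 2 };

/* Prefer imul when it saves uops and costs at most max_extra_latency
 * cycles more than neg/and.  Unknown uarch labels fall back to neg/and. */
bool prefer_imul(const char *uarch, bool is64, int max_extra_latency)
{
    for (size_t i = 0; i < sizeof imul_table / sizeof imul_table[0]; i++) {
        const struct imul_cost *c = &imul_table[i];
        if (strcmp(c->uarch, uarch) != 0)
            continue;
        int uops = is64 ? c->uops64 : c->uops32;
        int lat  = is64 ? c->lat64  : c->lat32;
        return uops < NEG_AND_UOPS
            && lat - NEG_AND_LATENCY <= max_extra_latency;
    }
    return false;
}
```

With a large max_extra_latency (the "not part of a long dependency chain" case), this picks imul everywhere except 64bit on Atom, where imul is 6 uops.  With a small budget, it starts rejecting 64bit imul on the AMD and Silvermont entries too.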

Getting off topic here:

tune=generic should account for popularity of CPUs, right?  So I hope it won't
sacrifice much speed for SnB-family in order to avoid something that's slow on
Pentium4.  (e.g. P4 doesn't like inc/dec, but all other CPUs rename the
carry flag separately to avoid the false dep.  Not a great example, because
that only saves a couple code bytes.  shrd isn't a good example either, because
it's slow even on AMD Bulldozer.)

Is there a tune=no_glass_jaws that *will* give up speed (or code size) for
common CPUs in order to avoid things that are *really* bad on some rare
microarchitectures (especially old ones)?  Or maybe a tune=desktop that doesn't
care what's slow on Atom/Jaguar?  People distributing binaries that probably
won't be used on Atom/Silvermont netbooks might use that.

Anyway, I think it would be neat to have the option of making a binary that
will be quite good on SnB and not have major problems on recent AMD, without
caring if it has the occasional slow instruction on Atom or K8.  Or
alternatively, to have a binary that doesn't suck badly anywhere.
