https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
Besides code-size, uop-cache size is a factor for Intel CPUs.  imul is only a
single uop, while neg/and is 2 uops.  Total number of instructions is a factor
for other CPUs, too, but only locally.  (Saving uop-cache space can mean
speedups for *other* code that doesn't get evicted.)

If the operation isn't part of a long dependency chain, imul is a better choice
on almost all CPUs.  Let OOO execution sort it out.

When latency matters somewhat, we have to weigh the tradeoff of code-size / more
insns and uops vs. slightly (or much) higher latency.

Agner Fog's instruction tables indicate that 32bit imul is probably ok for
tune=generic, but 64bit imul should maybe only be used with -mtune=intel, and
absolutely not with tune=atom.  Maybe not with tune=silvermont either, but it
does have modest OOO capabilities to hide the latency.  (It's not as wide, so
saving insns maybe matters more?)  I'm not sure if tune=intel is supposed to
put much weight on pre-Silvermont Atom.

From Agner Fog's spreadsheet, updated 2016-Jan09:

                 uops/m-ops  latency  recip-throughput  execution pipe/port
Intel:SnB-family (Sandybridge through Skylake)
 imul r32,r32:   1           3        1                 p1
 imul r64,r64:   1           3        1                 p1
AMD:bdver1-3
 imul r32,r32:   1           4        2                 EX1
 imul r64,r64:   1           6        4                 EX1
Intel:Silvermont
 imul r32,r32:   1           3        1                 IP0
 imul r64,r64:   1           5        2                 IP0
AMD:bobcat/jaguar
 imul r32,r32:   1           3        1                 I0
 imul r64,r64:   1           6        4                 I0

old HW
Intel:Nehalem
 imul r32,r32:   1           3        1                 p1
 imul r64,r64:   1           3        1                 p0
Intel:Merom/Penryn (Core2)
 imul r32,r32:   1           3        1                 p1
 imul r64,r64:   1           5        2                 p0  (same as FP mul, maybe borrows its wider multiplier?)
Intel:Atom
 imul r32,r32:   1           5        2                 Alu0,Mul
 imul r64,r64:   6           13       11                Alu0,Mul
AMD:K8/K10
 imul r32,r32:   1           3        1                 ALU0
 imul r64,r64:   1           4        2                 ALU0_1  (uses units 0 and 1)
VIA:Nano3000
 imul r32,r32:   1           2        1                 I2
 imul r64,r64:   1           5        2                 MA

If gcc keeps track of execution port pressure at all, it should also avoid imul
when surrounding code is multiply-heavy (or doing other stuff that also
contends for the same resources as imul).  I didn't check on neg/and, but I
assume every microarchitecture can run them on any port with one cycle latency
each.

Getting off topic here:

tune=generic should account for popularity of CPUs, right?  So I hope it won't
sacrifice much speed for SnB-family in order to avoid something that's slow on
Pentium4.  (e.g. P4 doesn't like inc/dec, but all other CPUs rename the carry
flag separately to avoid the false dep.  Not a great example, because that only
saves a couple code bytes.  shrd isn't a good example, because it's slow even
on AMD Bulldozer.)

Is there a tune=no_glass_jaws that *will* give up speed (or code size) for
common CPUs in order to avoid things that are *really* bad on some rare
microarchitectures (especially old ones)?

Or maybe a tune=desktop that doesn't care what's slow on Atom/Jaguar?  People
distributing binaries that probably won't be used on Atom/Silvermont netbooks
might use that.

Anyway, I think it would be neat to have the option of making a binary that
will be quite good on SnB, not have major problems on recent AMD, but I don't
care if it has the occasional slow instruction on Atom or K8.  Or alternatively
to have a binary that doesn't suck badly anywhere.
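
For reference, a minimal C sketch of the two forms compared at the top of this comment, assuming the operation in question is a multiply by a value known to be 0 or 1. The function names (select_mul, select_mask, check_forms) are made up for illustration and are not from gcc. Which instructions a given compiler actually emits for each form depends on its version and tuning flags, and it is free to turn one form into the other.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply form: the natural lowering is a single imul.
 * Assumption: b is always 0 or 1. */
uint64_t select_mul(uint64_t x, uint64_t b)
{
    return x * b;
}

/* Mask form: (0 - b) is 0 when b == 0 and all-ones when b == 1,
 * so the natural lowering is neg followed by and. */
uint64_t select_mask(uint64_t x, uint64_t b)
{
    return x & (0 - b);
}

/* Hypothetical helper: checks that both forms agree for one x
 * on both legal values of b. */
void check_forms(uint64_t x)
{
    assert(select_mul(x, 0) == select_mask(x, 0));
    assert(select_mul(x, 1) == select_mask(x, 1));
}
```

Compiling this with something like gcc -O2 -S under different -mtune= settings is one way to see which sequence is chosen for each form.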