https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756
Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #1 from Roger Sayle <roger at nextmovesoftware dot com> ---
I'm surprised that the difference in performance is (so) observable.  The
sal/sub/sal sequence (which admittedly has a long dependency chain) consists
of three single-cycle latency instructions, whereas the imul is documented by
Agner Fog to also have a latency of 3.  The biggest difference may be the
number of instructions and bytes (a decoder bottleneck?).  Interestingly, if
you specify -Os, gcc uses the imul.

Is the register value being multiplied by 240 always 1 or 0, allowing the
hardware to invoke some form of bypass?

The constant 240 has a popcount of 4 (four set bits), so implementing this
multiplication in only three instructions (which may be scheduled concurrently
with other operations) is pretty impressive.  Perhaps someone can post a
microbenchmark?
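For reference, the shift-and-subtract sequence computes x*240 as
((x << 4) - x) << 4, i.e. 15*16*x.  A minimal C sketch of the two
alternatives being compared; the function names are mine, purely for
illustration, and which sequence GCC actually emits depends on options
and target:

/* Hypothetical helpers illustrating the two code sequences. */
static inline unsigned long mul240_shifts(unsigned long x)
{
    unsigned long t = x << 4;   /* sal: x*16              */
    t -= x;                     /* sub: x*16 - x = x*15   */
    return t << 4;              /* sal: x*15*16 = x*240   */
}

static inline unsigned long mul240_imul(unsigned long x)
{
    return x * 240;             /* per the report, -Os emits a single imul */
}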
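As for the requested microbenchmark, a rough, untested sketch: the loop
chains each result into the next input so that latency, not throughput,
dominates; the iteration count and timing method are arbitrary choices of
mine, and the same source can be compiled with e.g. -O2 and -Os to compare
the two instruction sequences discussed above.

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile unsigned long seed = 1;   /* volatile so the loop isn't folded away */
    unsigned long x = seed;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < 100000000UL; i++)
        x = (x * 240) ^ i;             /* serial dependency chain through x */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                  + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("result %lu, %.3f s\n", x, secs);
    return 0;
}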