Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
                 CC|                            |roger at nextmovesoftware dot 

--- Comment #1 from Roger Sayle <roger at nextmovesoftware dot com> ---
I'm surprised that the difference in performance is (so) observable.  The
sal/sub/sal sequence (which admittedly has a long dependency chain) consists of
three dependent single-cycle-latency instructions, so about three cycles in
total, whereas the imul is documented by Agner Fog to also have a latency of 3.
The biggest difference may be the number of instructions and bytes (a decoder
bottleneck?).
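
For reference, a minimal C sketch of the two forms being compared, assuming
the three-instruction sequence computes x*240 as ((x << 4) - x) << 4.  The
function names are hypothetical, and unsigned arithmetic is used so that
wraparound is well defined:

```c
#include <stdint.h>

/* Hypothetical helper: the multiply as written in source.
   A compiler may emit a single imul for this. */
uint32_t mul240_imul(uint32_t x)
{
    return x * 240u;
}

/* Hypothetical helper: shift/sub/shift form, using 240 = (16 - 1) * 16.
   Each step depends on the previous one, hence the long dependency chain. */
uint32_t mul240_shift(uint32_t x)
{
    uint32_t t = x << 4;  /* t = 16 * x  */
    t -= x;               /* t = 15 * x  */
    return t << 4;        /* t = 240 * x */
}
```

Note that an optimizing compiler is free to turn either function into the
other form, so the generated assembly should be checked rather than assumed.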

Interestingly, if you specify -Os, gcc uses the imul.

Is the register value being multiplied by 240 always 1 or 0, allowing the
hardware to invoke some form of bypass?  The constant 240 has four set bits
(popcount 4), so implementing this in only three instructions (which may be
scheduled concurrently with other operations) is pretty impressive.  Perhaps
someone can post a microbenchmark?
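
As a starting point, here is a rough sketch of what such a microbenchmark
might look like.  Everything in it is an assumption for illustration: the
function names are hypothetical, the loop adds the counter so the value does
not collapse to zero after a few iterations, and clock() is only a coarse
timer.  The idea would be to build the same file once with -O2 and once with
-Os and compare, bearing in mind that -Os changes more than just the
multiply:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical: a dependent chain of multiplies by 240, so that latency
   rather than throughput dominates.  noinline (a GCC attribute) keeps the
   loop from being folded into the caller. */
__attribute__((noinline))
uint32_t chain_mul(uint32_t seed, uint64_t iters)
{
    uint32_t x = seed;
    for (uint64_t i = 0; i < iters; i++)
        x = x * 240u + (uint32_t)i;  /* each iteration depends on the last */
    return x;
}

/* Hypothetical: returns elapsed CPU time in seconds for one run and stores
   the result in *out so the computation cannot be discarded. */
double time_chain(uint32_t seed, uint64_t iters, uint32_t *out)
{
    clock_t start = clock();
    *out = chain_mul(seed, iters);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_chain() with a large iteration count from main() and printing
both the time and the result would give a first number; checking the
disassembly of chain_mul confirms which sequence was actually measured.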
