On 03.10.2016 at 06:34, Zoffix Znet via RT wrote:
Seems the issue has more to do with running an empty loop, rather than
performing a real computation.
This is a run on a 4-core box. Attempting to parallelize an empty loop makes
the execution 1 second slower:
[...]
But running actual real-life code makes it almost 4 times faster, as
would be expected on a 4-core box:
(Disclaimer: I have no idea of the internals, but I know a bit about
concurrency.)
This might be four cores competing to get update access to the loop counter.
Core-to-core synchronization of a memory cell with high-frequency
updates is an extremely expensive operation, with dozens or hundreds of
wait states to request exclusive cache-line access and to move the
current state of the variable from one CPU's cache to the next.
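
To illustrate the kind of contention I mean, here is a minimal sketch in Go
(not Perl 6, and not a claim about what Rakudo actually does internally).
The function names sharedCount and localCount are made up for this example.
The first has all workers increment one shared atomic counter, so the cache
line holding it has to bounce between cores; the second gives each worker a
private counter and merges once at the end. On a multi-core box I would
expect the shared variant to be noticeably slower, but the timings depend on
the hardware, and the compiler may simplify the private loop.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// sharedCount: every worker increments the SAME atomic counter,
// so all cores compete for exclusive access to one memory cell.
func sharedCount(workers, perWorker int) int64 {
	var total int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&total, 1)
			}
		}()
	}
	wg.Wait()
	return total
}

// localCount: every worker counts in a private variable and
// touches the shared counter only once, when it is done.
func localCount(workers, perWorker int) int64 {
	var total int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64
			for i := 0; i < perWorker; i++ {
				local++
			}
			atomic.AddInt64(&total, local)
		}()
	}
	wg.Wait()
	return total
}

func main() {
	// Assumed values: 4 workers to match the 4-core box in the report.
	const workers, perWorker = 4, 5000000

	start := time.Now()
	a := sharedCount(workers, perWorker)
	fmt.Println("shared counter:", a, time.Since(start))

	start = time.Now()
	b := localCount(workers, perWorker)
	fmt.Println("local counters:", b, time.Since(start))
}
```

Both variants produce the same total; only the amount of core-to-core
traffic differs. If the empty loop in the report shares its counter between
threads in a similar way, that alone could explain the slowdown.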