On 03.10.2016 at 06:34, Zoffix Znet via RT wrote:
Seems the issue has more to do with running an empty loop, rather than
performing a real computation.

This is a run on a 4-core box. Attempting to parallelize an empty loop makes
the execution 1 second slower:

[...]

But running real-life code makes it almost 4 times faster, as
would be expected on a 4-core box:

(Disclaimer: I have no idea about the internals, but I know a bit about concurrency.)

This might be four cores competing for update access to the loop counter.
Core-to-core synchronization of a memory cell that is updated at high frequency is extremely expensive: each update can cost dozens or hundreds of wait states while a core requests exclusive access to the cache line and the variable's current state moves from one CPU's cache to the next.
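
To make the suspected effect concrete, here is a minimal sketch in Go. It is not taken from the Rakudo internals, which I don't know; the function names sharedCount and localCount are made up for illustration. The first variant has all workers increment one shared atomic counter, so the cache line holding it bounces between cores on every update. The second lets each worker count privately and publish its total once.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// sharedCount (hypothetical name) has every worker increment one shared
// atomic counter, so each increment needs exclusive ownership of the
// same cache line.
func sharedCount(workers, perWorker int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&counter, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter)
}

// localCount (hypothetical name) lets every worker count in a private
// variable and touches the shared counter only once, at the end.
func localCount(workers, perWorker int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64
			for i := 0; i < perWorker; i++ {
				local++
			}
			atomic.AddInt64(&counter, local)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter)
}

func main() {
	// Worker count matches the 4-core box above; the iteration count is arbitrary.
	const workers, perWorker = 4, 5000000

	start := time.Now()
	a := sharedCount(workers, perWorker)
	fmt.Println("shared:", a, time.Since(start))

	start = time.Now()
	b := localCount(workers, perWorker)
	fmt.Println("local: ", b, time.Since(start))
}
```

Both variants arrive at the same total; only the timings differ, and those depend on the machine and the compiler, so I'm not quoting numbers. If the empty loop really is dominated by counter contention, it should behave like the first variant, while a loop body doing real work spends most of its time away from the shared counter and looks more like the second.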
