When doing some tests with my ARM64 code generator, I saw the
performance of my software drop from 75% of gcc's speed to just 60%.
What was happening?
One of the reasons (besides the shortcomings of my code generator, of
course) was that gcc cached the values of the global "table" in
registers, reading them from memory just once. Since that table is
accessed many times in the busiest function of the program, gcc speeds
up considerably.
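To make this concrete, here is a minimal sketch of the pattern I mean;
the names and the loop are invented for illustration, not the actual
benchmark code:

/* Hypothetical example: a non-const global read many times in a hot
   loop.  At -O2, gcc may load "scale" into a register once and reuse
   that register on every iteration instead of going back to memory. */
int scale;

long weighted_sum(const int *v, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += (long)v[i] * scale; /* the source asks for a read of
                                      scale on each iteration...      */
    return sum;                    /* ...the optimized code may read
                                      it from memory only once        */
}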
Clever, but there is a problem with that: the generated program becomes
completely thread unfriendly. It reads the value once, and even if
another thread modifies it afterwards, it keeps using the old value.
My generator always reads the value from memory, which allows fast
access to globals without needing locks everywhere.
Some optimizations contradict the "least surprise" principle, and I
think they are not worth the trouble. They could be made optional for
single-threaded programs, but that decision is better left to the
user's discretion rather than enabled by default with -O2.
"-O2" is the standard gcc's optimization level seen since years
everywhere. Maybe it would be worth considering moving that to O4 or
even O9?
Lock operations are expensive. In my view, accesses to a global should
be cached only when it is declared const, and that wasn't the case in
the program being compiled.
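In C terms, the distinction looks roughly like this (the declarations
are invented for illustration):

const int table_size = 256; /* const: the value can never change, so
                               keeping it in a register is always safe */
int wind_speed;             /* plain global: another thread may update
                               it, yet -O2 may still cache it          */
volatile int sensor_port;   /* volatile: the compiler must perform
                               every read, which is the behaviour I
                               generate for all globals                */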
Suppose (one of many possible scenarios) that you store data like wind
speed, temperature, etc. in a set of memory locations. Only the thread
that updates that table acquires a lock. All the others read the data
without taking any locks.
A program generated with this optimization reads the data once and
never again. That's a bug...
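A minimal sketch of that scenario using POSIX threads; the names, the
sensor helpers and the one-second update interval are all invented for
illustration:

#include <pthread.h>
#include <unistd.h>

double read_wind_sensor(void);  /* hypothetical helpers, defined elsewhere */
double read_temp_sensor(void);

/* Shared sensor table: one writer updates it, many readers poll it. */
struct { double wind_speed, temperature; } sensors;
pthread_mutex_t sensors_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer thread: takes the lock and stores fresh values once a second. */
void *updater(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&sensors_lock);
        sensors.wind_speed  = read_wind_sensor();
        sensors.temperature = read_temp_sensor();
        pthread_mutex_unlock(&sensors_lock);
        sleep(1);
    }
}

/* Reader: polls the table without taking the lock.  If the compiler
   hoists the load of sensors.wind_speed out of the loop, this function
   keeps testing a stale register copy and may spin forever, even after
   the writer has stored a larger value. */
void wait_for_wind(double threshold)
{
    while (sensors.wind_speed < threshold)
        ;   /* busy wait on a value the optimizer may have cached */
}

With a generator that re-reads the global on every pass, the loop exits
as soon as the writer stores the new value; with the load hoisted into
a register, it doesn't.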
Compilers are very complex, and the function I was measuring was a leaf
function, exactly the kind of function a compiler tends to optimize
aggressively. Leaf functions aren't supposed to run for long anyway, so
the caching shouldn't hurt, and in this case it was dead right, since
the speed increases notably.
What do you think?
jacob