When running some tests with my ARM64 code generator, I saw the performance of my software drop from 75% of gcc's speed to just 60%.

What was happening?

One of the reasons (besides the shortcomings of my code generator, of course) was that gcc cached the value of the global "table" in a register, reading it from memory just once. Since that global is accessed many times in the busiest function of the program, gcc's output speeds up considerably.
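A minimal sketch of that transformation (the names here are mine for illustration, not from the actual program):

    extern int limit;           /* hypothetical global, like "table" */

    int count_below(const int *v, int n)
    {
        int c = 0;
        for (int i = 0; i < n; i++)
            if (v[i] < limit)   /* gcc hoists this load: 'limit' is
                                   read from memory once, then kept
                                   in a register for the whole loop */
                c++;
        return c;
    }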

Clever, but there is a problem with that: the generated program becomes completely thread-unfriendly. It reads the value once, and then, even if another thread modifies it, it keeps using the old value.
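The classic failure mode looks like this (again a made-up example):

    int done = 0;               /* flag another thread will set */

    void wait_until_done(void)
    {
        /* With the load hoisted, 'done' is read once into a
           register and the loop can spin forever, never seeing
           the other thread's store. */
        while (!done)
            ;
    }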

My generated code always reads the value from memory, allowing fast access to globals without needing many locks.
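For what it's worth, standard C does give the programmer two ways to force the read each time; this is a sketch of those, not something either compiler applies automatically:

    #include <stdatomic.h>

    volatile int flag_v;        /* every access re-reads memory      */
    _Atomic int flag_a;         /* C11 atomic: reads are race-free
                                   and cannot be cached away         */

    int read_flag(void)
    {
        return atomic_load(&flag_a);    /* fresh load on every call */
    }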

Some optimizations contradict the principle of least surprise, and I think they are not worth the effort. They could be made optional for single-threaded programs, but that decision is better left to the user's discretion rather than enabled by default with -O2.

"-O2" is the standard gcc's optimization level seen since years everywhere. Maybe it would be worth considering moving that to O4 or even O9?

Lock operations are expensive. Access to globals can safely be cached only when they are declared const, and that wasn't the case in the program being compiled.

Suppose (one of many possible scenarios) that you store data like wind speed, temperature, etc. in a set of memory locations. Only the thread that updates that table acquires a lock; all the others read the data without any locking.
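A sketch of that scenario, with invented names and pthreads for the writer's lock:

    #include <pthread.h>

    /* Hypothetical shared table of measurements. */
    struct weather { double wind_speed; double temperature; };
    struct weather current;
    pthread_mutex_t wlock = PTHREAD_MUTEX_INITIALIZER;

    /* Writer: the only thread that takes the lock. */
    void update(double wind, double temp)
    {
        pthread_mutex_lock(&wlock);
        current.wind_speed  = wind;
        current.temperature = temp;
        pthread_mutex_unlock(&wlock);
    }

    /* Readers: no locks. A reader that loops over this data is
       exactly where a value cached in a register goes stale. */
    double wind_now(void) { return current.wind_speed; }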

A program generated with this optimization reads the data just once. That's a bug...

Compilers are very complex, and the function in question was a leaf function: exactly the kind of hot function you tend to optimize aggressively. Leaf functions aren't supposed to run for long anyway, so the caching can't hurt much, and in this case it was dead right, since speed increases notably.

What do you think?


jacob


