Re: Quantitative analysis of -Os vs -O3

Andrew Pinski Sat, 26 Aug 2017 01:39:58 -0700

On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark <michaeljcl...@mac.com> wrote:
> Dear GCC folk,
> I have to say that GCC’s -Os caught me by surprise after several years 
> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and 
> a half I have been working on RISC-V development and have been exclusively 
> using GCC for RISC-V builds, and initially I was using -Os. After performing 
> a qualitative/quantitative assessment I don’t believe GCC’s current -Os is 
> particularly useful, at least for my needs as it doesn’t provide a 
> commensurate saving in size given the sometimes quite huge drop in 
> performance.
>
> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
> frustration thread, as I think Apple’s documentation which presumably 
> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
> using -O2 as a starting point) with the idea that the current -Os is renamed 
> to -Oz.
>
>         -Oz
>                (APPLE ONLY) Optimize for size, regardless of performance.
>                -Oz enables the same optimization flags that -Os uses, but
>                -Oz also enables other optimizations intended solely to
>                reduce code size. In particular, instructions that encode
>                into fewer bytes are preferred over longer instructions
>                that execute in fewer cycles. -Oz on Darwin is very similar
>                to -Os in FSF distributions of GCC. -Oz employs the same
>                inlining limits and avoids string instructions just like
>                -Os.
>
>         -Os
>                Optimize for size, but not at the expense of speed. -Os
>                enables all -O2 optimizations that do not typically
>                increase code size. However, instructions are chosen for
>                best performance, regardless of size. To optimize solely
>                for size on Darwin, use -Oz (APPLE ONLY).
>
> I have recently been working on a benchmark suite to test a RISC-V JIT 
> engine. I have performed all testing using GCC 7.1 as the baseline compiler, 
> and during the process I have collected several performance metrics, some 
> that are neutral to the JIT runtime environment. In particular I have made 
> performance comparisons between -Os and -O3 on x86, along with capturing 
> executable file sizes, dynamic retired instruction and micro-op counts for 
> x86, dynamic retired instruction counts for RISC-V as well as dynamic 
> register and instruction usage histograms for RISC-V, for both -Os and -O3.
>
> See the Optimisation section for a charted performance comparison between -O3 
> and -Os. There are dozens of other plots that show the differences between 
> -Os and -O3.
>
>         - https://rv8.io/bench
>
> The geomean shows a 19% performance hit for -Os vs -O3 on x86. The geomean 
> of course smooths over some pathological cases where -Os performance is 
> severely degraded versus -O3 without a significant or commensurate saving 
> in size.
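
A minimal sketch of how a geometric mean can hide a pathological case. The ratios below are hypothetical -Os/-O3 runtime ratios chosen for illustration only; they are not taken from the rv8 benchmark data, and `geomean` is an illustrative helper, not part of any benchmark script.

```python
import math


def geomean(ratios):
    """Geometric mean of positive ratios, computed via logarithms."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical -Os/-O3 runtime ratios (1.0 means equal speed).
# Three benchmarks are mildly slower; one is twice as slow.
ratios = [1.05, 1.10, 1.02, 2.00]

overall = geomean(ratios)  # single summary figure
worst = max(ratios)        # the case the summary smooths over

print(f"geomean slowdown: {overall:.2f}x, worst case: {worst:.2f}x")
```

With these made-up inputs the geomean comes out at roughly 1.24x even though one benchmark is 2x slower, which is why per-benchmark charts are worth reading alongside the summary figure.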



First, let me put -Os usage into perspective, with some history:
1) -Os is not useful for non-embedded users
2) the embedded folks really need the smallest code possible and
will usually be willing to accept the performance hit
3) -Os was a mistake for Apple to use in the first place; they used it,
and then GCC got better at using the string instructions on PowerPC,
which is why -Oz was added :)
4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.

Comparing -O3 to -Os is not totally fair on x86 due to the many
different instructions and encodings.
Compare it on ARM/Thumb2 or MIPS/MIPS16 (or microMIPS) where size is a
big issue.
I will soon need to keep overall (bare-metal) application size down
to just 256k.
Micro-controllers are where -Os matters the most.

>
> I don’t currently have -O2 in my results however it seems like I should add 
> -O2 to the benchmark suite. If you take a look at the web page you’ll see 
> that there is already a huge amount of data given we have captured dynamic 
> register frequencies and dynamic instruction frequencies for -Os and -O3. The 
> tables and charts are all generated by scripts so if there is interest I 
> could add -O2. I can also pretty easily perform runs with new compiler 
> versions as everything is completely automated. The biggest factor is that it 
> currently takes 4 hours for a full run as we run all of the benchmarks in a 
> simulator to capture dynamic register usage and dynamic instruction usage.
>
> After looking at the results, one has to question the utility of -Os in its 
> present form, and indeed question how it is actually used in practice, given 
> the proportion of savings in executable size. After my assessment I would not 
> recommend that anyone use -Os because its savings in size are not 
> proportionate to the loss in performance. I feel discouraged from using it 
> after looking at the results. I really don’t believe -Os makes the right 
> trade-offs; e.g. reducing icache pressure through smaller code can indeed 
> lead to better performance.

This comment does not help my application usage.  It rather hurts it
and goes against what -Os is really about.  It is not about reducing
icache pressure but overall application code size.  I really need the
code to fit into a specific size.

Thanks,
Andrew

>
> I also wonder whether -O2 level optimisations may be a good starting point 
> for a more useful -Os and how one would proceed towards selecting 
> optimisations to add back to -Os to increase its usability, or rename the 
> current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2 
> would probably produce less shock for anyone who does quantitative 
> performance analysis of -Os.
>
> In fact there are some interesting issues for the RISC-V backend given the 
> assembler performs RVC compression and GCC doesn’t really see the size of 
> emitted instructions. It would be an interesting backend to investigate 
> improving -Os presuming that a backend can opt in to various optimisations 
> for a given optimisation level. RISC-V would gain most of its size and 
> runtime icache pressure reduction improvements by getting the highest 
> frequency registers allocated within the 8 register set that is accessible by 
> the RVC instructions. Merely controlling register allocation to favour the 
> RVC accessible registers would produce the largest savings in executable 
> size, and may indeed be good for performance due to reduced icache pressure.
>
> I have Dynamic Register Frequency Charts but they are not presently labeled 
> or coloured to show whether the registers are RVC accessible (x8 to x15). I 
> did however work on some crude ASCII histograms that indicate register access 
> frequency and whether the register is RVC accessible. Ideally the register 
> allocator would allocate the highest frequency registers first from the RVC 
> set. The register order is already correctly defined in the RISC-V backend. I 
> have been experimenting with riscv_register_priority to try to nudge LRA but 
> have not yet had success. riscv_register_priority currently returns 1 for RVC 
> registers (if the C extension is present) and 0 for regular registers; 
> however, either the loop frequency information is not accurate enough or LRA 
> does not completely honour the register order and priority. It likely would 
> not make much difference on platforms with very regular register files. See 
> this gist for one benchmark’s register access frequencies, labeled as to 
> whether each register is accessible from compressed instructions:
>
> - https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
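
A minimal sketch of the kind of check described above: given a dynamic register access histogram, compute the share of accesses that land in the RVC-accessible set (x8 to x15, as stated in the message). This is a Python model for illustration, not the GCC backend code; `rvc_share`, `register_priority`, and the sample counts are hypothetical, and `register_priority` only mirrors the behaviour the message describes (1 for RVC registers when the C extension is present, 0 otherwise).

```python
# RVC-accessible registers, x8..x15, per the message above.
RVC_REGS = {f"x{i}" for i in range(8, 16)}


def register_priority(reg, has_c_extension=True):
    """Model of the described priority: 1 for RVC registers, else 0."""
    return 1 if has_c_extension and reg in RVC_REGS else 0


def rvc_share(freq):
    """Fraction of dynamic register accesses that hit RVC registers."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    hits = sum(count for reg, count in freq.items() if reg in RVC_REGS)
    return hits / total


# Hypothetical access counts, not taken from the gist.
sample = {"x5": 400, "x8": 250, "x10": 150, "x18": 200}

# Print registers by descending frequency, marking RVC-accessible ones.
for reg, count in sorted(sample.items(), key=lambda kv: -kv[1]):
    mark = "RVC" if register_priority(reg) else "   "
    print(f"{reg:>4} {mark} {count}")

print(f"RVC share: {rvc_share(sample):.0%}")
```

A low share for the hottest registers would suggest the allocator is not placing high-frequency values in the compressed-accessible set, which is the situation the message is trying to detect.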
>
> Question. Who uses -Os on GCC?
>
> I have for many years used -Os on macOS for Clang builds, as it has been an 
> Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I 
> was using FSF GCC’s -Os under the mistaken impression that it operates 
> similarly to -Os in Xcode. i.e. produces code that performs well.
>
> In any case, despite my rant, I hope the quantitative stats in the link 
> above prove to be useful.
>
> Thanks and Regards,
> Michael.
