On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark <michaeljcl...@mac.com> wrote:
> Dear GCC folk,
> I have to say that GCC’s -Os caught me by surprise after several years
> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and
> a half I have been working on RISC-V development and have been exclusively
> using GCC for RISC-V builds, and initially I was using -Os. After performing
> a qualitative/quantitative assessment I don’t believe GCC’s current -Os is
> particularly useful, at least for my needs as it doesn’t provide a
> commensurate saving in size given the sometimes quite huge drop in
> performance.
>
> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC
> frustration thread, as I think Apple’s documentation which presumably
> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps
> using -O2 as a starting point) with the idea that the current -Os is renamed
> to -Oz.
>
> -Oz
>     (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>     enables the same optimization flags that -Os uses, but -Oz also
>     enables other optimizations intended solely to reduce code size.
>     In particular, instructions that encode into fewer bytes are
>     preferred over longer instructions that execute in fewer cycles.
>     -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>     -Oz employs the same inlining limits and avoids string instructions
>     just like -Os.
>
> -Os
>     Optimize for size, but not at the expense of speed. -Os enables all
>     -O2 optimizations that do not typically increase code size.
>     However, instructions are chosen for best performance, regardless
>     of size. To optimize solely for size on Darwin, use -Oz (APPLE ONLY).
>
> I have recently been working on a benchmark suite to test a RISC-V JIT
> engine.
> I have performed all testing using GCC 7.1 as the baseline compiler,
> and during the process I have collected several performance metrics, some
> that are neutral to the JIT runtime environment. In particular I have made
> performance comparisons between -Os and -O3 on x86, along with capturing
> executable file sizes, dynamic retired instruction and micro-op counts for
> x86, dynamic retired instruction counts for RISC-V as well as dynamic
> register and instruction usage histograms for RISC-V, for both -Os and -O3.
>
> See the Optimisation section for a charted performance comparison between -O3
> and -Os. There are dozens of other plots that show the differences between
> -Os and -O3.
>
> - https://rv8.io/bench
>
> The Geomean on x86 shows a 19% performance hit for -Os vs -O3 on x86. The
> Geomean of course smooths over some pathological cases where -Os performance
> is severely degraded versus -O3 but not with significant, or commensurate
> savings in size.

First, let me put -Os usage into some perspective, with some history:
1) -Os is not useful for non-embedded users.
2) The embedded folks really need the smallest code possible and will
   usually accept the performance hit.
3) -Os was a mistake for Apple to use in the first place; they used it,
   and then GCC got better at using the string instructions on PowerPC,
   which is why -Oz was added :)
4) -Os is used heavily by the ARM/Thumb2 folks in bare-metal applications.

Comparing -O3 to -Os is not totally fair on x86 due to the many
different instructions and encodings. Compare it on ARM/Thumb2 or
MIPS/MIPS16 (or microMIPS), where size is a big issue.
I will soon need to keep overall (bare-metal) application size down
to just 256k. Microcontrollers are where -Os matters the most.

>
> I don’t currently have -O2 in my results however it seems like I should add
> -O2 to the benchmark suite. If you take a look at the web page you’ll see
> that there is already a huge amount of data given we have captured dynamic
> register frequencies and dynamic instruction frequencies for -Os and -O3. The
> tables and charts are all generated by scripts so if there is interest I
> could add -O2. I can also pretty easily perform runs with new compiler
> versions as everything is completely automated. The biggest factor is that it
> currently takes 4 hours for a full run as we run all of the benchmarks in a
> simulator to capture dynamic register usage and dynamic instruction usage.
>
> After looking at the results, one has to question the utility of -Os in its
> present form, and indeed question how it is actually used in practice, given
> the proportion of savings in executable size. After my assessment I would not
> recommend anyone to use -Os because its savings in size are not proportionate
> to the loss in performance. I feel discouraged from using it after looking at
> the results. I really don’t believe -Os makes the right trades e.g.
> reducing icache pressure can indeed lead to better performance due to
> reduced code size.

This comment does not help my application usage. It rather hurts it
and goes against what -Os is really about. It is not about reducing
icache pressure but overall application code size. I really need the
code to fit into a specific size.

Thanks,
Andrew

>
> I also wonder whether -O2 level optimisations may be a good starting point
> for a more useful -Os and how one would proceed towards selecting
> optimisations to add back to -Os to increase its usability, or rename the
> current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2
> would probably produce less shock for anyone who does quantitative
> performance analysis of -Os.
>
> In fact there are some interesting issues for the RISC-V backend given the
> assembler performs RVC compression and GCC doesn’t really see the size of
> emitted instructions. It would be an interesting backend to investigate
> improving -Os presuming that a backend can opt in to various optimisations
> for a given optimisation level. RISC-V would gain most of its size and
> runtime icache pressure reduction improvements by getting the highest
> frequency registers allocated within the 8 register set that is accessible by
> the RVC instructions. Merely controlling register allocation to favour the
> RVC accessible registers would produce the largest savings in executable
> size, and may indeed be good for performance due to reduced icache pressure.
>
> I have Dynamic Register Frequency Charts but they are not presently labeled
> or coloured whether the registers are RVC accessible registers (x8 to x15). I
> did however work on some crude ASCII histograms that indicate register access
> frequency and whether the register is RVC accessible. Ideally the register
> allocator would allocate highest frequency registers first from the RVC set.
> The register order is already correctly defined in the RISC-V backend.
> I have been experimenting with riscv_register_priority to try to nudge LRA
> but have not yet had success. riscv_register_priority currently returns 1
> for RVC registers (if the C extension is present) and 0 for regular
> registers however the loop frequency information is obviously not accurate
> enough or LRA does not completely honour the register order and priority.
> It’s likely it may not make a lot of difference on platforms with very
> regular register files. See this gist for one of the benchmarks register
> access frequency labeled as to whether the register is accessible from
> compressed instructions:
>
> - https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
>
> Question. Who uses -Os on GCC?
>
> I have for many years used -Os on macOS for Clang builds, as it has been an
> Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I
> was using FSF GCC’s -Os under the mistaken impression that it operates
> similarly to -Os in Xcode. i.e. produces code that performs well.
>
> In any case, despite my rant, I hope the quantitative states in the link
> above prove to be useful.
>
> Thanks and Regards,
> Michael.
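
For readers unfamiliar with the hook, the behaviour Michael describes for riscv_register_priority can be modelled in a few lines of standalone C. This is a sketch and not the GCC source: `sketch_register_priority`, `is_rvc_reg`, and the `have_rvc` flag are hypothetical names, and only the integer registers x8 to x15 named in the message are modelled.

```c
#include <stdbool.h>

/* Assumption from the quoted message: x8..x15 are the registers that
 * compressed (RVC) instructions can address. */
static bool is_rvc_reg(unsigned regno)
{
    return regno >= 8 && regno <= 15;
}

/* Model of the described policy: priority 1 for RVC-accessible registers
 * when the C extension is present, priority 0 for everything else. */
int sketch_register_priority(unsigned regno, bool have_rvc)
{
    return (have_rvc && is_rvc_reg(regno)) ? 1 : 0;
}
```

With only two priority levels, the allocator learns that x8 to x15 are preferred but nothing about which values are hot. That is consistent with the observation in the message that the outcome also depends on how accurate the frequency information is.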