https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
Bug ID:           114010
Summary:          Unwanted effects of using SSA free lists.
Product:          gcc
Version:          14.0
Status:           UNCONFIRMED
Severity:         normal
Priority:         P3
Component:        tree-optimization
Assignee:         unassigned at gcc dot gnu.org
Reporter:         manolis.tsamis at vrull dot eu
Target Milestone: ---

Created attachment 57469
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57469&action=edit
Testcase and example compilations on AArch64

After some experiments, I have observed a number of unwanted effects caused by re-using SSA names through free lists. Specifically, the recycled SSA names have unpredictable ordering and the compiler is affected by that ordering. The effects include:

1) Worse code generation from passes that explicitly or implicitly depend on SSA name ordering. The most important case I have observed is the vectorizer, which can fail or create inferior code with more shuffles/moves when the SSA names are not monotonically increasing.

2) Missed canonicalization opportunities. Functions that do the same thing but have different SSA ordering will usually diverge after various optimization passes, and the generated assembly differs far more than it should. Since the SSA free lists are affected by things like moving dead code around, this also makes comparing GIMPLE or assembly output much harder than necessary.

3) Because free lists persist across optimization passes, their hidden state can affect optimizations/codegen in unexpected ways. I have seen two similar source files generate exactly the same GIMPLE code up to some optimization pass and then diverge completely due to different free lists. Getting something completely different starting from the same GIMPLE was mildly surprising at first.

4) Although I haven't measured anything significant on this front, in theory, when a large chunk of names is freed (e.g. by dead code elimination), the pending freed names will cost performance and memory because the live versions are not compacted.
This mostly affects all the bitmaps that use SSA_NAME_VERSION as a key.

The easiest way to experiment with this behaviour is code like the following:

  ...some code...
  if (condition that will be eliminated, but only later)
    {
      ...some function call that will be inlined now...
      ...or some code...
    }
  ...some other code...

I have crafted and attached a testcase (canon.cpp) that showcases some of these behaviours at different optimization levels. As part of the experiments I used a modified GCC that calls `release_free_names_and_compact_live_names` after every optimization pass, which is enough to guarantee that the generated names are in monotonically increasing order. Together with the testcase I have attached the AArch64 assembly generated in 8 different configurations: with or without the dead code in the source file, with or without free-name canonicalization, and at O2 or at O3.

Observations at O2:

As per the notes above, the code in with-dead-code.O2.txt is quite different from without-dead-code.O2.txt (different instruction mix, different instruction count, different register allocation, etc.). This is due to the freed names causing a different SSA name allocation. In contrast, with-dead-code.canonicalized.O2.txt and without-dead-code.canonicalized.O2.txt are identical except for the stack offsets chosen for the variables. This is easy to see because they diff cleanly, both at the GIMPLE and at the assembly level.

Observations at O3:

At O3 the assembly no longer diffs cleanly in either case, but it is interesting to observe the better vectorization achieved thanks to the better ordering of names (in almost all places where there is vectorized code). E.g.
non-canonicalized:

.L37:
        ld4     {v24.16b - v27.16b}, [x1], 64
        shl     v23.16b, v24.16b, 5
        add     v28.16b, v23.16b, v24.16b
        mov     v23.16b, v18.16b
        mla     v23.16b, v17.16b, v25.16b
        mov     v29.16b, v23.16b
        mov     v23.16b, v20.16b
        mla     v23.16b, v19.16b, v26.16b
        mov     v30.16b, v23.16b
        mov     v23.16b, v22.16b
        mla     v23.16b, v21.16b, v27.16b
        mov     v31.16b, v23.16b
        st4     {v28.16b - v31.16b}, [x0], 64
        cmp     x3, x1
        bne     .L37

Canonicalized assembly for the same code, 3 instructions fewer:

.L37:
        ld4     {v28.16b - v31.16b}, [x1], 64
        mov     v26.16b, v23.16b
        mov     v27.16b, v21.16b
        shl     v25.16b, v28.16b, 5
        mla     v26.16b, v24.16b, v29.16b
        mla     v27.16b, v22.16b, v30.16b
        add     v25.16b, v25.16b, v28.16b
        mov     v28.16b, v19.16b
        mla     v28.16b, v20.16b, v31.16b
        st4     {v25.16b - v28.16b}, [x0], 64
        cmp     x1, x3
        bne     .L37

E.g. another loop, non-canonicalized names:

.L120:
        ldr     q30, [x0], 16
        movi    v29.2s, 0
        ld2     {v26.16b - v27.16b}, [x4], 32
        movi    v25.4s, 0
        zip1    v29.16b, v30.16b, v29.16b
        zip2    v30.16b, v30.16b, v25.16b
        umlal   v29.8h, v26.8b, v28.8b
        umlal2  v30.8h, v26.16b, v28.16b
        uaddw   v31.4s, v31.4s, v29.4h
        uaddw2  v31.4s, v31.4s, v29.8h
        uaddw   v31.4s, v31.4s, v30.4h
        uaddw2  v31.4s, v31.4s, v30.8h
        cmp     x5, x0
        bne     .L120

Canonicalized assembly for the same loop, 2 instructions fewer (no use of zip1/zip2):

.L120:
        ld2     {v30.16b - v31.16b}, [x4], 32
        ldr     q28, [x0], 16
        umull   v31.8h, v30.8b, v27.8b
        umull2  v30.8h, v30.16b, v27.16b
        uaddw   v31.8h, v31.8h, v28.8b
        uaddw2  v30.8h, v30.8h, v28.16b
        uaddw   v29.4s, v29.4s, v31.4h
        uaddw2  v29.4s, v29.4s, v31.8h
        uaddw   v29.4s, v29.4s, v30.4h
        uaddw2  v29.4s, v29.4s, v30.8h
        cmp     x0, x5
        bne     .L120

I haven't found any compile-time regressions from compacting SSA names after each optimization pass. Although we could search for every place that is affected by name ordering and try to fix each one up, compacting SSA names and/or sorting the free lists looks like a good start.