https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #2 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Bu Le from comment #0)
> Created attachment 48584 [details]
> proposed patch
>
> I would like to propose an implementation of the medium code model in
> aarch64. A prototype is attached, passed bootstrap and the regression test.
>
> Mcmodel = medium is a missing code model in aarch64 architecture, which is
> supported in x86. This code model describes a situation that some small data
> is relocated by small code model while large data is relocated by large code
> model. The official statement about medium code model in x86 ABI file page
> 34 URL : https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
>
> The key difference between x86 and aarch64 is that x86 can use lea+movabs
> instruction to implement a dynamic relocatable large code model. Currently,
> large code model in AArch64 relocate the symbol using ldr instruction, which
> can only be static linked. However, the small code mode use adrp + ldr
> instruction, which can be dynamic linked. Therefore, the medium code model
> cannot be implemented directly by simply setting a threshold. As a result a
> dynamic reloadable large code model is needed first for a functional medium
> code model.
>
> I met this problem when compiling CESM, which is a climate forecast software
> that widely used in hpc field. In some configure case, when the manipulating
> large arrays, the large code model with dynamic relocation is needed. The
> following case is abstract from CESM for this scenario.
>
> program main
> common/baz/a,b,c
> real a,b,c
> b = 1.0
> call foo()
> print*, b
> end
>
> subroutine foo()
> common/baz/a,b,c
> real a,b,c
>
> integer, parameter :: nx = 1024
> integer, parameter :: ny = 1024
> integer, parameter :: nz = 1024
> integer, parameter :: nf = 1
> real :: bar(nf,nx*ny*nz)
> real :: bar1(nf,nx*ny*nz)
> bar = 0.0
> bar1 =0.0
> b = bar(1,1024*1024*100)
> b = bar1(1,1)
>
> return
> end
>
> compile with -mcmodel=small -fPIC will give following error due to the
> access of bar1 array
> test.f90:(.text+0x28): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> test.f90:(.text+0x6c): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
>
> compile with -mcmodel=large -fPIC will give unsupported error:
> f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’
>
> As discussed in the beginning, to tackle this problem we have to solve the
> static large code model problem. My solution here is to use
> R_AARCH64_MOVW_PREL_Gx group relocation with instructions to calculate the
> current PC value.
>
> Before change (mcmodel=small) :
> adrp x0, bar1.2782
> add x0, x0, :lo12:bar1.2782
>
> After change:(mcmodel = medium proposed):
> movz x0, :prel_g3:bar1.2782
> movk x0, :prel_g2_nc:bar1.2782
> movk x0, :prel_g1_nc:bar1.2782
> movk x0, :prel_g0_nc:bar1.2782
> adr x1, .
> sub x1, x1, 0x4
> add x0, x0, x1
>
> The first 4 movk instruction will calculate the offset between bar1 and the
> last movk instruction in 64-bits, which fulfil the requirement of large code
> model(64-bit relocation).
> The adr+sub instruction will calculate the pc-address of the last movk
> instruction. By adding the offset with the PC address, bar1 can be
> dynamically located.
>
> Because this relocation is time consuming, a threshold is set to classify
> the size of the data to be relocated, like x86. The default value of the
> threshold is set to 65536, which is max relocation capability of small code
> model.
> This implementation will also need to amend the linker in binutils so that
> the 4 movk can calculated the same pc-offset of the last movk instruction.
>
> The good side of this implementation is that it can use existed relocation
> type to prototype a medium code model.
>
> The drawback of this implementation also exists.
> For start, these 4 movk instructions and the adr instruction must be combined
> in this order. No other instruction should insert in between the sequence,
> which will leads to mistake symbol address. This might impede the insn
> schedule optimizations.
> Secondly, the linker need to make the change correspondingly so that every
> mov instruction calculate the same pc-offset. For example, in my
> implementation, the first movz instruction will need to add 12 to the result
> of ":prel_g3:bar1.2782" to make up the pc-offset.
>
> I haven't figure out a suitable solution for these problems yet. You are
> most welcomed to leave your suggestions regarding these issues.

Is the main usage scenario huge arrays? If so, these could easily be allocated
via malloc at startup rather than using bss. It means an extra indirection in
some cases (to load the pointer), but it should be much more efficient than
using a large code model with all the overheads.
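
As a rough illustration of the malloc suggestion, here is a minimal C sketch.
It is not taken from the patch or from CESM: the helper names alloc_field,
init_fields, read_first and free_fields are hypothetical, and the pointers bar
and bar1 merely stand in for the static arrays of the quoted test case. The
idea is that only two pointers remain in .bss, while the large buffers live on
the heap.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the static arrays bar and bar1 of the test
   case. Only these two pointers occupy .bss; the data itself is on the
   heap, so no large static object has to be addressed directly. */
static float *bar;
static float *bar1;

/* Allocate n zero-initialised floats, or exit on failure. */
static float *alloc_field(size_t n)
{
    float *p = calloc(n, sizeof *p);
    if (p == NULL) {
        fprintf(stderr, "alloc_field: out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Call once at startup, before any code touches bar or bar1. */
static void init_fields(size_t n)
{
    bar = alloc_field(n);
    bar1 = alloc_field(n);
}

/* Counterpart of "b = bar1(1,1)": the extra indirection is the load of
   the pointer bar1 before the element access. */
static float read_first(void)
{
    return bar1[0];
}

/* Release both buffers and reset the pointers. */
static void free_fields(void)
{
    free(bar);
    free(bar1);
    bar = NULL;
    bar1 = NULL;
}
```

A program would call init_fields(nx * ny * nz) once at startup and
free_fields() at exit. In Fortran, the analogous change would presumably be to
declare the arrays allocatable and allocate them at startup, though whether
that fits CESM's structure is an assumption here.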