On 28/04/16 09:47, Maxim Kuvyrkov wrote:
>> On Apr 27, 2016, at 7:26 PM, Szabolcs Nagy <szabolcs.n...@arm.com> wrote:
>>
>> with -mfentry, by default the user only has to
>> implement the fentry call (linux wants nops there, but
>> e.g. glibc could use -pg -mfentry for profiling on
>> aarch64 and the target specific details are easier to
>> document for an -m option than for something general).
>
> I don't understand your point here, could you elaborate, please?
>

if we only provide -mfentry then

- the kernel can use it (they have tools to nop patch
  the binary),

- others who don't want to fiddle with nops, just have
  the call, can also use it (e.g. user-space profiling
  cannot really use something that needs binary patching
  in case the user prefers -pg -mfentry over the current
  -pg behaviour).

- it's target specific, so the magic abi of the fentry
  call can be documented by the target according to the
  specific instruction sequence that is used.
  (with nop-padding there are psabi and compiler
  optimization interactions that may be hard to document
  in a generic way and letting the user figure it out may
  cause problems later in compiler development.. but i'm
  just speculating based on the powerpc toc handling and
  ipa-ra findings.)

>> the nop-padding is more general, but the size and
>> layout of nops and the call abi will be target
>> specific and the user will most likely need to modify
>> the binary (to get the right sequence) which needs
>> additional tooling. i don't know who might use it
>> other than linux (which already has tools to deal with
>> -mfentry).
>
> Right, but this tooling will require minimal (if any) changes
> to be adapted to nop-pad approach. If I remember correctly,
> recent versions of GCC and kernel for x86_64 generate NOPs,
> not the call sequence in the prologs when -mfentry is used.

i'm trying to find where this happens in the kernel, but
i only see scripts/recordmcount.{c,pl} which are based on
nop patching the fentry/mcount call sites. without such
call sites the tools have to be implemented differently
and the way the kernel records the call site positions
might not match the prolog-pad recording.
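
to illustrate what "nop patching the call sites" means, here is a toy model in python. it is only a sketch and not how recordmcount is implemented: the object layout, the relocation tuples and all function names are made up for the example, and the byte encodings (a 5-byte call rel32 and a 5-byte nop) assume x86_64. the point is that the tool finds its patch locations through the relocations against the fentry symbol, so without emitted calls there is nothing for this kind of scan to find.

```python
from typing import List, Tuple

# Assumed x86_64 encodings, used here only for illustration.
CALL_OPCODE = 0xE8                              # call rel32: opcode + 4 offset bytes
CALL_LEN = 5
NOP5 = bytes([0x0F, 0x1F, 0x44, 0x00, 0x00])    # one 5-byte nop instruction

# Hypothetical relocation record: (offset of the rel32 field, referenced symbol).
Reloc = Tuple[int, str]


def build_toy_object() -> Tuple[bytearray, List[Reloc]]:
    """Build a made-up text section: two profiled functions, one ordinary call."""
    text = bytearray()
    relocs: List[Reloc] = []
    for symbol in ("__fentry__", "__fentry__", "helper"):
        relocs.append((len(text) + 1, symbol))      # rel32 follows the opcode byte
        text += bytes([CALL_OPCODE, 0, 0, 0, 0])    # call with unresolved offset
        text += b"\xc3"                             # ret
    return text, relocs


def find_fentry_sites(relocs: List[Reloc], symbol: str = "__fentry__") -> List[int]:
    """Return the start offsets of call instructions that reference `symbol`."""
    return sorted(offset - 1 for offset, sym in relocs if sym == symbol)


def nop_patch(text: bytearray, sites: List[int]) -> None:
    """Overwrite each recorded call site in place with a same-sized nop."""
    for site in sites:
        if text[site] != CALL_OPCODE:
            raise ValueError(f"no call instruction at offset {site}")
        text[site:site + CALL_LEN] = NOP5


if __name__ == "__main__":
    text, relocs = build_toy_object()
    sites = find_fentry_sites(relocs)
    nop_patch(text, sites)
    print("recorded call sites:", sites)
    print("patched text:", text.hex())
```

with nop-padding the compiler would have to provide the list of sites by some other means, which is the recording mismatch i am worried about above.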