[RFC] Kernel livepatching support in GCC

2015-05-28 Thread Maxim Kuvyrkov
Hi,

Akashi-san and I have been discussing required GCC changes to make kernel's 
livepatching work for AArch64 and other architectures.  At the moment 
livepatching is supported for x86[_64] using the following options: "-pg 
-mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add 
several NOPs at the very beginning of each function, and make a section with 
addresses of all those NOP pads".
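
For concreteness, this is roughly what that option combination does to a function (paraphrasing the GCC option documentation; exact instruction encodings and pad sizes are target details):

/* Compile with:  gcc -pg -mfentry -mrecord-mcount -mnop-mcount -c foo.c
   -pg -mfentry      : a "call __fentry__" becomes the first instruction
   -mnop-mcount      : that call is emitted as a same-sized NOP instead
   -mrecord-mcount   : the call/NOP address is recorded in __mcount_loc
   The kernel can later rewrite the NOP pad into a call to its hook.
   Nothing in the source is special; every function gets the treatment.  */
void traced_function (void)
{
}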

The above long-ish list of options is a historical artifact of how livepatching 
support evolved for x86.  The end result is that for livepatching (or ftrace, 
or possible future kernel features) to work, the compiler needs to generate a little 
bit of empty code space at the beginning of each function.  The kernel can later 
use that space to insert call sequences for various hooks.

Our proposal is that instead of adding -mfentry/-mnop-mcount/-mrecord-mcount 
options to other architectures, we should implement a target-independent option 
-fprolog-pad=N, which will generate a pad of N nops at the beginning of each 
function and add a section entry describing the pad similar to -mrecord-mcount 
[1].

Since adding NOPs is much less architecture-specific than outputting call 
instruction sequences, this option can be handled in a target-independent way 
at least for some/most architectures.
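
To make the intended use concrete, here is a minimal sketch of how a consumer (the kernel, say) could walk such a section.  The section and symbol names below are placeholders, since the proposal does not fix them yet; the sketch only relies on the linker's usual __start_/__stop_ symbols for sections whose names are valid C identifiers.

/* Sketch only: assumes -fprolog-pad=N records one address per padded
   function into a hypothetical "__prolog_pad_loc" section, mirroring
   what -mrecord-mcount does with __mcount_loc.  */
extern unsigned long __start___prolog_pad_loc[];
extern unsigned long __stop___prolog_pad_loc[];

static void
for_each_prolog_pad (void (*patch) (unsigned long pad_address))
{
  unsigned long *p;

  for (p = __start___prolog_pad_loc; p < __stop___prolog_pad_loc; p++)
    patch (*p);
}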

Comments?

As I found out today, the team from Huawei has implemented [2], which follows 
the x86 example of the -mfentry option generating a hard-coded call sequence.  I hope 
that this proposal can be easily incorporated into their work since most of the 
livepatching changes are in the kernel.

[1] Technically, generating a NOP pad and adding a section entry in 
.__mcount_loc are two separate actions, so we may want to have a 
-fprolog-pad-record option.  My instinct is to stick with a single option for 
now, since we can always add more later.

[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html

--
Maxim Kuvyrkov
www.linaro.org





Re: [RFC] Kernel livepatching support in GCC

2015-05-28 Thread Richard Biener
On May 28, 2015 10:39:27 AM GMT+02:00, Maxim Kuvyrkov wrote:
>Hi,
>
>Akashi-san and I have been discussing required GCC changes to make
>kernel's livepatching work for AArch64 and other architectures.  At the
>moment livepatching is supported for x86[_64] using the following
>options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is
>geek-speak for "please add several NOPs at the very beginning of each
>function, and make a section with addresses of all those NOP pads".
>
>The above long-ish list of options is a historical artifact of how
>livepatching support evolved for x86.  The end result is that for
>livepatching (or ftrace, or possible future kernel features) to work
>compiler needs to generate a little bit of empty code space at the
>beginning of each function.  Kernel can later use that space to insert
>call sequences for various hooks.
>
>Our proposal is that instead of adding
>-mfentry/-mnop-count/-mrecord-mcount options to other architectures, we
>should implement a target-independent option -fprolog-pad=N, which will
>generate a pad of N nops at the beginning of each function and add a
>section entry describing the pad similar to -mrecord-mcount [1].
>
>Since adding NOPs is much less architecture-specific then outputting
>call instruction sequences, this option can be handled in a
>target-independent way at least for some/most architectures.
>
>Comments?

Maybe follow s390 -mhotpatch instead?

>As I found out today, the team from Huawei has implemented [2], which
>follows x86 example of -mfentry option generating a hard-coded call
>sequence.  I hope that this proposal can be easily incorporated into
>their work since most of the livepatching changes are in the kernel.
>
>[1] Technically, generating a NOP pad and adding a section entry in
>.__mcount_loc are two separate actions, so we may want to have a
>-fprolog-pad-record option.  My instinct is to stick with a single
>option for now, since we can always add more later.
>
>[2]
>http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/346905.html
>
>--
>Maxim Kuvyrkov
>www.linaro.org




Re: [RFC] Kernel livepatching support in GCC

2015-05-28 Thread Maxim Kuvyrkov
> On May 28, 2015, at 11:59 AM, Richard Biener wrote:
> 
> On May 28, 2015 10:39:27 AM GMT+02:00, Maxim Kuvyrkov wrote:
>> Hi,
>> 
>> Akashi-san and I have been discussing required GCC changes to make
>> kernel's livepatching work for AArch64 and other architectures.  At the
>> moment livepatching is supported for x86[_64] using the following
>> options: "-pg -mfentry -mrecord-mcount -mnop-mcount" which is
>> geek-speak for "please add several NOPs at the very beginning of each
>> function, and make a section with addresses of all those NOP pads".
>> 
>> The above long-ish list of options is a historical artifact of how
>> livepatching support evolved for x86.  The end result is that for
>> livepatching (or ftrace, or possible future kernel features) to work
>> compiler needs to generate a little bit of empty code space at the
>> beginning of each function.  Kernel can later use that space to insert
>> call sequences for various hooks.
>> 
>> Our proposal is that instead of adding
>> -mfentry/-mnop-count/-mrecord-mcount options to other architectures, we
>> should implement a target-independent option -fprolog-pad=N, which will
>> generate a pad of N nops at the beginning of each function and add a
>> section entry describing the pad similar to -mrecord-mcount [1].
>> 
>> Since adding NOPs is much less architecture-specific then outputting
>> call instruction sequences, this option can be handled in a
>> target-independent way at least for some/most architectures.
>> 
>> Comments?
> 
> Maybe follow s390 -mhotpatch instead?

Regarding implementation of the option, it will follow what s390 is doing with 
function attributes to mark which functions to apply nop-treatment to (using 
attributes will avoid problems with [coming] LTO builds of the kernel).  The 
new option will set the value of the attribute on all functions in the current 
compilation unit, and then NOPs will be generated from the attribute 
specification.
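
For reference, the s390 markup this would generalize looks roughly like the following; the two halfword arguments are as documented for the s390 hotpatch attribute, and whether a generic option would reuse that exact spelling is an open question:

/* s390-specific today: reserve 2 halfwords of NOPs before the function
   label and 2 halfwords after it.  A generic option would effectively
   apply an attribute like this to every function in the TU.  */
__attribute__ ((hotpatch (2, 2)))
void patchable_function (void)
{
}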

On the other hand, s390 does not generate a section of descriptor entries for 
the NOP pads, which seems like a useful (or even necessary) option.  A more-or-less 
generic implementation should, therefore, combine s390's attribute approach to 
annotating functions with x86's approach of providing information about the NOP 
entries in an ELF section.  Or can we record the value of a function attribute in 
ELF in a generic way?
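
As a strawman for the recording half, the effect can be approximated today in plain GNU C, which at least shows what a generic descriptor entry might contain; the struct layout and section name below are made up:

/* Hand-rolled approximation, not a compiler feature: one descriptor per
   padded function, collected in a dedicated section so a tool (or the
   kernel) can enumerate the pads.  */
struct prolog_pad_desc
{
  void (*function) (void);  /* function carrying the NOP pad */
  unsigned int pad_insns;   /* number of NOPs in the pad */
};

#define RECORD_PROLOG_PAD(fn, n)                                \
  static const struct prolog_pad_desc __pad_desc_##fn           \
    __attribute__ ((section ("__prolog_pad_loc"), used)) = { fn, n }

void some_function (void) { }
RECORD_PROLOG_PAD (some_function, 2);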

Whatever the specifics, the implementation of livepatch support should be decoupled 
from the -pg/mcount dependency, as I don't see any real need to overload mcount 
with livepatching stuff.

--
Maxim Kuvyrkov
www.linaro.org




Re: Relocations to use when eliding plts

2015-05-28 Thread H.J. Lu
Adding ia32/x86-64 psABI.

On Wed, May 27, 2015 at 5:44 PM, H.J. Lu  wrote:
> On Wed, May 27, 2015 at 1:03 PM, Richard Henderson  wrote:
>> There's one problem with the couple of patches that I've seen go by wrt 
>> eliding
>> PLTs with -z now, and relaxing inlined PLTs (aka -fno-plt):
>>
>> They're currently using the same relocations used by data, and thus the 
>> linker
>> and dynamic linker must ensure that pointer equality is maintained.  Which
>> results in branch-to-branch-(to-branch) situations.
>>
>> E.g. the attached test case, in which main has a plt entry for function A in
>> a.so, and the function B in b.so calls A.
>>
>> $ LD_BIND_NOW=1 gdb main
>> ...
>> (gdb) b b
>> Breakpoint 1 at 0x400540
>> (gdb) run
>> Starting program: /home/rth/x/main
>> Breakpoint 1, b () at b.c:2
>> 2   void b(void) { a(); }
>> (gdb) si
>> 2   void b(void) { a(); }
>> => 0x77bf75f4 :callq  0x77bf74e0
>> (gdb)
>> 0x77bf74e0 in ?? () from ./b.so
>> => 0x77bf74e0:  jmpq   *0x20034a(%rip)# 0x77df7830
>> (gdb)
>> 0x00400560 in a@plt ()
>> => 0x400560 :jmpq   *0x20057a(%rip)# 0x600ae0
>> (gdb)
>> a () at a.c:2
>> 2   void a() { printf("Hello, World!\n"); }
>> => 0x77df95f0 :  sub$0x8,%rsp
>>
>>
>> If we use -fno-plt, we eliminate the first callq, but do still have two
>> consecutive jmpq's.

You get consecutive jmpq's because x86 PLT entry is used as the
canonical function address.  If you compile main with -fno-plt -fPIE, you
get:

(gdb) b b
Breakpoint 1 at 0x77bf75f0: file b.c, line 4.
(gdb) r
Starting program: /export/home/hjl/bugs/binutils/pr18458/main

Breakpoint 1, b () at b.c:4
4 {
(gdb) si
5  a();
(gdb)
a () at a.c:4
4 {
(gdb)

>> If seems to me that we ought to have different relocations when we're only
>> going to use a pointer for branching, and when we need a pointer to be
>> canonicalized for pointer comparisons.
>>
>> In the linked image, we already have these: R_X86_64_GLOB_DAT vs
>> R_X86_64_JUMP_SLOT.  Namely, GLOB_DAT implies "data" (and therefore pointer
>> equality), while JUMP_SLOT implies "code" (and therefore we can resolve past
>> plt stubs in the main executable).
>>
>> Which means that HJ's patch of May 16 (git hash 25070364), is less than 
>> ideal.
>>  I do like the smaller PLT entries, but I don't like the fact that it now 
>> emits
>> GLOB_DAT for the relocations instead of JUMP_SLOT.
>
> ld.so just does whatever is arranged by ld.  I am not sure changing ld.so
> is a good idea.  I don't know what kind of optimization we can do when a function
> is called and its address is taken.
>
>>
>> In the relocatable image, when we're talking about -fno-plt, we should think
>> about what relocation we'd like to emit.  Yes, the existing R_X86_64_GOTPCREL
>> works with existing toolchains, and there's something to be said for that.
>> However, if we're talking about adding a new relocation for relaxing an
>> indirect call via GOTPCREL, then:
>>
>> If we want -fno-plt to be able to hoist function addresses, then we're going 
>> to
>> want the address that we load for the call to also not be subject to possible
>> jump-to-jump.
>>
>> Unless we want the linker to do an unreasonable amount of x86 code 
>> examination
>> in order to determine mov vs call for relaxation, we need two different
>> relocations (preferably using the same assembler mnemonic, and thus the 
>> correct
>> relocation is enforced by the assembler).
>>
>> On the users/hjl/relax branch (and posted on list somewhere), the new
>> relocation is called R_X86_64_RELAX_GOTPCREL.  I'm not keen on that "relax"
>> name, despite that being exactly what it's for.
>>
>> I suggest R_X86_64_GOTPLTPCREL_{CALL,LOAD} for the two relocation names.  
>> That
>> is, the address is in the .got.plt section, it's a pc-relative relocation, 
>> and
>> it's being used by a call or load (mov) insn.
>
> Since it is used for indirect call, how about R_X86_64_INBR_GOTPCREL?
>
> I updated users/hjl/relax branch to convert the relocation in *foo@GOTPCREL(%rip)
> from R_X86_64_GOTPCREL to R_X86_64_RELAX_GOTPCREL so that
> existing assembly code works automatically with a new binutils.
>
>> With those two, we can fairly easily relax call/jmp to direct branches, and 
>> mov
>> to lea.  Yes, LTO can perform the same optimization, but I'll also agree that
>> there are many projects for which LTO is both overkill and unworkable.
>>
>> This does leave open other optimization questions, mostly around weak
>> functions.  Consider constructs like
>>
>> if (foo) foo();
>>
>> Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL, accepting 
>> the
>> possibility (not certainty) of jump-to-jump but definitely avoiding a 
>> separate
>> load insn and the latency implied by that?
>>
>>
>> Comments?

Here is the new proposal to add R_X86_64_INDBR_GOTPCREL and
R_386_INDBR_GOT32.  Compared with the last proposal, I used
_INDBR_ instead of _RELAX_, and I also used the same assembler
mnemonic.
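
To recap the constraint that separates the two relocation flavours, this is the kind of code that forces a canonical function address; the file split and names are purely illustrative:

/* b.c, built into b.so: takes the address of a() and also calls it.  */
extern void a (void);
void *a_addr;
void b (void) { a_addr = (void *) a; a (); }

/* main.c, non-PIC, linked against a.so and b.so: also takes the address
   of a().  ELF requires both address-of expressions to compare equal,
   which is why the executable's PLT entry becomes the canonical address
   of a() and why calls from b.so can end up hopping through it.  */
extern void a (void);
extern void *a_addr;
int main (void) { return (void *) a == a_addr ? 0 : 1; }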

Better info for combine results in worse code generated

2015-05-28 Thread Alan Modra
It's really annoying when you fix a combine bug and get worse code..

The following was part of a larger patch.  What this part does is
simply not throw away useful info about non-zero bits.  The upper bits
that "we don't know anything about" are not those indicated by
last_set_mode, because nonzero_bits_mode may have been used to
calculate last_set_nonzero_bits.  See record_value_for_reg.

Index: combine.c
===
--- combine.c   (revision 223725)
+++ combine.c   (working copy)
@@ -9832,10 +9832,16 @@ reg_nonzero_bits_for_combine (const_rtx x, machine
   REGNO (x)
 {
   unsigned HOST_WIDE_INT mask = rsp->last_set_nonzero_bits;
+  machine_mode mask_mode = rsp->last_set_mode;
 
-  if (GET_MODE_PRECISION (rsp->last_set_mode) < GET_MODE_PRECISION (mode))
+  /* We possibly calculated last_set_nonzero_bits in a wider mode.  */
+  if (GET_MODE_CLASS (mask_mode) == MODE_INT
+ && GET_MODE_PRECISION (mask_mode) < HOST_BITS_PER_WIDE_INT)
+   mask_mode = nonzero_bits_mode;
+
+  if (GET_MODE_PRECISION (mask_mode) < GET_MODE_PRECISION (mode))
/* We don't know anything about the upper bits.  */
-   mask |= GET_MODE_MASK (mode) ^ GET_MODE_MASK (rsp->last_set_mode);
+   mask |= GET_MODE_MASK (mode) ^ GET_MODE_MASK (mask_mode);
 
   *nonzero &= mask;
   return NULL;

The problem is that the following testcase on powerpc64le now
generates worse code.

void foo (signed char *p) { if (*p != 0) *p = 1; }

        before              after
foo:                    foo:
        lbz 9,0(3)          lbz 9,0(3)
        cmpwi 7,9,0         andi. 10,9,0xff
        beqlr 7             beqlr 0
        li 9,1              li 9,1
        stb 9,0(3)          stb 9,0(3)
        blr                 blr

That record form andi. is slower on many processors, and restricted to
setting cr0.


This is what combine sees at the start of the function.

(insn 6 3 7 2 (set (reg:QI 158 [ *p_3(D) ])
(mem:QI (reg/v/f:DI 156 [ p ]) [0 *p_3(D)+0 S1 A8])) byte.c:1 444 {*movqi_internal}
 (nil))
(insn 7 6 8 2 (set (reg:SI 157 [ *p_3(D) ])
(sign_extend:SI (reg:QI 158 [ *p_3(D) ]))) byte.c:1 30 {extendqisi2}
 (expr_list:REG_DEAD (reg:QI 158 [ *p_3(D) ])
(nil)))
(insn 8 7 9 2 (set (reg:CC 159)
(compare:CC (reg:SI 157 [ *p_3(D) ])
(const_int 0 [0]))) byte.c:1 690 {*cmpsi_internal1}
 (expr_list:REG_DEAD (reg:SI 157 [ *p_3(D) ])
(nil)))
(jump_insn 9 8 10 2 (set (pc)
(if_then_else (eq (reg:CC 159)
(const_int 0 [0]))
(label_ref:DI 15)
(pc))) byte.c:1 723 {*rs6000.md:11429}
 (expr_list:REG_DEAD (reg:CC 159)
(int_list:REG_BR_PROB 3900 (nil)))
 -> 15)

And here's where things go wrong.

Trying 7 -> 8:
Successfully matched this instruction:
(set (reg:CC 159)
(compare:CC (zero_extend:SI (reg:QI 158 [ *p_3(D) ]))
(const_int 0 [0])))
allowing combination of insns 7 and 8
original costs 4 + 4 = 8
replacement cost 8
deferring deletion of insn with uid = 7.
modifying insn i3 8: {r159:CC=cmp(zero_extend(r158:QI),0);clobber scratch;}
  REG_DEAD r158:QI
deferring rescan insn with uid = 8.

Trying 6 -> 8:
Failed to match this instruction:
(parallel [
(set (reg:CC 159)
(compare:CC (subreg:SI (mem:QI (reg/v/f:DI 156 [ p ]) [0 *p_3(D)+0 S1 A8]) 0)
(const_int 0 [0])))
(clobber (scratch:SI))
])
Failed to match this instruction:
(set (reg:CC 159)
(compare:CC (subreg:SI (mem:QI (reg/v/f:DI 156 [ p ]) [0 *p_3(D)+0 S1 A8]) 0)
(const_int 0 [0])))


With an unpatched compiler the 7 -> 8 combination doesn't happen,
because the less accurate zero-bits info doesn't allow the sign_extend
to be removed.  Instead, you get

Trying 7 -> 8:
Failed to match this instruction:
(set (reg:CC 159)
(compare:CC (reg:QI 158 [ *p_3(D) ])
(const_int 0 [0])))

Trying 6, 7 -> 8:
Failed to match this instruction:
(set (reg:CC 159)
(compare:CC (zero_extend:SI (mem:QI (reg/v/f:DI 156 [ p ]) [0 *p_3(D)+0 S1 A8]))
(const_int 0 [0])))
Successfully matched this instruction:
(set (reg:SI 157 [ *p_3(D) ])
(zero_extend:SI (mem:QI (reg/v/f:DI 156 [ p ]) [0 *p_3(D)+0 S1 A8])))
Successfully matched this instruction:
(set (reg:CC 159)
(compare:CC (reg:SI 157 [ *p_3(D) ])
(const_int 0 [0])))
allowing combination of insns 6, 7 and 8
original costs 8 + 4 + 4 = 16
replacement costs 8 + 4 = 12
deferring deletion of insn with uid = 6.
modifying insn i2 7: r157:SI=zero_extend([r156:DI])
deferring rescan insn with uid = 7.
modifying insn i3 8: r159:CC=cmp(r157:SI,0)
  REG_DEAD r157:SI
deferring rescan insn with uid = 8.

So, a three insn combine that's split to two insns.  Improving the
non-zero bit info loses the opportunity to try this three insn
combination, because we've already reduced down to two insns.

Does anyone have any clues as to how I might fix this?  I'm not keen
on adding an insn_and_split to rs6000.md to recognize the 6 -> 8
combination, because one of the aims of the wider patch I was working
on was to remove patterns like rotlsi3_64, ashlsi3_64, lshrsi3_64 and
ashrsi3_64.  Adding patterns in order to remove others doesn't sound
like much of a win.

Re: Better info for combine results in worse code generated

2015-05-28 Thread David Edelsohn
On Thu, May 28, 2015 at 10:39 AM, Alan Modra  wrote:

> The problem is that the following testcase on powerpc64le now
> generates worse code.
>
> void foo (signed char *p) { if (*p != 0) *p = 1; }
>
>         before              after
> foo:                    foo:
>         lbz 9,0(3)          lbz 9,0(3)
>         cmpwi 7,9,0         andi. 10,9,0xff
>         beqlr 7             beqlr 0
>         li 9,1              li 9,1
>         stb 9,0(3)          stb 9,0(3)
>         blr                 blr
>
> That record form andi. is slower on many processors, and restricted to
> setting cr0.

> allowing combination of insns 6, 7 and 8
> original costs 8 + 4 + 4 = 16
> replacement costs 8 + 4 = 12

> Does anyone have any clues as to how I might fix this?  I'm not keen
> on adding an insn_and_split to rs6000.md to recognize the 6 -> 8
> combination, because one of the aims of the wider patch I was working
> on was to remove patterns like rotlsi3_64, ashlsi3_64, lshrsi3_64 and
> ashrsi3_64.  Adding patterns in order to remove others doesn't sound
> like much of a win.

This seems like a problem with the cost model.  Rc instructions are
more expensive and should be represented as such in rtx_costs.

- David


Re: Better info for combine results in worse code generated

2015-05-28 Thread Alan Modra
On Thu, May 28, 2015 at 10:47:53AM -0400, David Edelsohn wrote:
> This seems like a problem with the cost model.  Rc instructions are
> more expensive and should be represented as such in rtx_costs.

The record instructions do have a higher cost (8 vs. 4 for normal
insns).  If the cost is increased I don't think you'll see them
generated at all, which would fix my testcase but probably regress
others.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Relocations to use when eliding plts

2015-05-28 Thread Richard Henderson
On 05/28/2015 04:27 AM, H.J. Lu wrote:
> You get consecutive jmpq's because x86 PLT entry is used as the
> canonical function address.  If you compile main with -fno-plt -fPIE, you
> get:

Well, duh.  If the main executable has no PLTs, they aren't used as the
canonical function address.  Surely you aren't proposing that as a solution?


r~


Re: [RFC] Kernel livepatching support in GCC

2015-05-28 Thread Andreas Krebbel
On 05/28/2015 11:16 AM, Maxim Kuvyrkov wrote:
>> On May 28, 2015, at 11:59 AM, Richard Biener wrote:
...
>> Maybe follow s390 -mhotpatch instead?
> 
> Regarding implementation of the option, it will follow what s390 is doing 
> with function attributes to mark which functions to apply nop-treatment to 
> (using attributes will avoid problems with [coming] LTO builds of the 
> kernel).  The new option will set value of the attribute on all functions in 
> current compilation unit, and then nops will be generated from the attribute 
> specification.
> 
> On the other hand, s390 does not generate a section of descriptor entries of 
> NOP pads, which seems like a useful (or necessary) option.  A more-or-less 
> generic implementation should, therefore, combine s390's attributes approach 
> to annotating functions and x86's approach to providing information in an ELF 
> section about NOP entries.  Or can we record value of a function attribute in 
> ELF in a generic way?

I agree that would be useful. The only reason we didn't implement that was that 
our kernel guys were
confident enough to be able to detect patchable functions without it. We 
discussed two solutions to
that:

1. Add special relocations pointing to the patchable areas.
2. Add a special section listing all patchable areas. I think systemtap is 
doing something similar
for their probes.

> Whatever the specifics, implementation of livepatch support should be 
> decoupled from -pg/mcount dependency as I don't see any real need in 
> overloading mcount with livepatching stuff.

Agreed.

For user space hotpatching we also needed hotpatching areas *before* the 
function label to emit
trampolines there.  This perhaps should be covered by a generic approach as 
well.

-Andreas-



Re: Relocations to use when eliding plts

2015-05-28 Thread H.J. Lu
On Thu, May 28, 2015 at 8:29 AM, Richard Henderson  wrote:
> On 05/28/2015 04:27 AM, H.J. Lu wrote:
>> You get consecutive jmpq's because x86 PLT entry is used as the
>> canonical function address.  If you compile main with -fno-plt -fPIE, you
>> get:
>
> Well, duh.  If the main executable has no PLTs, they aren't used as the
> canonical function address.  Surely you aren't proposing that as a solution?
>

I was just explaining where those consecutive jmpq's came from.
I wasn't suggesting a solution..


-- 
H.J.


Re: Relocations to use when eliding plts

2015-05-28 Thread Richard Henderson
On 05/28/2015 08:42 AM, H.J. Lu wrote:
> On Thu, May 28, 2015 at 8:29 AM, Richard Henderson  wrote:
>> On 05/28/2015 04:27 AM, H.J. Lu wrote:
>>> You get consecutive jmpq's because x86 PLT entry is used as the
>>> canonical function address.  If you compile main with -fno-plt -fPIE, you
>>> get:
>>
>> Well, duh.  If the main executable has no PLTs, they aren't used as the
>> canonical function address.  Surely you aren't proposing that as a solution?
>>
> 
> I was just explaining where those consecutive jmpq's came from.
> I wasn't suggesting a solution..

I did explain it.  In the quite long message.

No comments about the rest of it, wherein I suggest a solution that doesn't
require the main executable to be compiled with -fno-plt in order to avoid them?


r~


Re: Relocations to use when eliding plts

2015-05-28 Thread Jakub Jelinek
On Thu, May 28, 2015 at 08:52:28AM -0700, Richard Henderson wrote:
> On 05/28/2015 08:42 AM, H.J. Lu wrote:
> > On Thu, May 28, 2015 at 8:29 AM, Richard Henderson  wrote:
> >> On 05/28/2015 04:27 AM, H.J. Lu wrote:
> >>> You get consecutive jmpq's because x86 PLT entry is used as the
> >>> canonical function address.  If you compile main with -fno-plt -fPIE, you
> >>> get:
> >>
> >> Well, duh.  If the main executable has no PLTs, they aren't used as the
> >> canonical function address.  Surely you aren't proposing that as a 
> >> solution?
> >>
> > 
> > I was just explaining where those consecutive jmpq's came from.
> > I wasn't suggesting a solution..
> 
> I did explain it.  In the quite long message.
> 
> No comments about the rest of it, wherein I suggest a solution that doesn't
> require the main executable to be compiled with -fno-plt in order to avoid 
> them?

And even that wouldn't help, you'd need to compile the binaries with -fpie 
-fno-plt,
as -fno-plt doesn't affect normal non-PIC calls.

Jakub


Re: Relocations to use when eliding plts

2015-05-28 Thread H.J. Lu
On Thu, May 28, 2015 at 9:02 AM, Jakub Jelinek  wrote:
> On Thu, May 28, 2015 at 08:52:28AM -0700, Richard Henderson wrote:
>> On 05/28/2015 08:42 AM, H.J. Lu wrote:
>> > On Thu, May 28, 2015 at 8:29 AM, Richard Henderson  wrote:
>> >> On 05/28/2015 04:27 AM, H.J. Lu wrote:
>> >>> You get consecutive jmpq's because x86 PLT entry is used as the
>> >>> canonical function address.  If you compile main with -fno-plt -fPIE, you
>> >>> get:
>> >>
>> >> Well, duh.  If the main executable has no PLTs, they aren't used as the
>> >> canonical function address.  Surely you aren't proposing that as a 
>> >> solution?
>> >>
>> >
>> > I was just explaining where those consecutive jmpq's came from.
>> > I wasn't suggesting a solution..
>>
>> I did explain it.  In the quite long message.
>>
>> No comments about the rest of it, wherein I suggest a solution that doesn't
>> require the main executable to be compiled with -fno-plt in order to avoid 
>> them?
>
> And even that wouldn't help, you'd need to compile the binaries with -fpie 
> -fno-plt,
> as -fno-plt doesn't affect normal non-PIC calls.
>

Funny you should mention it.  Here is a patch to extend -fno-plt
to normal non-PIC calls.  64-bit works with the current binutils.  32-bit
only works with the users/hjl/relax branch.  I need to add a configure test
to enable it for 32-bit.
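
The intended effect on a plain non-PIC, non-PIE call is roughly the following; this is a sketch, and the exact sequences depend on the templates in the patch below and on the binutils support mentioned above:

extern int foo (int x);

int
call_foo (int x)
{
  /* Without the patch:  call foo                  (through the PLT)
     With -fno-plt:      call *foo@GOTPCREL(%rip)  on x86-64, i.e. an
     indirect call through foo's GOT slot, which a linker with the
     proposed relaxation support could turn back into a direct call
     when foo is defined locally.  */
  return foo (x);
}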


-- 
H.J.
---
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e77cd04..db7ce3d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -25611,7 +25611,22 @@ ix86_output_call_insn (rtx_insn *insn, rtx call_op)
   if (SIBLING_CALL_P (insn))
     {
       if (direct_p)
-        xasm = "%!jmp\t%P0";
+        {
+          if (!flag_plt
+              && !flag_pic
+              && !TARGET_MACHO
+              && !TARGET_SEH
+              && !TARGET_PECOFF)
+            {
+              /* Avoid PLT.  */
+              if (TARGET_64BIT)
+                xasm = "%!jmp\t*%p0@GOTPCREL(%%rip)";
+              else
+                xasm = "%!jmp\t*%p0@GOT";
+            }
+          else
+            xasm = "%!jmp\t%P0";
+        }
       /* SEH epilogue detection requires the indirect branch case
          to include REX.W.  */
       else if (TARGET_SEH)
@@ -25654,7 +25669,22 @@ ix86_output_call_insn (rtx_insn *insn, rtx call_op)
     }

   if (direct_p)
-    xasm = "%!call\t%P0";
+    {
+      if (!flag_plt
+          && !flag_pic
+          && !TARGET_MACHO
+          && !TARGET_SEH
+          && !TARGET_PECOFF)
+        {
+          /* Avoid PLT.  */
+          if (TARGET_64BIT)
+            xasm = "%!call\t*%p0@GOTPCREL(%%rip)";
+          else
+            xasm = "%!call\t*%p0@GOT";
+        }
+      else
+        xasm = "%!call\t%P0";
+    }
   else
     xasm = "%!call\t%A0";


Re: Better info for combine results in worse code generated

2015-05-28 Thread David Edelsohn
On Thu, May 28, 2015 at 11:13 AM, Alan Modra  wrote:
> On Thu, May 28, 2015 at 10:47:53AM -0400, David Edelsohn wrote:
>> This seems like a problem with the cost model.  Rc instructions are
>> more expensive and should be represented as such in rtx_costs.
>
> The record instructions do have a higher cost (8 vs. 4 for normal
> insns).  If the cost is increased I don't think you'll see them
> generated at all, which would fix my testcase but probably regress
> others.

It still seems to be a cost issue.  You and I "know" that "andi." is
more "expensive" in this context.  Somehow we need to teach GCC that
the instruction is "expensive" in this situation but useful in others.
I'm not sure that we want to inhibit GCC combine from making the
transformation for other reasons because it otherwise is correct.

Thanks, David


Re: Relocations to use when eliding plts

2015-05-28 Thread Rich Felker
On Thu, May 28, 2015 at 08:29:31AM -0700, Richard Henderson wrote:
> On 05/28/2015 04:27 AM, H.J. Lu wrote:
> > You get consecutive jmpq's because x86 PLT entry is used as the
> > canonical function address.  If you compile main with -fno-plt -fPIE, you
> > get:
> 
> Well, duh.  If the main executable has no PLTs, they aren't used as the
> canonical function address.  Surely you aren't proposing that as a solution?

Why not? Is there a way we could prevent the main program from having
PLT even when it's non-PIE? Instead of:

call foo

the compiler could generate

call *foo@GOTABS_RELAXABLE

Then the linker would replace this with "call foo" if foo is defined
in the main program. For address loads, instead of:

mov $foo, %eax

or:

lea foo, %eax

you would have:

mov foo@GOTABS_RELAXABLE, %eax

and the linker could likewise relax this to an immediate mov. More
elaborate arithmetic on the function address might be hard to do in an
efficient but relaxable way; however, I don't think the compiler ever
needs to do that, and if it did, there would just be a few odd cases
that still generate PLT thunks.

Am I missing something?

Rich


Re: [RFC] Kernel livepatching support in GCC

2015-05-28 Thread Andi Kleen
> Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount 
> options to other architectures, we should implement a target-independent 
> option -fprolog-pad=N, which will generate a pad of N nops at the beginning 
> of each function and add a section entry describing the pad similar to 
> -mrecord-mcount [1].

Sounds fine to me.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only


Re: Relocations to use when eliding plts

2015-05-28 Thread Richard Henderson

On 05/28/2015 10:59 AM, Rich Felker wrote:
> Am I missing something?

You're not missing anything.  But do you want the performance of a library to 
depend on how the main executable is compiled?



r~


Re: Relocations to use when eliding plts

2015-05-28 Thread Rich Felker
On Thu, May 28, 2015 at 11:41:10AM -0700, Richard Henderson wrote:
> On 05/28/2015 10:59 AM, Rich Felker wrote:
> >Am I missing something?
> 
> You're not missing anything.  But do you want the performance of a
> library to depend on how the main executable is compiled?

Not directly. But I'd rather be in that situation than have
pessimizations in library codegen to avoid it. I'm worried about cases
where code both loads the address of a function and calls it, such as
this (stupid) example:

a((void *)a);

Would having separate handling of the address-for-call and
address-for-function-pointer result in the compiler emitting 2
separate GOT loads (and consuming 2 registers) here in an effort to
avoid the possibility of inefficiency from a PLT thunk in the main
program?

In my vision, main programs are always or almost-always (e.g. just
exceptions for stuff like emacs) PIE and the PLT-in-main-program issue
is a non-issue, so I don't want to risk hurting codegen on the library
side just to make a legacy usage (non-PIE) mildly more efficient.

Rich


Re: Relocations to use when eliding plts

2015-05-28 Thread Jakub Jelinek
On Thu, May 28, 2015 at 03:29:02PM -0400, Rich Felker wrote:
> > You're not missing anything.  But do you want the performance of a
> > library to depend on how the main executable is compiled?
> 
> Not directly. But I'd rather be in that situation than have
> pessimizations in library codegen to avoid it. I'm worried about cases
> where code both loads the address of a function and calls it, such as
> this (stupid) example:
> 
>   a((void *)a);

That can be handled by using just one GOT slot, the non-.got.plt one;
only if all the relocations guarantee that address equality is
not important would it use the faster (*_JUMP_SLOT?) relocations.

> In my vision, main programs are always or almost-always (e.g. just
> exceptions for stuff like emacs) PIE and the PLT-in-main-program issue
> is a non-issue, so I don't want to risk hurting codegen on the library
> side just to make a legacy usage (non-PIE) mildly more efficient.

Calling non-PIEs legacy is maybe your vision, but there will always be a very
good reason for non-PIEs.  And even PIEs don't really help you: there is no
reason why code in PIEs couldn't use (if it doesn't already) RIP-relative
relocations for functions not known to be defined in the current
TU; it can just refer to the PLT slots like it does in non-PIE binaries.

Jakub


Re: Better info for combine results in worse code generated

2015-05-28 Thread Segher Boessenkool
On Fri, May 29, 2015 at 12:09:42AM +0930, Alan Modra wrote:
> It's really annoying when you fix a combine bug and get worse code..

Heh.  You've been on the receiving end of that a lot lately :-/

> void foo (signed char *p) { if (*p != 0) *p = 1; }
> 
>         before              after
> foo:                    foo:
>         lbz 9,0(3)          lbz 9,0(3)
>         cmpwi 7,9,0         andi. 10,9,0xff
>         beqlr 7             beqlr 0
>         li 9,1              li 9,1
>         stb 9,0(3)          stb 9,0(3)
>         blr                 blr
> 
> That record form andi. is slower on many processors,

Is it?  On which processors?

> and restricted to setting cr0.

Yes.  If it is allocated a different crn it is split to a rlwinm and a
cmpw, but that is much too late for the rlwinm to be combined with the
lbz again.

> one of the aims of the wider patch I was working
> on was to remove patterns like rotlsi3_64, ashlsi3_64, lshrsi3_64 and
> ashrsi3_64.

We will need such patterns no matter what; the compiler cannot magically
know what machine insns set the high bits of a 64-bit reg to zero.

We should have something nicer than the current duplication though.  Maybe
define_subst can help.  Maybe something a little bit more powerful than
that is needed though.


Segher


Re: Relocations to use when eliding plts

2015-05-28 Thread Rich Felker
On Thu, May 28, 2015 at 09:40:57PM +0200, Jakub Jelinek wrote:
> On Thu, May 28, 2015 at 03:29:02PM -0400, Rich Felker wrote:
> > > You're not missing anything.  But do you want the performance of a
> > > library to depend on how the main executable is compiled?
> > 
> > Not directly. But I'd rather be in that situation than have
> > pessimizations in library codegen to avoid it. I'm worried about cases
> > where code both loads the address of a function and calls it, such as
> > this (stupid) example:
> > 
> > a((void *)a);
> 
> That can be handled by using just one GOT slot, the non-.got.plt one;
> only if there are only relocations that guarantee that address equality is
> not important it would use the faster (*_JUMP_SLOT?) relocations.

How far would this extend, e.g. in the case of LTO or compiling the
whole library at once?

> > In my vision, main programs are always or almost-always (e.g. just
> > exceptions for stuff like emacs) PIE and the PLT-in-main-program issue
> > is a non-issue, so I don't want to risk hurting codegen on the library
> > side just to make a legacy usage (non-PIE) mildly more efficient.
> 
> Calling non-PIEs legacy is maybe your vision, but there will always be a very
> good reason for non-PIEs.  And, even PIEs don't really help you, there is no
> reason why even in PIEs code couldn't use (if it doesn't already) RIP
> relative relocations for functions not known to be defined in the current
> TU; it can just refer to the PLT slots like it does in non-PIE binaries.

I agree completely. I don't think support for non-PIE should be
removed for anything, just that if there are tradeoffs between
optimizing non-PIE and optimizing PIE/PIC, we should opt for the
latter since it's the direction things should take moving forward.

Rich


gcc-4.8-20150528 is now available

2015-05-28 Thread gccadmin
Snapshot gcc-4.8-20150528 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20150528/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.8 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch 
revision 223852

You'll find:

 gcc-4.8-20150528.tar.bz2 Complete GCC

  MD5=f4860311415f5c54a7674c773b91a32a
  SHA1=b272c445dff65eb6b9c0cc3904ac6199623ad8b0

Diffs from 4.8-20150521 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: [RFC] Kernel livepatching support in GCC

2015-05-28 Thread Jim Wilson
On 05/28/2015 01:39 AM, Maxim Kuvyrkov wrote:
> Hi,
> 
> Akashi-san and I have been discussing required GCC changes to make kernel's 
> livepatching work for AArch64 and other architectures.  At the moment 
> livepatching is supported for x86[_64] using the following options: "-pg 
> -mfentry -mrecord-mcount -mnop-mcount" which is geek-speak for "please add 
> several NOPs at the very beginning of each function, and make a section with 
> addresses of all those NOP pads".

FYI, there is also the darwin/rs6000 -mfix-and-continue support, which
adds 5 nops to the prologue.  This was a part of a gdb feature, to allow
one to load a fixed function into a binary inside the debugger, and then
continue executing with the fixed code.  It sounds like your kernel
feature is doing something very similar.  If you are making this a
generic feature, then maybe the darwin/rs6000 -mfix-and-continue support
can be merged with it somehow.

Jim



avr non-optimal optimization

2015-05-28 Thread Ralph Doncaster
I tried compiling the following code with  -mmcu=attiny13a  -Os -flto
using 4.8 and 5.1:

#include <avr/io.h>   /* for PORTB and SREG */

#define NOT_A_REG 0

#define digitalPinToPortReg(PIN) \
( ((PIN) >= 0 && (PIN) <= 7) ? &PORTB : NOT_A_REG)

#define digitalPinToBit(P) ((P) & 7)

#define HIGH 1
#define LOW 0

inline void digitalWrite(int pin, int state)
{
if ( state & 1 ) {
// set pin
*(digitalPinToPortReg(pin)) |= (1 << digitalPinToBit(pin));
}
if ( !(state & 1 ) ) {
// clear pin
*(digitalPinToPortReg(pin)) &= ~(1 << digitalPinToBit(pin));
}
}

void main()
{
digitalWrite(3,SREG);
}

Both compiled to the same assembler code:
  22:   0f b6           in      r0, 0x3f        ; 63
  24:   00 fe           sbrs    r0, 0
  26:   02 c0           rjmp    .+4             ; 0x2c
  28:   c3 9a           sbi     0x18, 3         ; 24
  2a:   08 95           ret
  2c:   c3 98           cbi     0x18, 3         ; 24
  2e:   08 95           ret

However the following code is 1 instruction shorter (and 1 cycle
faster when other code follows that would require rjmp +2 at 2a):
in r0, 0x3f
sbrs r0, 0
cbi 0x13, 3
sbrc r0, 0
sbi 0x13, 3
ret


Re: Better info for combine results in worse code generated

2015-05-28 Thread Alan Modra
On Thu, May 28, 2015 at 02:42:22PM -0500, Segher Boessenkool wrote:
> > That record form andi. is slower on many processors,
> 
> Is it?  On which processors?

That sort of info is in the IBM confidential processor book4
supplements.  So I can't tell you.  (I think it is completely crazy to
keep information out of the hands of engineers, but my opinion doesn't
count for much..)  I'll tell you one of the reasons why they are
slower, as any decent hardware engineer could probably figure this out
themselves anyway.  The record form instructions are cracked into two
internal ops, the basic arithmetic/logic op, and a compare.  There's a
limit to how much hardware can do in one clock cycle, or conversely,
if you try to do more your clock must be slower.

> > one of the aims of the wider patch I was working
> > on was to remove patterns like rotlsi3_64, ashlsi3_64, lshrsi3_64 and
> > ashrsi3_64.
> 
> We will need such patterns no matter what; the compiler cannot magically
> know what machine insns set the high bits of a 64-bit reg to zero.

No, not by magic.  I define EXTEND_OP in rs6000.h and use it in
record_value_for_reg.  Full patch follows.  I see enough code gen
improvements on powerpc64le to make this patch worth pursuing,
things like "rlwinm 0,5,6,0,25; extsw 0,0" being converted to
"rldic 0,5,6,52".  No doubt due to being able to prove an int var
doesn't have the sign bit set.  Hmm, in fact the 52 says it is
known to be only 6 bits before shifting.

Index: combine.c
===
--- combine.c   (revision 223725)
+++ combine.c   (working copy)
@@ -1739,7 +1739,7 @@ set_nonzero_bits_and_sign_copies (rtx x, const_rtx
 
   if (set == 0 || GET_CODE (set) == CLOBBER)
{
- rsp->nonzero_bits = GET_MODE_MASK (GET_MODE (x));
+ rsp->nonzero_bits = ~(unsigned HOST_WIDE_INT) 0;
  rsp->sign_bit_copies = 1;
  return;
}
@@ -1769,7 +1769,7 @@ set_nonzero_bits_and_sign_copies (rtx x, const_rtx
  break;
  if (!link)
{
- rsp->nonzero_bits = GET_MODE_MASK (GET_MODE (x));
+ rsp->nonzero_bits = ~(unsigned HOST_WIDE_INT) 0;
  rsp->sign_bit_copies = 1;
  return;
}
@@ -1788,7 +1788,7 @@ set_nonzero_bits_and_sign_copies (rtx x, const_rtx
update_rsp_from_reg_equal (rsp, insn, set, x);
   else
{
- rsp->nonzero_bits = GET_MODE_MASK (GET_MODE (x));
+ rsp->nonzero_bits = ~(unsigned HOST_WIDE_INT) 0;
  rsp->sign_bit_copies = 1;
}
 }
@@ -9832,10 +9832,16 @@ reg_nonzero_bits_for_combine (const_rtx x, machine
   REGNO (x)
 {
   unsigned HOST_WIDE_INT mask = rsp->last_set_nonzero_bits;
+  machine_mode mask_mode = rsp->last_set_mode;
 
-  if (GET_MODE_PRECISION (rsp->last_set_mode) < GET_MODE_PRECISION (mode))
+  /* We possibly calculated last_set_nonzero_bits in a wider mode.  */
+  if (GET_MODE_CLASS (mask_mode) == MODE_INT
+ && GET_MODE_PRECISION (mask_mode) < HOST_BITS_PER_WIDE_INT)
+   mask_mode = nonzero_bits_mode;
+
+  if (GET_MODE_PRECISION (mask_mode) < GET_MODE_PRECISION (mode))
/* We don't know anything about the upper bits.  */
-   mask |= GET_MODE_MASK (mode) ^ GET_MODE_MASK (rsp->last_set_mode);
+   mask |= GET_MODE_MASK (mode) ^ GET_MODE_MASK (mask_mode);
 
   *nonzero &= mask;
   return NULL;
@@ -9852,16 +9858,8 @@ reg_nonzero_bits_for_combine (const_rtx x, machine
   return tem;
 }
   else if (nonzero_sign_valid && rsp->nonzero_bits)
-{
-  unsigned HOST_WIDE_INT mask = rsp->nonzero_bits;
+*nonzero &= rsp->nonzero_bits;
 
-  if (GET_MODE_PRECISION (GET_MODE (x)) < GET_MODE_PRECISION (mode))
-   /* We don't know anything about the upper bits.  */
-   mask |= GET_MODE_MASK (mode) ^ GET_MODE_MASK (GET_MODE (x));
-
-  *nonzero &= mask;
-}
-
   return NULL;
 }
 
@@ -9883,7 +9881,11 @@ reg_num_sign_bit_copies_for_combine (const_rtx x,
 
   rsp = &reg_stat[REGNO (x)];
   if (rsp->last_set_value != 0
-  && rsp->last_set_mode == mode
+  && (rsp->last_set_mode == mode
+ || (GET_MODE_CLASS (rsp->last_set_mode) == MODE_INT
+ && GET_MODE_CLASS (mode) == MODE_INT
+ && (GET_MODE_PRECISION (mode)
+ <= GET_MODE_PRECISION (rsp->last_set_mode))))
   && ((rsp->last_set_label >= label_tick_ebb_start
   && rsp->last_set_label < label_tick)
  || (rsp->last_set_label == label_tick
@@ -9895,7 +9897,12 @@ reg_num_sign_bit_copies_for_combine (const_rtx x,
  (DF_LR_IN (ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb),
   REGNO (x)
 {
-  *result = rsp->last_set_sign_bit_copies;
+  int signbits = rsp->last_set_sign_bit_copies;
+  signbits -= (GET_MODE_PRECISION (rsp->last_set_mode)
+  - GET_MODE_PRECISION (mode));
+  if (signbits <= 0)
+   signbit

Identifying Chain of Recurrence

2015-05-28 Thread Pritam Gharat
GCC builds a chain of recurrence to capture a pattern in which an
array is accessed in a loop. Is there any function which identifies
that gcc has built a chain of recurrence? Is this information
associated with the gimple assignment which accesses the array elements?


Thanks,
Pritam Gharat


Re: Identifying Chain of Recurrence

2015-05-28 Thread Bin.Cheng
On Fri, May 29, 2015 at 12:41 PM, Pritam Gharat wrote:
> GCC builds a chain of recurrence to capture a pattern in which an
> array is accessed in a loop. Is there any function which identifies
> that gcc has built a chain of recurrence? Is this information
> associated to the gimple assignment which accesses the array elements?
>
GCC analyzes the evolution of scalar variables on the SSA representation.
Each SSA var is treated as unique and can be analyzed.  If the address
expression itself is an SSA var, it can be analyzed by scev; otherwise,
the users of scev have to compute it themselves on the basis of other
scev vars.  For example, the address of MEM[scev_var] can be analyzed by
scev, while the address of MEM[invariant_base+scev_iv] is computed by the
users.  Well, at least this is how IVOPT works.  You can refer to the
top-of-file comment in tree-scalar-evolution.c for detailed information.
The general routines for chains of recurrence are in tree-chrec.c.
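
A minimal sketch of what that looks like from a pass's point of view; this is GCC-internal code, assuming the usual pass boilerplate and that the loop optimizer and scev machinery are initialized:

#include "tree-scalar-evolution.h"
#include "tree-chrec.h"

/* Return true if VAR evolves as a chain of recurrence in LOOP,
   i.e. scev can express it as {base, +, step}_loop.  */
static bool
evolves_as_chrec_p (struct loop *loop, tree var)
{
  tree ev = analyze_scalar_evolution (loop, var);

  if (TREE_CODE (ev) == POLYNOMIAL_CHREC)
    {
      tree base = CHREC_LEFT (ev);    /* initial value */
      tree step = CHREC_RIGHT (ev);   /* per-iteration increment */
      return base != NULL_TREE && step != NULL_TREE;
    }

  return false;
}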

Thanks,
bin
>
> Thanks,
> Pritam Gharat