Missed optimizations at -Os
Hi,

For this (reduced) test case

extern int x, y, z;
void foo(void);
void bar(void);
void blah(void);

void test (void)
{
  int flag = 0;
  flag = ((x && y) || z);

  if (flag && x && y) {
    bar();
  }
}

I expected gcc -Os (x86_64, if it matters) to generate code equivalent to

if (x && y)
  bar();

Instead, I get

test():
        mov     eax, DWORD PTR x[rip]
        test    eax, eax
        je      .L2
        cmp     DWORD PTR y[rip], 0
        jne     .L3
.L2:
        cmp     DWORD PTR z[rip], 0
        je      .L1
        test    eax, eax
        je      .L1
.L3:
        cmp     DWORD PTR y[rip], 0
        je      .L1
        jmp     bar()
.L1:
        ret

At -O1 and above though, I get what I expected. At -O3

test():
        mov     edx, DWORD PTR x[rip]
        test    edx, edx
        je      .L1
        mov     eax, DWORD PTR y[rip]
        test    eax, eax
        je      .L1
        jmp     bar()
.L1:
        rep ret

Tracing through the dumps, I see that dom2 is where the gimple starts
diverging. At -O3, dom2 clones the bb that tests z into two copies, and
I guess that enables jump threading and subsequent dse to optimize away the
second (redundant) check for x and y, as well as the check for z. At -Os,
dom2 doesn't attempt the bb clone as it thinks it would increase code
size.

I have two questions.

1. Is the analysis right? Is there anything that can be done to fix this?

2. If nothing can be done to fix this, is there some pass that can
rewire goto in to goto ?
Regards
Senthil

.dom2 at Os

test ()
{
  int x.1_1;
  int y.2_2;
  int z.3_3;
  int y.5_4;
  _Bool _8;
  _Bool _9;
  _Bool _10;

  [100.00%]:
  x.1_1 = x;
  if (x.1_1 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [50.00%]:
  y.2_2 = y;
  if (y.2_2 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [75.00%]:
  z.3_3 = z;
  _8 = x.1_1 != 0;
  _9 = z.3_3 != 0;
  _10 = _8 & _9;
  if (_10 != 0)
    goto ; [25.60%]
  else
    goto ; [74.40%]

  [35.37%]:
  y.5_4 = y;
  if (y.5_4 != 0)
    goto ; [48.99%]
  else
    goto ; [51.01%]

  [17.33%]:
  bar ();

  [100.00%]:
  return;

}

.dom2 at O3

test ()
{
  int x.1_1;
  int y.2_2;
  int z.3_11;
  _Bool _12;
  _Bool _13;
  _Bool _14;
  int y.5_15;
  int z.3_16;
  _Bool _17;
  _Bool _18;
  _Bool _19;
  int y.5_20;

  [100.00%]:
  x.1_1 = x;
  if (x.1_1 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [50.00%]:
  y.2_2 = y;
  if (y.2_2 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [100.00%]:
  return;

  [50.00%]:
  z.3_11 = z;
  _12 = x.1_1 != 0;
  _13 = z.3_11 != 0;
  _14 = _12 & _13;
  goto ; [100.00%]

  [18.04%]:
  y.5_15 = y;
  goto ; [100.00%]

  [25.00%]:
  z.3_16 = z;
  _17 = x.1_1 != 0;
  _18 = z.3_16 != 0;
  _19 = _17 & _18;
  if (_19 != 0)
    goto ; [72.17%]
  else
    goto ; [27.83%]

  [25.00%]:
  y.5_20 = y;
  bar ();
  goto ; [100.00%]

}
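The expectation in the message above rests on the claim that
(((x && y) || z) && x && y) is logically the same as (x && y). The
following is a minimal sketch that checks this claim by brute force over a
few sample values. The helper names (original_cond, simplified_cond,
verify_equivalence) and the sample set are made up for illustration; this
only checks the source-level logic and says nothing about what any GCC
pass does.

```c
#include <assert.h>
#include <stdbool.h>

/* The condition as written in the reduced test case. */
static bool original_cond(int x, int y, int z)
{
    int flag = ((x && y) || z);
    return flag && x && y;
}

/* The condition the poster expected the compiler to reduce it to. */
static bool simplified_cond(int x, int y)
{
    return x && y;
}

/* Checks both forms on every (x, y, z) combination drawn from a small
   sample set and returns the number of combinations checked.
   The sample values are arbitrary; only zero vs. non-zero matters. */
static int verify_equivalence(void)
{
    static const int samples[] = { -1, 0, 1, 2 };
    const int n = (int)(sizeof samples / sizeof samples[0]);
    int checked = 0;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                assert(original_cond(samples[i], samples[j], samples[k])
                       == simplified_cond(samples[i], samples[j]));
                checked++;
            }
    return checked;
}
```

Whenever x && y holds, flag is already true, so the extra test on z can
never change the outcome; that is the redundancy the -O3 code removes.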
Re: Missed optimizations at -Os
On Tue, 17 Jan 2017, Senthil Kumar Selvaraj wrote:

> I have two questions.
>
> 1. Is the analysis right? Is there anything that can be done to fix this?
>
> 2. If nothing can be done to fix this, is there some pass that can
> rewire goto in to goto ?

We're missing a pass that does predicate simplification combining both
CFG and stmt form. if-combine is supposed to catch some cases, reassoc
catches some others.

Richard.

-- 
Richard Biener
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Re: make[1]: *** wait: No child processes during make -j8 check
On 01/16/2017 05:37 PM, Martin Sebor wrote:
> I've run into this failure during make check in the past with
> a very large make -j value (such as -j128), but today I've had
> two consecutive make check runs fail with -j12 and -j8 on my 8
> core laptop with not much else going on. The last thing running
> was the go test suite. Has something changed recently that
> could be behind it? (My user process limit is 62863.)

What version of make are you using? There was a bug where make could
lose track of children a while back. Ask Patsy -- she may remember the
RHEL BZ and from that we can probably extract the version #s that were
affected and see if they match the version you're running.

jeff
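For context on the error text in the subject line: "No child processes" is
the usual message string for errno ECHILD, which the wait() family reports
when the caller has no children left to wait for. That fits the "lose track
of children" theory above, though the sketch below only demonstrates the
errno value, not make's internals. The helper name wait_errno is made up
for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Calls wait() once. Returns 0 if a child was reaped, otherwise the
   errno value that wait() reported. */
static int wait_errno(void)
{
    errno = 0;
    if (wait(NULL) == (pid_t)-1)
        return errno;
    return 0;
}
```

Called from a process that has never forked, wait_errno() is expected to
return ECHILD on a POSIX system.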
Re: make[1]: *** wait: No child processes during make -j8 check
On 01/17/2017 08:30 AM, Jeff Law wrote:
> On 01/16/2017 05:37 PM, Martin Sebor wrote:
>> I've run into this failure during make check in the past with
>> a very large make -j value (such as -j128), but today I've had
>> two consecutive make check runs fail with -j12 and -j8 on my 8
>> core laptop with not much else going on. The last thing running
>> was the go test suite. Has something changed recently that
>> could be behind it? (My user process limit is 62863.)
>
> What version of make are you using? There was a bug where make could
> lose track of children a while back. Ask Patsy -- she may remember the
> RHEL BZ and from that we can probably extract the version #s that were
> affected and see if they match the version you're running.
>
> jeff

I have make 4.0 and I'm running Fedora 23. I haven't changed
anything on the machine in a while and used to be able to run
make -j16. I can still bootstrap with that, but make check
has been failing recently. Three times yesterday (-j12, -j8,
and -j12 again), but then the last one with -j12 completed.

The frustrating thing is that after the failure I have to
restart make check from scratch, and so it can waste hours.

Martin
Re: GCC libatomic ABI specification draft
On Thu, 2016-11-17 at 12:12 -0800, Bin Fan wrote:
> On 11/14/2016 4:34 PM, Bin Fan wrote:
> > Hi All,
> >
> > I have an updated version of libatomic ABI specification draft. Please
> > take a look to see if it matches GCC implementation. The purpose of
> > this document is to establish an official GCC libatomic ABI, and allow
> > compatible compiler and runtime implementations on the affected
> > platforms.

Thanks for the update, and sorry for the late reply. Comments below.

> > - Rewrite section 3 to replace "lock-free" operations with "hardware
> > backed" instructions. The digest of this section is: 1) inlineable
> > atomics must be implemented with the hardware backed atomic
> > instructions. 2) for non-inlineable atomics, the compiler must
> > generate a runtime call, and the runtime support function is free to
> > use any implementation.

OK.

I still think that using hardware-backed instructions for a particular
type requires that there is a true atomic load instruction for that
type. Emulating a load with an idempotent store (eg, cmpxchg16b) is not
useful, overall.

One could argue that an idempotent atomic HW store such as a cmpxchg16b
in a loop is indeed lock-free. However, IMO the intention behind
"lock-free" atomics in C and C++ is to offer atomics that are both
lock-free *and* as fast as one would assume for a fully HW-backed
solution for atomic accesses. This includes that loads must be cheaper
than stores, in particular under contention / concurrent accesses by
several threads.

I believe that "fast" is much more often part of the motivation for
using lock-free atomics than the actual "lock-free" property, that is,
the progress-guarantee aspect (which isn't even lock-free but
obstruction-free, see below). If we do see a sufficiently strong need
for lock-free atomics, we should build something just for that (eg, if
removing the address-free requirement, we can support lock-free (in the
progress-guarantee sense) operations for a lot more types).
Also, while that previous issue is "just" a performance issue, the fact
that we could issue a store when calling atomic_load() is a correctness
issue, I think. One example is volatile atomic loads; while C/C++ don't
really constrain what a volatile load needs to be in the underlying
implementation, I think most users would assume that a load really means
a hardware load instruction of some sort, and nothing else. cmpxchg16b
conflicts with such an assumption. Another example is read-only mapped
memory.

Bottom line: we shouldn't rely solely on cmpxchg16b and similar.
(Though this doesn't necessarily mean that there can't be compiler flags
that enable its use.)

I think the ABI should set a baseline for each architecture, and the
baseline decides whether something is inlinable or not. Thus, the
x86_64 ABI would make __int128 operations not inlinable (because of the
issues with cmpxchg16b, see above).

If users want to use capabilities beyond the baseline, they can choose
to use flags that alter/extend the ABI. For example, if they use a flag
that explicitly enables the use of cmpxchg16b for atomics, they also
need to use a libatomic implementation built in the same way (if
possible). This then creates a new ABI(-variant), basically.

I made a few tests on my x86_64 machine a few weeks ago, and I didn't
see cmpxchg16b being used. IIRC, I also looked at libatomic and didn't
see it (but I don't remember for sure). Either way, if I am wrong and
we are using cmpxchg16b for loads, this should be fixed. Ideally, this
should be fixed before the stage 3 deadline this Friday. Such a fix
might potentially break existing uses, but the earlier we fix this, the
better.

Section 3 Rationale, alternative 1: I'm wondering if the example is
correct.
For a 4-byte-aligned type of size 3, the implementation cannot
simply use 4-byte hardware-backed atomics because this will inevitably
touch the 4th byte I think, and the implementation can't know whether
this is padding or not. Or do we expect that things like packed structs
are disallowed?

N3.1: Why do you assume that 8-byte HW atomics are available on i386?
Because cmpxchg8b is available for CPUs that are the lowest i?86 we
still intend to support?

I'd also use "hardware-backed" instead of "hardware backed".

> > - The Rationale section in section 3 is also revised to remove the
> > mentioning of "lock-free", but there is not major change of concept.
> >
> > - Add note N3.1 to emphasize the assumption of general hardware
> > supported atomic instruction
> >
> > - Add note N3.2 to discuss the issues of cmpxchg16b

See above.

> > - Add a paragraph in section 4.1 to specify memory_order_consume must
> > be implemented through memory_order_acquire. Section 4.2 emphasizes it
> > again.
> >
> > - The specification of each runtime functions mostly maps to the
> > corresponding generic functions in the C11 standard. Two functions are
> > worth noting:
> > 1) C11 atomic_compare_exchange
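To make the "load emulated with an idempotent store" concern from this
message concrete, here is a minimal C11 sketch. It uses a 64-bit object as
a stand-in for the 16-byte case, since cmpxchg16b availability depends on
target flags; the function names load_via_cas and load_plain are made up
for illustration and are not libatomic entry points.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulates a load with a compare-and-swap. The CAS is a read-modify-write
   operation: if the object happens to hold 0 it stores 0 back, and
   otherwise it fails and copies the current value into 'expected'.
   Either way 'expected' ends up holding the object's value, but the
   object has to be writable for the operation to be usable at all. */
static uint64_t load_via_cas(_Atomic uint64_t *obj)
{
    uint64_t expected = 0;
    atomic_compare_exchange_strong(obj, &expected, expected);
    return expected;
}

/* A true atomic load: only reads the object. */
static uint64_t load_plain(_Atomic uint64_t *obj)
{
    return atomic_load(obj);
}
```

Both functions return the same value for an ordinary writable object. The
difference the message points at is in the side conditions: the CAS
variant needs write access to the memory (a problem for read-only
mappings) and is not what a user would expect a volatile atomic load to
compile to.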
Re: make[1]: *** wait: No child processes during make -j8 check
On Tue, Jan 17, 2017 at 11:59 AM, Martin Sebor wrote:
> On 01/17/2017 08:30 AM, Jeff Law wrote:
>>
>> On 01/16/2017 05:37 PM, Martin Sebor wrote:
>>>
>>> I've run into this failure during make check in the past with
>>> a very large make -j value (such as -j128), but today I've had
>>> two consecutive make check runs fail with -j12 and -j8 on my 8
>>> core laptop with not much else going on. The last thing running
>>> was the go test suite. Has something changed recently that
>>> could be behind it? (My user process limit is 62863.)
>>
>> What version of make are you using? There was a bug where make could
>> lose track of children a while back. Ask Patsy -- she may remember the
>> RHEL BZ and from that we can probably extract the version #s that were
>> affected and see if they match the version you're running.
>
>
> I have make 4.0 and I'm running Fedora 23. I haven't changed
> anything on the machine in a while and used to be able to run
> make -j16. I can still bootstrap with that, but make check
> has been failing recently. Three times yesterday (-j12, -j8,
> and -j12 again), but then the last one with -j12 completed.
>
> The frustrating thing is that after the failure I have to
> restart make check from scratch, and so it can waste hours.

I haven't seen any catastrophic failures, but have noticed some
inconsistencies with parallel make check. Sometimes it seems to
substitute a different filename. It seems like the make check -j harness
assumes some unique names for temporary / intermediate files that are
not guaranteed by the scripts.

Thanks, David
Help math RTL patterns...
Hi All,

I am porting gcc for an internal processor and I am having some issues with
math instructions. Our processor uses two operands for math instructions
which are usually of the form OP0 = OP0 + OP1. The RTL pattern (for addm3)
in gcc uses the form OP0 = OP1 + OP2. I understand that gcc supposedly
supports the two operand flavor, but I have not been able to convince it to
do that for me. I tried the following RTL pattern with no success:

(define_insn "addhi3_op1_is_op0"
  [(set (match_operand:HI 0 "register_operand" "=a")
        (plus:HI (match_dup 0)
                 (match_operand:HI 1 "general_operand" "aim")))]
  ""
{
  output_asm_insn("//Start of addhi3_op1_is_op0 %0 = %1 + %2", operands);
  output_asm_insn("//End of addhi3_op1_is_op0", operands);
  return("");
}
)

So I used the three operand form and fixed things up in the code:

(define_insn "addhi3_regtarget"
  [(set (match_operand:HI 0 "register_operand" "=a")
        (plus:HI (match_operand:HI 1 "general_operand" "aim")
                 (match_operand:HI 2 "general_operand" "aim")))]
  ""
{
  output_asm_insn("//Start of addhi3_regtarget %0 = %1 + %2", operands);
  snap_do_basic_math_op_hi(operands, MATH_OP_PLUS);
  output_asm_insn("//End of addhi3_regtarget", operands);
  return("");
}
)

Of course this does not work for all cases since my fixup cannot detect if
the operands are the same memory location for OP0 and either OP1 or OP2. So
I am back to trying to find the right RTL magic to do this right. I have
looked over a number of machine descriptions but have not been able to find
the precise pattern for this.

Any help is greatly appreciated.

Steve Silva (Broadcom)
Re: make[1]: *** wait: No child processes during make -j8 check
On 01/17/2017 09:59 AM, Martin Sebor wrote:
> On 01/17/2017 08:30 AM, Jeff Law wrote:
>> On 01/16/2017 05:37 PM, Martin Sebor wrote:
>>> I've run into this failure during make check in the past with
>>> a very large make -j value (such as -j128), but today I've had
>>> two consecutive make check runs fail with -j12 and -j8 on my 8
>>> core laptop with not much else going on. The last thing running
>>> was the go test suite. Has something changed recently that
>>> could be behind it? (My user process limit is 62863.)
>>
>> What version of make are you using? There was a bug where make could
>> lose track of children a while back. Ask Patsy -- she may remember the
>> RHEL BZ and from that we can probably extract the version #s that were
>> affected and see if they match the version you're running.
>
> I have make 4.0 and I'm running Fedora 23. I haven't changed
> anything on the machine in a while and used to be able to run
> make -j16. I can still bootstrap with that, but make check
> has been failing recently. Three times yesterday (-j12, -j8,
> and -j12 again), but then the last one with -j12 completed.
>
> The frustrating thing is that after the failure I have to
> restart make check from scratch, and so it can waste hours.

I went back and found the bug we recently fixed in RHEL, but it was a
make-3.8x issue. Nothing useful in the RedHat/Fedora bug database.

jeff
Re: Help math RTL patterns...
On 01/17/2017 12:19 PM, Steve Silva via gcc wrote:
> Hi All,
>
> I am porting gcc for an internal processor and I am having some issues with
> math instructions. Our processor uses two operands for math instructions
> which are usually of the form OP0 = OP0 + OP1. The RTL pattern (for addm3)
> in gcc uses the form OP0 = OP1 + OP2. I understand that gcc supposedly
> supports the two operand flavor, but I have not been able to convince it to
> do that for me. I tried the following RTL pattern with no success:

> So I used the three operand form and fixed things up in the code:

That's nearly right. Use register constraints with the 3 op pattern:

(define_insn "addhi3"
  [(set (match_operand:HI 0 "register_operand" "+a")
        (plus:HI (match_operand:HI 1 "register_operand" "0")
                 (match_operand:HI 2 "general_operand" "aim")))]

The sh port may be instructive, IIRC it has a bunch of 2-op insns.

nathan

-- 
Nathan Sidwell
Re: Help math RTL patterns...
Hi Nathan,

Thanks for your advice. I retooled the addhi3 sequence to look like this:

(define_expand "addhi3"
  [(set (match_operand:HI 0 "snap_mem_or_reg" "+a,m")
        (plus:HI (match_operand:HI 1 "snap_mem_or_reg" "%0,0")
                 (match_operand:HI 2 "general_operand" "aim,aim")))]
  ""
  ""
)

(define_insn "addhi3_regtarget"
  [(set (match_operand:HI 0 "register_operand" "+a")
        (plus:HI (match_operand:HI 1 "register_operand" "%0")
                 (match_operand:HI 2 "general_operand" "aim")))]
  ""
{
  output_asm_insn("//Start of addhi3_regtarget %0 = %1 + %2", operands);
  snap_do_basic_math_op_hi(operands, MATH_OP_PLUS);
  output_asm_insn("//End of addhi3_regtarget", operands);
  return("");
}
)

(define_insn "addhi3_memtarget"
  [(set (match_operand:HI 0 "memory_operand" "+m")
        (plus:HI (match_operand:HI 1 "memory_operand" "%0")
                 (match_operand:HI 2 "general_operand" "aim")))]
  ""
{
  output_asm_insn("//Start of addhi3_memtarget %0 = %1 + %2", operands);
  snap_do_basic_math_op_hi(operands, MATH_OP_PLUS);
  output_asm_insn("//End of addhi3_memtarget", operands);
  return("");
}
)

I compile a simple program with this:

void addit()
{
  int a, b, c;

  a = -10;
  b = 2;
  c = a + b;
}

And the compiler fails out with the following message:

addit.c: In function 'addit':
addit.c:12:1: internal compiler error: in find_reloads, at reload.c:4085
 }
 ^
0x8f5953 find_reloads(rtx_insn*, int, int, int, short*)
        ../../gcc-6.2.0/gcc/reload.c:4085
0x90327b calculate_needs_all_insns
        ../../gcc-6.2.0/gcc/reload1.c:1484
0x90327b reload(rtx_insn*, int)
        ../../gcc-6.2.0/gcc/reload1.c:995
0x7e8f11 do_reload
        ../../gcc-6.2.0/gcc/ira.c:5437
0x7e8f11 execute
        ../../gcc-6.2.0/gcc/ira.c:5609

It would seem that the constraints are somehow not right, but I am not
familiar with the particular way the compiler does this step. Any insights
or pointers?
Thanks,

Steve S

On Tuesday, January 17, 2017 12:45 PM, Nathan Sidwell wrote:

On 01/17/2017 12:19 PM, Steve Silva via gcc wrote:
> Hi All,
>
>
> I am porting gcc for an internal processor and I am having some issues with
> math instructions. Our processor uses two operands for math instructions
> which are usually of the form OP0 = OP0 + OP1. The RTL pattern (for addm3)
> in gcc uses the form OP0 = OP1 + OP2. I understand that gcc supposedly
> supports the two operand flavor, but I have not been able to convince it to
> do that for me. I tried the following RTL pattern with no success:

> So I used the three operand form and fixed things up in the code:

That's nearly right. Use register constraints with the 3 op pattern:

(define_insn "addhi3"
  [(set (match_operand:HI 0 "register_operand" "+a")
        (plus:HI (match_operand:HI 1 "register_operand" "0")
                 (match_operand:HI 2 "general_operand" "aim")))]

The sh port may be instructive, IIRC it has a bunch of 2-op insns.

nathan

-- 
Nathan Sidwell
Re: Help math RTL patterns...
On 01/17/2017 03:41 PM, Steve Silva wrote:
> Hi Nathan,
>
> Thanks for your advice. I retooled the addhi3 sequence to look like this:

The md.texi file seems to have exactly the example you need:

Here for example, is how the 68000 halfword-add instruction is defined:

@smallexample
(define_insn "addhi3"
  [(set (match_operand:HI 0 "general_operand" "=m,r")
        (plus:HI (match_operand:HI 1 "general_operand" "%0,0")
                 (match_operand:HI 2 "general_operand" "di,g")))]
  @dots{})

no need for an expander and 2 insn patterns.

-- 
Nathan Sidwell
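As a rough mental model of what the matching constraint "0" on operand 1
asks for: operand 1 has to end up in the same location as operand 0, so a
three-address add c = a + b is carried out as a copy followed by the
target's two-operand add. The C sketch below only models that data flow;
the names add2 and add3_via_add2 are made up, and short stands in for HI
mode on the assumption that it is 16 bits wide.

```c
/* Models the target's only add form: OP0 = OP0 + OP1. */
static void add2(short *op0, short op1)
{
    *op0 = (short)(*op0 + op1);
}

/* Models c = a + b on such a target: the first source is copied into the
   destination's location, then the two-operand add is applied to it. */
static short add3_via_add2(short a, short b)
{
    short c = a;   /* copy so that operand 1 shares operand 0's location */
    add2(&c, b);   /* c = c + b */
    return c;
}
```

With the values from the addit() test program, add3_via_add2(-10, 2)
yields -8, the same as a + b.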
gcc-5-20170117 is now available
Snapshot gcc-5-20170117 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20170117/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch revision 244556

You'll find:

 gcc-5-20170117.tar.bz2               Complete GCC

  MD5=3735999e61a556643bf6801e6496a8ae
  SHA1=d7a04525d7193b96abce128687481db29c642fc6

Diffs from 5-20170110 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.