On 07/28/2014 04:41, Matthew Fortune wrote:
> Hi Joshua,
>
> I know very little about this area but I'll try and offer some advice
> anyway...
>
You know more than I do :)

>> On 07/05/2014 23:43, Joshua Kinard wrote:
>>> Hi,
>>>
>>> I filed PR61538 about two weeks ago, regarding gcc-4.8.x and up not
>>> compiling a g++/pthreads-linked app correctly on SGI R1x000-based systems
>>> (Octane, Onyx2), running Linux.  Running the subsequently-compiled
>>> application simply hangs in a futex syscall until terminated via Ctrl+C.
>>> I suspect it's a double-locking bug of some design, as evidenced by strace
>>> showing two consecutive syscall()'s w/ 0x108e passed as the syscall #
>>> (4238 or futex on o32 MIPS), but I am stumped as to what else I can do to
>>> debug it and help fix it.
>>>
>> [snip]
>>> Full details:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538
>>
>> So I've spent the last few weeks bisecting the gcc tree, and I've narrowed
>> down the set of commits that appear to have introduced this problem:
>>
>> 1. 39a8c5eaded1e5771a941c56a49ca0a5e9c5eca0 * config/mips/mips.c
>> (mips_emit_pre_atomic_barrier_p,)
>
> This is the prime candidate for introducing the issue.

This is my guess, too.  However, it appears to tie in w/ the fourth commit
because the new mips_emit_{pre,post}_atomic_barrier_p functions added in
commit 39a8c5ea are removed by commit 30c3c442 a mere ~7 minutes later (which
I find really odd).  Commit 974f0a74 is really the only one that seems
innocent, but I suspect the other three are linked.  If mkuvyrkov is still
around, perhaps he could explain better?

>> 2. 974f0a74e2116143b88d8cea8e1dd5a9c18ef96c * config/mips/constraints.md
>> (ZR): New constraint.
>
> Unlikely
>
>> 3. 0f8e46b16a53c02d7255dcd6b6e9b5bc7f8ec953 * config/mips/mips.c
>> (mips_process_sync_loop): Emit cmp result only if
>
> Possible but unlikely still
>
>> 4. 30c3c4427521f96fb58b6e1debb86da4f113f06f * emit-rtl.c
>> (need_atomic_barrier_p): New function.
>
> Seems unlikely
>
>>
>> There's a build failure somewhere in the middle of there that is blocking me
>> from figuring out which specific one is the cause, but they all appear to be
>> related anyways.  All four were added on 2012-06-20.
>>
>> When I took a git checkout from 2012-06-26 and reverted those four commits,
>> I was able to compile glibc-2.19 and get a working "sln" binary.  I am
>> unable to easily test the C++ side because I built the checkouts in my
>> $HOME, and it's too risky to try and shoehorn one of them in as the system
>> compiler.  However, I think the C++ issue is also fixed by reverting the
>> four, as that also involved hanging in Linux futex syscalls.
>
> Here is a wild guess at the problem... I think the workaround for R10000 to
> use branch likely instead of delay slot branches is ending up annulling
> an instruction that is required for certain atomic operations. This is an
> entirely untested theory (and patch) but can you see if this fixes the issue
> you are seeing:

Well, the branch-likely thing really only affects a specific revision of the
R10000 processors.  Later R10000 revisions (3.1+?) and R12000-R16000 shouldn't
be affected.  I've been playing with disabling that specific workaround on my
Octane's kernel and haven't seen any ill effects yet.  Though, I haven't tried
rebuilding the userland w/ -mno-fix-r10000 just yet.

If you want, you can take a look at some of the additional info in the
corresponding Gentoo bug that tracks PR61538:
https://bugs.gentoo.org/show_bug.cgi?id=516548

I have a gdb run (comment #5) of the several instructions in
__lll_lock_wait_private, including register values, as each instruction
executes.  The hang happens after taking the futex syscall, t0-t3 get set to
0x0, and the following "ll v0,0(s0)" is what hangs.  In gcc-4.7 and earlier,
that 'll' is actually "li v0,2", though control never passes into
__lll_lock_wait_private in the first place.
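For anyone not familiar with that code path, here is a minimal sketch in C of the kind of three-state futex lock involved, assuming the conventional scheme (0 = unlocked, 1 = locked, 2 = locked with possible waiters).  This is not glibc's actual source, and the sketch_* names are made up for illustration.  It is Linux-only.  On MIPS, the atomic exchange in the slow path would normally compile to an ll/sc retry loop, which is where an 'll' right after the futex syscall would come from.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical lock type: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters. */
typedef struct { atomic_int state; } sketch_lock;

/* Thin wrapper around the raw futex syscall (timeout and uaddr2 unused). */
static long sketch_futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock_acquire(sketch_lock *l)
{
    int expected = 0;

    /* Fast path: uncontended 0 -> 1, no syscall needed. */
    if (atomic_compare_exchange_strong(&l->state, &expected, 1))
        return;

    /* Slow path (roughly the role of a "lock wait" helper): mark the lock
       as contended, then sleep in the kernel for as long as it stays 2. */
    while (atomic_exchange(&l->state, 2) != 0)
        sketch_futex(&l->state, FUTEX_WAIT_PRIVATE, 2);
}

static void sketch_lock_release(sketch_lock *l)
{
    /* Only pay for a wake syscall if someone may be sleeping. */
    if (atomic_exchange(&l->state, 0) == 2)
        sketch_futex(&l->state, FUTEX_WAKE_PRIVATE, 1);
}
```

If the exchange after the wait never completes, or never sees the lock go back to 0, a caller of something shaped like sketch_lock_acquire() would sit in that loop forever, which at least looks like what strace shows.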
There's also a PNG attached to that bug of the disassembled asm in WinMerge
that shows what insns actually changed.  Someone who understands MIPS asm
ordering might be able to make something of that.

> @@ -13014,7 +13023,8 @@ mips_process_sync_loop (rtx insn, rtx *operands)
>        mips_multi_copy_insn (tmp3_insn);
>        mips_multi_set_operand (mips_multi_last_index (), 0, newval);
>      }
> -  else if (!(required_oldval && cmp))
> +  else if (!(required_oldval && cmp)
> +          || mips_branch_likely)
>      mips_multi_add_insn ("nop", NULL);
>
>    /* CMP = 1 -- either standalone or in a delay slot.  */
>
> I suspect I can weave that in more naturally but can you tell me if that
> fixes the problem first.

Testing a fix takes about 7.5hrs to rebuild, plus another 3.5 to rebuild
glibc.  So I am a bit hesitant to task the machine to do that w/o having a
better idea if that solves it or not.  Technically, shouldn't passing
-mno-fix-r10000 have a similar effect by causing branch-likely insns to not
get emitted at all?

Thanks!,

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic