On Tue, 2 Jan 2007, Alistair John Strachan wrote: > > eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000 > esi: f70f3e9c edi: f7017c00 ebp: f70f3c1c esp: f70f3c0c > > Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8 > 8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f > 45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00 > EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:f70f3c0c > > Chuck observed that the kernel tries to reenter pipe_poll half way through an > instruction (c0156f5f->c0156f60); it's not a single-bit error but an > off-by-one.
It's not an off-by-one either (eg say we're taking an exception and screiwing up %eip by one somehow). The code sequence in question is mov %ecx,%edx mov 0x6c(%esi),%eax or $0x10,%edx cmp 0x168(%edi),%eax <-- cmovne %edx,%ecx jmp ... and it's in the second byte of the "cmp". And yes, it definitely entered there, because trying other random entry-points will have either invalid instructions or instructions that would fault due to NULL pointers. HOWEVER, it's also not as simple as "took an interrupt, and returned with %eip incremented by one", becasue your %edx is zero, so it won't have done that "or $10,%edx" and then some interrupt happened and screwed up just %eip. So it's literally a random %eip, but since you say it's consistently in that function, it's not truly "random". There's something that triggers it just _there_. However, that's a damn simple function. There's _nothing_ there. The particular code that is involved right there is literally if (!pipe->writers && filp->f_version != pipe->w_counter) mask |= POLLHUP; and that's it. There's not even anything half-way interesting around it, except for the "poll_wait()" call, but even that is about as common as you can humanly get.. Looking at the register set and the stack, I see: Stack: 00000000 00000000 <- saved %ebx (dunno, seems dead in caller) f70f3e9c <- saved %esi (== pollfd in do_pollfd) f6e111c0 <- saved %edi (== filp) f70f3fa4 <- outer EBP (looks reasonable) c015d7f3 <- return address (do_sys_poll+0x253/0x480) and the strange thing is that when the oops happens, it really looks like %esi _still_ contains the value it had originally (and that is saved on the stack). But afaik, from your disassembly, it should have been overwritten by the initial %eax, which should have had the same value as %edi on entry... IOW, none of it really makes any sense. The stack frames look fine, so we _did_ enter at the beginning of the function (and it wasn't the *poll fn pointer that was corrupt. > The suggestions I've had so far which I have not yet tried: > > - Select a different x86 CPU in the config. > - Unfortunately the C3-2 flags seem to simply tell GCC > to schedule for ppro (like i686) and enabled MMX and SSE > - Probably useless Actually, try this one. Try using something that doesn't like "cmov". Maybe the C3-2 simply has some internal cmov bugginess. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/