Re: Pegasos OHCI bug (was Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55)

2010-10-27 Thread Benjamin Herrenschmidt
> Since then, the silence has been deafening. > > My assumption now is that this is not ever getting fixed. I'm certainly not > able to fix it. I'm not even a kernel programmer! I got far enough to > diagnose the cause just with the "add more printk's and boot it again" > technique. Hundreds of r

Re: Pegasos OHCI bug (was Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55)

2010-10-27 Thread Olaf Hering
On Wed, Oct 27, pac...@kosh.dhis.org wrote: > |1. How do I locate all usb nodes in the device tree? > | > |2. How do I know if a particular usb node is OHCI? In the installed system, run 'lspci | grep -i usb'; this gives the PCI bus numbers. Then run 'find /sys -name devspec', and look for the bu
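As a rough illustration of the sysfs lookup suggested above, here is a minimal sketch that walks a sysfs tree for device directories carrying both a `devspec` file (the Open Firmware device-tree path) and a PCI `class` file, and keeps those whose class code is 0x0c0310 (serial bus, USB, OHCI programming interface). The function name and the default root are hypothetical, and the sketch assumes the `class` file holds a hex string such as `0x0c0310`.

```python
import os

# PCI class code: base class 0x0c (serial bus), subclass 0x03 (USB),
# programming interface 0x10 (OHCI).
OHCI_PCI_CLASS = 0x0C0310


def find_ohci_devspecs(sysfs_root="/sys/devices"):
    """Return sorted (sysfs_dir, device_tree_path) pairs for OHCI controllers.

    Hypothetical helper: only directories that contain both a 'devspec'
    and a 'class' file are considered.
    """
    results = []
    for dirpath, _dirnames, filenames in os.walk(sysfs_root):
        if "devspec" not in filenames or "class" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "class")) as f:
                pci_class = int(f.read().strip(), 16)
            with open(os.path.join(dirpath, "devspec")) as f:
                devspec = f.read().strip()
        except (OSError, ValueError):
            # Unreadable or non-numeric attribute: skip this device.
            continue
        if pci_class == OHCI_PCI_CLASS:
            results.append((dirpath, devspec))
    return sorted(results)
```

Calling `find_ohci_devspecs()` on a running system would list candidate nodes; whether the resulting device-tree paths are usable from prom_init is a separate question that the thread itself leaves open.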

Pegasos OHCI bug (was Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55)

2010-10-27 Thread pacman
Benjamin Herrenschmidt writes: > > Ok so you'll have to make up a "workaround" in prom_init that looks for > OHCI's in the device-tree and disable them. > > Check if the OHCI node has some existing f-code words you can use for > that with "dev /path-to-ohci words" in OF for example. If not, you m

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-22 Thread pacman
Benjamin Herrenschmidt writes: > > On Wed, 2010-10-20 at 13:33 -0500, pac...@kosh.dhis.org wrote: > > > Just try :-) "quiesce" is something that afaik only apple ever > > > implemented anyways. It uses hooks inside their OF to shut down all > > > drivers that do bus master (among other HW sanitiza

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-20 Thread Benjamin Herrenschmidt
On Wed, 2010-10-20 at 13:33 -0500, pac...@kosh.dhis.org wrote: > > Just try :-) "quiesce" is something that afaik only apple ever > > implemented anyways. It uses hooks inside their OF to shut down all > > drivers that do bus master (among other HW sanitization tasks). > > I booted a version with

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-20 Thread pacman
Benjamin Herrenschmidt writes: > > On Tue, 2010-10-19 at 22:23 -0500, pac...@kosh.dhis.org wrote: > > The diff fragment above applied inside prom_close_stdin, but there are > > some > > prom_printf calls after prom_close_stdin. Calling prom_printf after > > closing > > stdout sounds like it could

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-20 Thread Benjamin Herrenschmidt
On Tue, 2010-10-19 at 22:23 -0500, pac...@kosh.dhis.org wrote: > The diff fragment above applied inside prom_close_stdin, but there are > some > prom_printf calls after prom_close_stdin. Calling prom_printf after > closing > stdout sounds like it could be bad. If I moved it down below all the > pro

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread pacman
Benjamin Herrenschmidt writes: > > On Tue, 2010-10-19 at 22:47 +0200, Segher Boessenkool wrote: > > > > It looks like it is the frame counter in an USB OHCI HCCA. > > 16-bit, 1kHz update, offset x'80 in a page. > > > > So either the kernel forgot to call quiesce on it, or the firmware > > doesn'

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Benjamin Herrenschmidt
On Tue, 2010-10-19 at 22:47 +0200, Segher Boessenkool wrote: > > It looks like it is the frame counter in an USB OHCI HCCA. > 16-bit, 1kHz update, offset x'80 in a page. > > So either the kernel forgot to call quiesce on it, or the firmware > doesn't implement that, or the firmware messed up some
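The counter described above fits the frame-number field of the OHCI Host Controller Communications Area. The following is a minimal sketch, assuming the usual OHCI layout (a 16-bit little-endian frame number at offset 0x80, followed by a 16-bit pad that the controller writes as zero), of how a big-endian CPU such as the PowerPC would see that 32-bit word. The function names are hypothetical.

```python
import struct

# Per the thread: the counter sits at offset 0x80 within a page.
HCCA_FRAME_NUMBER_OFFSET = 0x80


def decode_hcca_word(word_as_read_by_be_cpu):
    """Split a 32-bit word, as a big-endian CPU reads it, into
    (frame_number, pad), both interpreted as little-endian 16-bit fields."""
    raw = struct.pack(">I", word_as_read_by_be_cpu)  # bytes in memory order
    frame_number, pad = struct.unpack("<HH", raw)
    return frame_number, pad


def wrap_period_seconds(rate_hz=1000):
    """Time for a 16-bit counter ticking at rate_hz to wrap around."""
    return 65536 / rate_hz
```

Under these assumptions, a word read as 0x34120000 decodes to frame number 0x1234 with a zero pad, and at 1 kHz the counter wraps roughly every 65.5 seconds.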

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Benjamin Herrenschmidt
On Tue, 2010-10-19 at 13:10 -0500, pac...@kosh.dhis.org wrote: > > So what type of driver, firmware, or hardware bug puts a 16-bit 1000Hz > timer > in memory, and does it in little-endian instead of the CPU's native > byte > order? And why does it stop doing it some time during the early init > sc

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Segher Boessenkool
> I made a new discovery. And this nails it :-) > So then I ran > dd if=/dev/mem bs=4 count=1 skip=$((0xfc5c080/4)) | od -t x4 > a few times very fast, plucking the first affected word directly out of > memory by its physical address. The result: > > The low 16 bits are always zero as before. T
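The address arithmetic behind the dd command quoted above can be sketched as follows. This assumes 4 KiB pages, and the helper names are hypothetical. With `bs=4`, dd's `skip` counts 4-byte blocks, so the physical address is divided by 4, and the offset of that address within its page comes out to 0x80.

```python
def dd_skip_for(phys_addr, block_size=4):
    """Number of blocks dd must skip to land on phys_addr."""
    if phys_addr % block_size:
        raise ValueError("address is not aligned to the block size")
    return phys_addr // block_size


def page_offset(phys_addr, page_size=4096):
    """Offset of phys_addr within its page (page_size is an assumption)."""
    return phys_addr % page_size


addr = 0xFC5C080  # physical address used in the quoted dd command
skip = dd_skip_for(addr)
offset = page_offset(addr)
```

Here `offset` is 0x80, which lines up with the "offset x'80 in a page" observation made elsewhere in the thread.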

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread pacman
Benjamin Herrenschmidt writes: > > > > I thought of that, but as far as I can tell, this CPU doesn't have DABR. > > AFAIK, the 7447 is just a derivative of the 7450 design which -does- > have a DABR ... Unless it's broken :-) Hmm. gdb resorts to single-stepping when I set a watchpoint while debu

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Helmut Grohne
On Mon, Oct 18, 2010 at 11:55:44PM +0200, Thomas Gleixner wrote: > I might be completely one off as usual, but this thing reminds me of a > bug I stared at yesterday night: This problem is completely unrelated. My problem was caused by using binutils-gold. Helmut

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Thomas Gleixner
On Tue, 19 Oct 2010, Helmut Grohne wrote: > On Mon, Oct 18, 2010 at 11:55:44PM +0200, Thomas Gleixner wrote: > > I might be completely one off as usual, but this thing reminds me of a > > bug I stared at yesterday night: > > This problem is completely unrelated. My problem was caused by using > b

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Benjamin Herrenschmidt
> > From there, you might be able to close onto the culprit a bit more, for > > example, try using the DABR register to set data access breakpoints > > shortly before the corruption spot. AFAIK, on those old 32-bit CPUs, you > > can set whether you want it to break on a real or a virtual address.

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Thomas Gleixner
On Mon, 18 Oct 2010, Andrew Morton wrote: > On Mon, 18 Oct 2010 12:33:31 +0100 > Mel Gorman wrote: > > > A bit but I still don't know why it would cause corruption. Maybe this is > > still > > a caching issue but the difference in timing between list_add and > > list_add_tail > > is enough to

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread pacman
Benjamin Herrenschmidt writes: > > You can do something fun... like a timer interrupt that peeks at those > physical addresses from the linear mapping for example, and try to find > out "when" they get set to the wrong value (you should observe the load > from disk, then the corruption, unless the

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Benjamin Herrenschmidt
On Mon, 2010-10-18 at 14:10 -0500, pac...@kosh.dhis.org wrote: > I've been flailing around quite a bit. Here's my latest result: > > Since I can view the corruption with md5sum /sbin/e2fsck, I know it's in a > clean cached page. So I made an extra copy of /sbin/e2fsck, which won't be > loaded int

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Benjamin Herrenschmidt
On Mon, 2010-10-18 at 12:37 -0700, Andrew Morton wrote: > Well, you've spotted a bug so I'd say we fix it asap. > > It's a bit of a shame that we lose the only known way of reproducing a > different bug, but presumably that will come back and bite someone > else > one day, and we'll fix it then :(

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Benjamin Herrenschmidt
On Wed, 2010-10-13 at 15:40 +0100, Mel Gorman wrote: > > This is somewhat contrived but I can see how it might happen even on one > CPU particularly if the L1 cache is virtual and is loose about checking > physical tags. > > > How sensitive/vulnerable is PPC32 to such things? > > > > I can not

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Andrew Morton
On Mon, 18 Oct 2010 12:33:31 +0100 Mel Gorman wrote: > A bit but I still don't know why it would cause corruption. Maybe this is > still > a caching issue but the difference in timing between list_add and > list_add_tail > is enough to hide the bug. It's also possible there are some registers >

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread pacman
Mel Gorman writes: > > A bit but I still don't know why it would cause corruption. Maybe this is > still > a caching issue but the difference in timing between list_add and > list_add_tail > is enough to hide the bug. It's also possible there are some registers > ioremapped after the memmap arra

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Mel Gorman
On Wed, Oct 13, 2010 at 12:52:05PM -0500, pac...@kosh.dhis.org wrote: > Mel Gorman writes: > > > > On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: > > > > > > It's corruption of user memory, which is unusual. I'd be wondering if > > > there was a pre-existing bug which 6dda9d55bf5

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-13 Thread pacman
Mel Gorman writes: > > On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: > > > > It's corruption of user memory, which is unusual. I'd be wondering if > > there was a pre-existing bug which 6dda9d55bf545013597 has exposed - > > previously the corruption was hitting something harmles

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-13 Thread Mel Gorman
On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: > (cc linuxppc-dev@lists.ozlabs.org) > > On Mon, 11 Oct 2010 15:30:22 +0100 > Mel Gorman wrote: > > > On Sat, Oct 09, 2010 at 04:57:18AM -0500, pac...@kosh.dhis.org wrote: > > > (What a big Cc: list... scripts/get_maintainer.pl made

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-11 Thread Andrew Morton
(cc linuxppc-dev@lists.ozlabs.org) On Mon, 11 Oct 2010 15:30:22 +0100 Mel Gorman wrote: > On Sat, Oct 09, 2010 at 04:57:18AM -0500, pac...@kosh.dhis.org wrote: > > (What a big Cc: list... scripts/get_maintainer.pl made me do it.) > > > > This will be a long story with a weak conclusion, sorry a