Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-24 Thread Rafael J. Wysocki
On Sunday, 24 June 2007 02:45, Eric W. Biederman wrote: > Andrew Morton <[EMAIL PROTECTED]> writes: > > > On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> > > wrote: > > > >> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote: > >> > On Tue, Jun 19, 2007 at 01:49:3

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-24 Thread Rafael J. Wysocki
On Sunday, 24 June 2007 02:28, Siddha, Suresh B wrote: > On Sun, Jun 24, 2007 at 01:54:52AM +0200, Rafael J. Wysocki wrote: > > This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC > > nx6325). > > > > _cpu_down() just hangs as though there were a deadlock in there, 100% of the >

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Siddha, Suresh B
On Sat, Jun 23, 2007 at 06:45:05PM -0600, Eric W. Biederman wrote: > > Hmm. It looks like Siddha sent the wrong version of the patch. > The working tested version had an additional test to ensure > the mask and unmask methods were implemented. > > i.e. > + if (irq_desc[irq].chip->mas

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Eric W. Biederman
Andrew Morton <[EMAIL PROTECTED]> writes: > On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> > wrote: > >> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote: >> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote: >> > > >> > > This fixes the proble

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Siddha, Suresh B
On Sun, Jun 24, 2007 at 01:54:52AM +0200, Rafael J. Wysocki wrote: > This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325). > > _cpu_down() just hangs as though there were a deadlock in there, 100% of the > time. Does the patch at this URL work for you? http://marc.info/?

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Andrew Morton
On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote: > On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote: > > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote: > > > > > > This fixes the problem! Hurrah! > > > > Great! Andrew, please include

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Rafael J. Wysocki
On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote: > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote: > > > > This fixes the problem! Hurrah! > > Great! Andrew, please include the appended patch in -mm. > > > Subject: [patch] x86_64, irq: use mask/unmask and proper

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote: > > This fixes the problem! Hurrah! Great! Andrew, please include the appended patch in -mm. Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs From: Suresh Siddha <[EMAIL PROTECTED]> Force irq m

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Darrick J. Wong
On Tue, Jun 19, 2007 at 12:59:27PM -0700, Siddha, Suresh B wrote: > hmm.. Please try this instead. This is intended only for debug. Based on your > test results, we can come up with a more decent fix. This fixes the problem! Hurrah! --D

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 12:06:37PM -0700, Darrick J. Wong wrote: > On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote: > > Anyhow, Darrick there is a general bug in this area, can you try this and > > see if it helps? > > Er... that instantly locked up the system. hmm.. Please try t

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Darrick J. Wong
On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote: > Anyhow, Darrick there is a general bug in this area, can you try this and > see if it helps? Er... that instantly locked up the system. --D

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes: > On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote: >> "Darrick J. Wong" <[EMAIL PROTECTED]> writes: >> >> > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote: >> > >> >> > >> >> > [ 256.298787] irq=4341 affinity=d >

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote: > "Darrick J. Wong" <[EMAIL PROTECTED]> writes: > > > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote: > > > >> > > >> > [ 256.298787] irq=4341 affinity=d > >> > > >> > >> And just to make sure, at this point,

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes: > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote: > >> > >> > [ 256.298787] irq=4341 affinity=d >> > >> >> And just to make sure, at this point, your MSI irq 4341 affinity >> (/proc/irq/4341/smp_affinity) still points to '2'? > >

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-18 Thread Darrick J. Wong
On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote: > > > > [ 256.298787] irq=4341 affinity=d > > > > And just to make sure, at this point, your MSI irq 4341 affinity > (/proc/irq/4341/smp_affinity) still points to '2'? Actually, it's 0xD. From the kernel's perspective the mask
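
The affinity values traded in this exchange ('2' versus 0xD) are hexadecimal CPU bitmasks, where bit N stands for CPU N. As a minimal sketch for reading them, the helper below decodes a single-word mask into a list of CPU numbers. The function name `mask_to_cpus` is made up for illustration, and the sketch assumes a plain hex value; it does not handle the comma-separated 32-bit groups that smp_affinity uses on machines with many CPUs.

```shell
# Decode a hexadecimal affinity mask into the CPU numbers it selects.
# mask_to_cpus is a hypothetical helper, not something from the thread.
mask_to_cpus() {
    # Accept the value with or without a leading "0x".
    mask=$((0x${1#0x}))
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        # Bit N set means CPU N is allowed to service the interrupt.
        if [ $((mask & 1)) -eq 1 ]; then
            out="$out${out:+ }$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
}

mask_to_cpus 2      # prints "1"
mask_to_cpus 0xD    # prints "0 2 3"
```

Read this way, a mask of '2' names only CPU 1, while 0xD names CPUs 0, 2 and 3, that is, every CPU except CPU 1 on a four-CPU system.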

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-18 Thread Siddha, Suresh B
On Mon, Jun 18, 2007 at 03:38:20PM -0700, Darrick J. Wong wrote: > On Thu, Jun 07, 2007 at 05:57:26PM -0700, Siddha, Suresh B wrote: > > > As you have the failing system, you need to do more detective work and > > help me out. Can you try this debug patch and send across the dmesg after > > the >

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-18 Thread Darrick J. Wong
On Thu, Jun 07, 2007 at 05:57:26PM -0700, Siddha, Suresh B wrote: > As you have the failing system, you need to do more detective work and > help me out. Can you try this debug patch and send across the dmesg after the > bug happens and also can you try different compiler to see if something > cha

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-07 Thread Siddha, Suresh B
On Wed, Jun 06, 2007 at 04:16:42PM -0700, Darrick J. Wong wrote: > On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote: > > > Weird. Then the bug can only happen if for some reason, "mask = map" > > didn't happen in fixup_irqs(). Can you send us the disassembly of the > > fixup_irqs()

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-06 Thread Darrick J. Wong
On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote: > Weird. Then the bug can only happen if for some reason, "mask = map" > didn't happen in fixup_irqs(). Can you send us the disassembly of the > fixup_irqs()? Attached. --D (gdb) disassemble fixup_irqs Dump of assembler code for f

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-06 Thread Siddha, Suresh B
On Wed, Jun 06, 2007 at 11:58:29AM -0700, Darrick J. Wong wrote: > On Tue, Jun 05, 2007 at 06:37:59PM -0700, Siddha, Suresh B wrote: > > On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote: > > > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote: > > > > > > > Can you s

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-06 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 06:37:59PM -0700, Siddha, Suresh B wrote: > On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote: > > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote: > > > > > Can you send us your system's dmesg as well as output of /proc/interrupts? > > > > h

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote: > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote: > > > Can you send us your system's dmesg as well as output of /proc/interrupts? > > http://sweaglesw.net/~djwong/docs/dmesg > http://sweaglesw.net/~djwong/docs/int

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote: > Can you send us your system's dmesg as well as output of /proc/interrupts? http://sweaglesw.net/~djwong/docs/dmesg http://sweaglesw.net/~djwong/docs/interrupts --D

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 01:09:54PM -0700, Darrick J. Wong wrote: > On Tue, Jun 05, 2007 at 11:40:15AM -0700, Siddha, Suresh B wrote: > > > Does this problem happen only under certain stress or something simple, like > > > > boot the kernel > > echo 2 > /proc/irq/114/smp_affinity > > wait for irq

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 11:40:15AM -0700, Siddha, Suresh B wrote: > Does this problem happen only under certain stress or something simple, like > > boot the kernel > echo 2 > /proc/irq/114/smp_affinity > wait for irq to hit the cpu1. > echo 0 > /sys/devices/system/cpu/cpu1/online > > will immmd
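
The reproduction steps quoted in this message can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `offline_after_pinning`, the `ROOT` prefix and the `WAIT_SECS` delay are invented here, and a fixed sleep merely stands in for "wait for irq to hit the cpu1" (checking /proc/interrupts by hand is more reliable). With `ROOT` unset it writes to the live /proc and /sys trees, which requires root and really does take the CPU offline.

```shell
# Sketch of the quoted steps: pin an IRQ to one CPU, wait, offline that CPU.
# ROOT and WAIT_SECS are hypothetical knobs, not part of the thread; ROOT
# lets the function target a scratch directory instead of the real system.
offline_after_pinning() {
    irq=$1
    cpu=$2
    root=${ROOT:-}
    # smp_affinity takes a hex bitmask; bit N selects CPU N.
    mask=$(printf '%x' $((1 << cpu)))
    # Step 1: route the IRQ to the chosen CPU only.
    echo "$mask" > "$root/proc/irq/$irq/smp_affinity" || return 1
    # Step 2: give interrupts time to start arriving on that CPU
    # (a fixed delay is an assumption, not a guarantee).
    sleep "${WAIT_SECS:-5}"
    # Step 3: take the CPU offline.
    echo 0 > "$root/sys/devices/system/cpu/cpu$cpu/online"
}
```

For the case in the message, `offline_after_pinning 114 1` writes 2 to /proc/irq/114/smp_affinity and then 0 to /sys/devices/system/cpu/cpu1/online.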

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 11:33:01AM -0700, Darrick J. Wong wrote: > On Tue, Jun 05, 2007 at 11:13:42AM -0700, Siddha, Suresh B wrote: > > I see. Your system should have 4 or 8 logical cpu's right. So you must be > > using logical flat mode, right? > > I believe so. The system has two Xeon 5150s wi

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 11:13:42AM -0700, Siddha, Suresh B wrote: > I see. Your system should have 4 or 8 logical cpu's right. So you must be > using logical flat mode, right? I believe so. The system has two Xeon 5150s with an Intel 5000 chipset of some sort. > When this bug happens, what does

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 10:36:47AM -0700, Darrick J. Wong wrote: > On Tue, Jun 05, 2007 at 10:23:10AM -0700, Siddha, Suresh B wrote: > > > Darrick, I see a kernel bug in this area(which is already filled with bugs, > > and I am looking into ways to fix them). Are you making sure that > > between s

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 10:23:10AM -0700, Siddha, Suresh B wrote: > Darrick, I see a kernel bug in this area(which is already filled with bugs, > and I am looking into ways to fix them). Are you making sure that > between step-1 and step-2, that interrupts actually started arriving at cpu1? > > i

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Thu, May 31, 2007 at 05:44:27PM -0700, Darrick J. Wong wrote: > Hi there, > > I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid > about offlining CPUs. I suspect that this problem extends beyond a > particular machine, as I've been able to replicate it with an IBM x3650 > an

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-03 Thread Emmanuel Fusté
> > This is just getting confusing. > > Emmanuel Fusté. Please play with /proc/irq/*/smp_affinity by hand and > confirm that you can move your irqs. This will confirm it is the decision > part. > Ok, as planned, you're right ;-) , playing with /proc/irq/*/smp_affinity let me move irqs. Emmanuel

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes: > On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote: > >> I doubt it. The practical problem is that cpu_down does not >> and by design can not call the irq balancing part properly >> and I haven't yet seen anything to suggest that we d

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Darrick J. Wong
On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote: > I doubt it. The practical problem is that cpu_down does not > and by design can not call the irq balancing part properly > and I haven't yet seen anything to suggest that we don't migrate > irq properly. > > So I'm guessing it

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman
> As a side note, on my very old SMP machine, 2.6.20 correctly > load-balance IRQs across CPU but 2.6.21 not. I know that > in-kernel IRQ load balancer is marked as deprecated and > somewhat broken, but with your report it make me think it > could be a bug in the IRQ rerouting part in my case too

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Emmanuel Fusté
> There exists a similar scenario. Set the IRQ affinity to a bunch of > CPUs, watch /proc/interrupts to see which CPU is actually servicing the > interrupts, then offline that CPU. The kernel does not reroute the IRQ > to any of the other CPUs and the device also hangs. > > The furthest that I've

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes: > Hi there, > > I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid > about offlining CPUs. I suspect that this problem extends beyond a > particular machine, as I've been able to replicate it with an IBM x3650 > and an IBM x3755. Th

Device hang when offlining a CPU due to IRQ misrouting

2007-05-31 Thread Darrick J. Wong
Hi there, I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid about offlining CPUs. I suspect that this problem extends beyond a particular machine, as I've been able to replicate it with an IBM x3650 and an IBM x3755. This is what I'm doing: 1) I tie an IRQ to a particular CPU