Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Peter Zijlstra
On Wed, 2011-07-20 at 09:04 -0700, Linus Torvalds wrote: > On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra > wrote: > > > > Right, so we can either merge my scary patches now and have 3.0 boot on > > 16+ node machines (and risk breaking something), or delay them until > > 3.0.1 and have 16+ node

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Ingo Molnar
* Linus Torvalds wrote: > On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra > wrote: > > > > Right, so we can either merge my scary patches now and have 3.0 > > boot on 16+ node machines (and risk breaking something), or delay > > them until 3.0.1 and have 16+ node machines suffer a little. >

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Linus Torvalds
On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra wrote: > > Right, so we can either merge my scary patches now and have 3.0 boot on > 16+ node machines (and risk breaking something), or delay them until > 3.0.1 and have 16+ node machines suffer a little. So how much impact does your scary patch ha

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Peter Zijlstra
On Wed, 2011-07-20 at 07:40 -0700, Linus Torvalds wrote: > On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard wrote: > > > >> So with that fix the patch makes the machine happy again? > > > > Yes, the machine looks fine with the patches applied. Thanks! > > Ok, so what's the situation for 3.0 (I'm

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Linus Torvalds
On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard wrote: > >> So with that fix the patch makes the machine happy again? > > Yes, the machine looks fine with the patches applied. Thanks! Ok, so what's the situation for 3.0 (I'm waiting for some RCU resolution now)? Anton's patch may be small, but t

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Anton Blanchard
Hi Peter, > So with that fix the patch makes the machine happy again? Yes, the machine looks fine with the patches applied. Thanks! Anton

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Peter Zijlstra
On Wed, 2011-07-20 at 20:14 +1000, Anton Blanchard wrote: > > That looks very strange indeed.. up to node 23 there is the normal > > symmetric matrix with all the trace elements on 10 (as we would expect > > for local access), and some 4x4 sub-matrix stacked around the trace > > with 20, suggestin

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-20 Thread Anton Blanchard
Hi Peter, > That looks very strange indeed.. up to node 23 there is the normal > symmetric matrix with all the trace elements on 10 (as we would expect > for local access), and some 4x4 sub-matrix stacked around the trace > with 20, suggesting a single hop distance, and the rest on 40 being > out

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-19 Thread Anton Blanchard
Hi, > That looks very strange indeed.. up to node 23 there is the normal > symmetric matrix with all the trace elements on 10 (as we would expect > for local access), and some 4x4 sub-matrix stacked around the trace > with 20, suggesting a single hop distance, and the rest on 40 being > out-there

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-19 Thread Peter Zijlstra
On Tue, 2011-07-19 at 14:44 +1000, Anton Blanchard wrote: > > Our node distances are a bit arbitrary (I make them up based on > information given to us in the device tree). In terms of memory we have > a maximum of three levels. To give some gross estimates, on chip memory > might be 30GB/sec, on

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-18 Thread Anton Blanchard
On Mon, 18 Jul 2011 23:35:56 +0200 Peter Zijlstra wrote: > Anton, could you test the below two patches on that machine? > > It should make things boot again, while I don't have a machine nearly > big enough to trigger any of this, I tested the new code paths by > setting FORCE_SD_OVERLAP in /deb

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-18 Thread Peter Zijlstra
Anton, could you test the below two patches on that machine? It should make things boot again, while I don't have a machine nearly big enough to trigger any of this, I tested the new code paths by setting FORCE_SD_OVERLAP in /debug/sched_features. Although any review of the error paths would be mu

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-15 Thread Peter Zijlstra
On Fri, 2011-07-15 at 10:45 +1000, Anton Blanchard wrote: > Hi, > > > Urgh.. so those spans are generated by sched_domain_node_span(), and > > it looks like that simply picks the 15 nearest nodes to the one we've > > got without consideration for overlap with previously generated spans. > > I do

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-14 Thread Anton Blanchard
Hi, > Urgh.. so those spans are generated by sched_domain_node_span(), and > it looks like that simply picks the 15 nearest nodes to the one we've > got without consideration for overlap with previously generated spans. I do wonder if we need this extra level at all on ppc64. From memory SGI add

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-14 Thread Peter Zijlstra
On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote: > I also printed out the cpu spans as we walk through build_sched_groups: > 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480 > Duplicates start appearing in this span: > 128 160 192 224 256 288 320 352 384 416 448 480 512 544 57
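The "duplicates" in the quoted spans can be checked mechanically. Below is a minimal sketch, not kernel code: span_mask(), spans_partially_overlap() and STRIDE are hypothetical names, and the stride of 32 is an assumption read off the CPU numbers in the snippet. It treats each printed span as a bitmask and reports whether two spans share members without being identical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STRIDE 32u  /* assumption: the listed CPU numbers step by 32 */

/* Build a bitmask with one bit per listed CPU (bit index = cpu / STRIDE).
 * Only CPU numbers below 64 * STRIDE are supported by this sketch. */
static uint64_t span_mask(const unsigned *cpus, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++)
        mask |= UINT64_C(1) << (cpus[i] / STRIDE);
    return mask;
}

/* Two spans partially overlap when they share members but are not equal. */
static bool spans_partially_overlap(uint64_t a, uint64_t b)
{
    return (a & b) != 0 && a != b;
}
```

Feeding in the first span (0 through 480) and the entries of the second span that are visible before the snippet is cut off (128 through 544) makes spans_partially_overlap() return true, since both contain 128 through 480 but neither equals the other.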

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-13 Thread Anton Blanchard
> I took a quick look and we are stuck in update_group_power: > > do { > power += group->cpu_power; > group = group->next; > } while (group != child->groups); > > I looked at the linked list: > > child->groups = c07b2f74ff00 > > and dumping g
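The quoted do/while loop only terminates if following ->next eventually leads back to child->groups. A minimal sketch of that failure mode follows, assuming (the snippet is cut off before it says so) that the ring no longer closes on its starting group. struct group and sum_group_power() are hypothetical stand-ins, not the kernel's definitions; the walk is the same as in the quoted loop but bounded, so a ring that never returns to its head is reported instead of spinning forever.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler's group structure; only the
 * two fields used by the quoted loop are modelled. */
struct group {
    unsigned long cpu_power;
    struct group *next;
};

/*
 * Same walk as the quoted loop, but visiting at most max_steps groups.
 * Returns true and stores the sum in *power if the walk gets back to
 * head; returns false if the ring is broken or never closes on head.
 */
static bool sum_group_power(const struct group *head, size_t max_steps,
                            unsigned long *power)
{
    const struct group *g = head;
    unsigned long sum = 0;
    size_t steps = 0;

    do {
        if (g == NULL || steps++ == max_steps)
            return false;   /* broken or unclosed ring */
        sum += g->cpu_power;
        g = g->next;
    } while (g != head);

    *power = sum;
    return true;
}
```

With a proper ring a -> b -> c -> a the function returns the total power. If c->next is redirected to b, the walk started at a cycles between b and c and never sees a again, which is the shape of hang an unbounded loop would turn into a soft lockup.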

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-13 Thread Anton Blanchard
Hi Peter, > Surely this isn't the first multi-node P7 to boot a kernel with this > patch? If my git foo is any good it hit -next on 23rd of May. > > I guess I'm asking is, do smaller P7 machines boot? And if so, is > there any difference except size? > > How many nodes does the thing have anywa

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-07 Thread Peter Zijlstra
On Thu, 2011-07-07 at 17:25 +0530, Mahesh J Salgaonkar wrote: > > I guess I'm asking is, do smaller P7 machines boot? And if so, is there > > any difference except size? > > Yes, the smaller P7 machine that I have with 20 CPUs and 2GB ram boots > fine with 3.0.0-rc. That sounds like a single nod

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-07 Thread Mahesh J Salgaonkar
On 2011-07-07 12:59:35 Thu, Peter Zijlstra wrote: > On Thu, 2011-07-07 at 15:52 +0530, Mahesh J Salgaonkar wrote: > > > > 2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae - > > "sched: Change NODE sched_domain group creation" as the cause. > > Weird, there's no locking anyw

Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-07 Thread Peter Zijlstra
On Thu, 2011-07-07 at 15:52 +0530, Mahesh J Salgaonkar wrote: > > 2.6.39 booted fine on the system and a git bisect shows commit cd4ea6ae - > "sched: Change NODE sched_domain group creation" as the cause. Weird, there's no locking anywhere around there. The typical problems with this patch-set we

[regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

2011-07-07 Thread Mahesh J Salgaonkar
Hi, linux-3.0-rc fails to boot on a power7 system with 1TB ram and 896 CPUs. While the initial boot log shows a soft-lockup [1], the machine is hung after. Dropping into xmon shows the cpus are all stuck at: cpu 0xa: Vector: 100 (System Reset) at [c00fae51fae0] pc: c0