On Sat, 20 Aug 2011, Steven Hartland wrote:
Are you seeing a double fault panic?
We're seeing both. At least one double (or more) fault finishing with
"Fatal Trap 12: page fault while in kernel mode". Subsequent panics have
been single fault (all visible on the IPMI console) "Fatal Trap 9:
ge
On 08/21/11 05:01, Steven Hartland wrote:
- Original Message - From: "Jamie Gritton"
The problem isn't with the conditional locking of tpr in prison_deref.
That locking is actually correct, and there's no race condition.
Are you sure? I do think that unlocking the mtx half way through
- Original Message -
From: "Jamie Gritton"
In essence I think we can get the following flow where 1# = process1
and 2# = process2
1#1. prison1.pr_uref = 1 (single process jail)
1#2. prison_deref( prison1,...
1#3. prison1.pr_uref-- (prison1.pr_uref = 0)
1#3. prison1.mtx_unlock <-- this no
On 08/20/11 19:19, Steven Hartland wrote:
- Original Message - From: "Andriy Gapon"
on 20/08/2011 23:24 Steven Hartland said the following:
- Original Message - From: "Steven Hartland"
Looking through the code I believe I may have noticed a scenario
which could
trigger the pr
- Original Message -
From: "Andriy Gapon"
on 20/08/2011 23:24 Steven Hartland said the following:
- Original Message - From: "Steven Hartland"
Looking through the code I believe I may have noticed a scenario which could
trigger the problem.
Given the following code:-
static
- Original Message -
From: "Steven Hartland"
Something else you may be more interested in, Andriy:-
I added in debugging options DDB & INVARIANTS to see if I can get more
useful info, and the panic results in a looping panic constantly scrolling up
the console. Not sure if this is a
- Original Message -
From: "Andriy Gapon"
diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
--- sys/kern/kern_jail.c.orig 2011-08-20 21:17:14.856618854 +0100
+++ sys/kern/kern_jail.c 2011-08-20 21:18:35.307201425 +0100
@@ -2455,7 +2455,8 @@
if (--tp
on 20/08/2011 23:24 Steven Hartland said the following:
> - Original Message - From: "Steven Hartland"
>> Looking through the code I believe I may have noticed a scenario which could
>> trigger the problem.
>>
>> Given the following code:-
>>
>> static void
>> prison_deref(struct prison *p
- Original Message -
From: "Steven Hartland"
Looking through the code I believe I may have noticed a scenario which could
trigger the problem.
Given the following code:-
static void
prison_deref(struct prison *pr, int flags)
{
struct prison *ppr, *tpr;
int vfslocked;
if (!(fl
- Original Message -
From: "Andriy Gapon"
thanks for doing this! I'll reiterate my suspicion just in case - I think that
you should look for the cases where you stop a jail, but then re-attach and
resurrect the jail before it's completely dead.
Yeah, that's where I think it's happening
- Original Message -
From: "Roger Marquis"
To: ;
Sent: Saturday, August 20, 2011 7:10 PM
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
Repeat this enough times and prison0.pr_uref reaches zero.
To reach zero even sooner just kill enough of non-jailed
Repeat this enough times and prison0.pr_uref reaches zero.
To reach zero even sooner just kill enough of non-jailed processes.
Interesting. We've been getting kernel panics in -stable but with only
one jail started at boot without being restarted.
Are you using SAS drives by any chance? Setti
on 20/08/2011 18:51 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>
>> BTW, I suspect the following scenario, but I am not able to verify it either
>> via
>> testing or in the code:
>> - last process in a dying jail exits
>> - pr_uref of the jail reaches
- Original Message -
From: "Andriy Gapon"
BTW, I suspect the following scenario, but I am not able to verify it either via
testing or in the code:
- last process in a dying jail exits
- pr_uref of the jail reaches zero
- pr_uref of prison0 gets decremented
- you attach to the jail and
- Original Message -
From: "Andriy Gapon"
BTW, I suspect the following scenario, but I am not able to
verify it either via testing or in the code:
- last process in a dying jail exits
- pr_uref of the jail reaches zero
- pr_uref of prison0 gets decremented
- you attach to the jail and
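The suspected accounting leak can be modelled outside the kernel. The sketch below is a toy, not code from kern_jail.c: `toy_prison`, `toy_uref_drop`, `toy_attach_dying` and `toy_root_uref_after_cycles` are hypothetical names, and the behaviour of the attach step (re-incrementing only the jail's own count) is an assumption, since the scenario above has not been verified in the code. Under that assumption, each stop/resurrect cycle costs the root one reference.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct prison; only the fields needed here. */
struct toy_prison {
    int uref;                  /* user (process) references, like pr_uref */
    struct toy_prison *parent; /* NULL for the root, like prison0 */
};

/*
 * Drop one user reference. When a jail's count reaches zero the drop
 * propagates to its parent, as the suspected scenario describes.
 */
static void toy_uref_drop(struct toy_prison *pr)
{
    while (pr != NULL) {
        assert(pr->uref > 0);
        if (--pr->uref > 0)
            break;
        pr = pr->parent;
    }
}

/*
 * ASSUMPTION (not confirmed in the thread): attaching to a jail whose
 * count is already zero bumps only the jail's own count and does not
 * re-take the reference on the parent.
 */
static void toy_attach_dying(struct toy_prison *pr)
{
    pr->uref++;
}

/* Run up to `cycles` stop/resurrect cycles; return the root's remaining count. */
static int toy_root_uref_after_cycles(int root_start, int cycles)
{
    struct toy_prison root = { root_start, NULL };
    struct toy_prison jail = { 1, &root };

    for (int i = 0; i < cycles && root.uref > 0; i++) {
        toy_uref_drop(&jail);    /* last process exits: jail 1 -> 0, root loses one */
        toy_attach_dying(&jail); /* resurrect: jail 0 -> 1, root untouched */
    }
    return root.uref;
}
```

In this model, starting the root at 5 references, three cycles leave 2 and five cycles reach zero. A real fix would have to restore the parent's reference on resurrection or refuse the attach; the model says nothing about which is appropriate.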
on 20/08/2011 13:02 Andriy Gapon said the following:
> on 18/08/2011 02:15 Steven Hartland said the following:
>> In a nutshell the jail manager we're using will attempt to resurrect the jail
>> from a dying state in a few specific scenarios.
>>
>> Here's an example:-
>> 1. jail restart requested
on 18/08/2011 02:15 Steven Hartland said the following:
> In a nutshell the jail manager we're using will attempt to resurrect the jail
> from a dying state in a few specific scenarios.
>
> Here's an example:-
> 1. jail restart requested
> 2. jail is stopped, so the java process is killed off,
on 19/08/2011 15:14 John Baldwin said the following:
> Yes, it is a bug in kgdb that it only walks allproc and not zombproc. Try
> this:
The patch worked perfectly well for me, thank you!
> Index: kthr.c
> ===
> --- kthr.c(revi
On Thursday, August 18, 2011 4:09:35 pm Andriy Gapon wrote:
> on 17/08/2011 23:21 Andriy Gapon said the following:
> > It seems like everything starts with some kind of a race between terminating
> > processes in a jail and termination of the jail itself. This is where the
> > details are very thi
2011/8/18 Andriy Gapon :
> on 17/08/2011 23:21 Andriy Gapon said the following:
>>
>> It seems like everything starts with some kind of a race between
>> terminating
>> processes in a jail and termination of the jail itself. This is where the
>> details are very thin so far. What we see is that a
on 17/08/2011 23:21 Andriy Gapon said the following:
It seems like everything starts with some kind of a race between terminating
processes in a jail and termination of the jail itself. This is where the
details are very thin so far. What we see is that a process (http) is in
exit(2) syscall, i
on 18/08/2011 14:11 Andriy Gapon said the following:
> Probably I have mistakenly assumed that the 'prison' in prison_deref() has
> something to do with an actual jail, while it could have been just prison0
> where
> all non-jailed processes belong.
So, indeed:
(kgdb) p $2->p_ucred->cr_prison
$
- Original Message -
From: "Andriy Gapon"
Probably I have mistakenly assumed that the 'prison' in prison_deref() has
something to do with an actual jail, while it could have been just prison0 where
all non-jailed processes belong.
That makes sense as this particular panic was cause
on 18/08/2011 13:35 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>>> That's interesting, are you using http as an example or is that something
>>> that's
>>> been gleaned from the debugging of our output? I ask as there's only one
>>> process
>>> running
- Original Message -
From: "Andriy Gapon"
That's interesting, are you using http as an example or is that something that's
been gleaned from the debugging of our output? I ask as there's only one process
running in each of our jails and that's a single java process.
It's from the debug d
on 18/08/2011 02:15 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>
>> Thanks to the debug that Steven provided and to the help that I received from
>> Kostik, I think that now I understand the basic mechanics of this panic, but,
>> unfortunately, not the
- Original Message -
From: "Andriy Gapon"
Thanks to the debug that Steven provided and to the help that I received from
Kostik, I think that now I understand the basic mechanics of this panic, but,
unfortunately, not the details of its root cause.
It seems like everything starts with
On Wed, Aug 17, 2011 at 11:21:42PM +0300, Andriy Gapon wrote:
[skip]
> But I also would like to use this opportunity to discuss how we can
> make it easier to debug issues such as this. I think that this problem
> demonstrates that when we treat certain junk in kernel address value
> as a userland
Thanks to the debug that Steven provided and to the help that I received from
Kostik, I think that now I understand the basic mechanics of this panic, but,
unfortunately, not the details of its root cause.
It seems like everything starts with some kind of a race between terminating
processes in a
- Original Message -
From: "Andriy Gapon"
To: "Steven Hartland"
Cc:
Sent: Wednesday, August 17, 2011 1:56 PM
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
on 17/08/2011 15:15 Steven Hartland said the following:
define allpcpu
set $i = 0
whil
on 17/08/2011 15:15 Steven Hartland said the following:
>> define allpcpu
>> set $i = 0
>> while ($i <= mp_maxid)
>> p *cpuid_to_pcpu[$i]
>> set $i = $i + 1
>> end
>> end
>> allpcpu
>
> Here's the output.
[snip]
> $3 = {pc_curthread = 0xff06b7f9c000, pc_idlethread = 0xff0012d85460,
> pc_fp
- Original Message -
From: "Andriy Gapon"
To: "Steven Hartland"
Cc:
Sent: Wednesday, August 17, 2011 12:12 PM
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
on 16/08/2011 23:43 Steven Hartland said the following:
- Original Message -
on 17/08/2011 14:12 Andriy Gapon said the following:
> A little bit later I will send you another patch that, I hope, will produce
> better
> diagnostics for this crash (without DDB in kernel).
The patch:
Index: sys/amd64/amd64/trap.c
==
on 16/08/2011 23:43 Steven Hartland said the following:
>
> - Original Message - From: "Andriy Gapon"
> To: "Steven Hartland"
> Cc:
> Sent: Tuesday, August 16, 2011 9:30 PM
> Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
>
>
- Original Message -
From: "Andriy Gapon"
To: "Steven Hartland"
Cc:
Sent: Tuesday, August 16, 2011 9:30 PM
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
on 15/08/2011 17:56 Steven Hartland said the following:
(kgdb) x/512a 0xff8d8f357
on 15/08/2011 17:56 Steven Hartland said the following:
> (kgdb) x/512a 0xff8d8f357210
[snip]
Can you please also provide the following for this core?
list *vm_map_growstack+93
list *lim_cur+17
list *lim_rlimit+18
Also, it would be interesting to get panic output with DDB option.
--
Andriy
- Original Message -
From: "Andriy Gapon"
To: "Steven Hartland"
Cc:
Sent: Monday, August 15, 2011 4:36 PM
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
on 15/08/2011 17:56 Steven Hartland said the following:
- Original Message - From
on 15/08/2011 17:56 Steven Hartland said the following:
>
> - Original Message - From: "Andriy Gapon"
> To: "Steven Hartland"
> Cc:
> Sent: Monday, August 15, 2011 2:20 PM
> Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
>
>
- Original Message -
From: "Andriy Gapon"
To: "Steven Hartland"
Cc:
Sent: Monday, August 15, 2011 2:20 PM
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
on 15/08/2011 15:51 Steven Hartland said the following:
- Original Message - From:
on 15/08/2011 15:51 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>
>
>> on 15/08/2011 13:34 Steven Hartland said the following:
>>> (kgdb) list *0x8053b691
>>> 0x8053b691 is in vm_fault (/usr/src/sys/vm/vm_fault.c:239).
>>> 234
- Original Message -
From: "Andriy Gapon"
on 15/08/2011 13:34 Steven Hartland said the following:
(kgdb) list *0x8053b691
0x8053b691 is in vm_fault (/usr/src/sys/vm/vm_fault.c:239).
234 /*
235 * Find the backing store object and offset into it
on 15/08/2011 13:34 Steven Hartland said the following:
> (kgdb) list *0x8053b691
> 0x8053b691 is in vm_fault (/usr/src/sys/vm/vm_fault.c:239).
> 234 /*
> 235 * Find the backing store object and offset into it to begin
> the
> 236 * search.
> 2
on 15/08/2011 13:34 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>> I think (not 100% sure) that with DDB in kernel we could get a better
>> backtrace
>> here, possibly with pre-dblfault stack frames, because DDB backend is a bit
>> smarter than
- Original Message -
From: "Andriy Gapon"
We have 352 thread entries starting with:-
#0 sched_switch (td=0x8083e4e0, newtd=0xff0012d838c0,
flags=Variable "flags" is not available.
23 with:-
cpustop_handler () at atomic.h:285
and 16 with:-
#0 fork_trampoline () at /usr/src/s
on 14/08/2011 17:43 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>>
>> Maybe test it on couple of machines first just in case I overlooked something
>> essential, although I have a report from another user that the patch didn't
>> break
>> anything for hi
- Original Message -
From: "Attilio Rao"
Anyway, we really would need much more information in order to take a
proactive action.
Would it be possible to get access to one of the panicking machines? Is it
always the same panic that is happening, or does it vary (like: once
page fault, o
- Original Message -
From: "Andriy Gapon"
Maybe test it on couple of machines first just in case I overlooked something
essential, although I have a report from another user that the patch didn't break
anything for him (it was tested for an unrelated issue).
We've got this running on a
- Original Message -
From: "Rick Macklem"
Just a random thought that is probably not relevant, but...
Is it possible that some change for the upgrade is making the machines
run hotter and they're failing when they overheat?
The machines have full HW monitoring and we've not seen repo
Steven Hartland wrote:
> - Original Message -
> From: "Andriy Gapon"
>
> >>> I would really appreciate if you could try to reproduce the
> >>> problem with the patch that I sent earlier.
> >>
> >> Hi Andriy, what's the risk of this patch causing other issues?
> >
> > I can not estimate.
>
on 11/08/2011 20:14 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>
I would really appreciate if you could try to reproduce the
problem with the patch that I sent earlier.
>>>
>>> Hi Andriy, what's the risk of this patch causing other issues?
>>
- Original Message -
From: "Andriy Gapon"
I would really appreciate if you could try to reproduce the
problem with the patch that I sent earlier.
Hi Andriy, what's the risk of this patch causing other issues?
I can not estimate.
The code is supposed to affect only things that happe
on 11/08/2011 19:37 Steven Hartland said the following:
> - Original Message - From: "Andriy Gapon"
>
>>
>> I would really appreciate if you could try to reproduce the problem with the
>> patch
>> that I sent earlier.
>
> Hi Andriy, what's the risk of this patch causing other issues?
I
- Original Message -
From: "Andriy Gapon"
I would really appreciate if you could try to reproduce the problem with the
patch
that I sent earlier.
Hi Andriy, what's the risk of this patch causing other issues?
I ask as to get results from this we're going to have to roll it
out to
on 11/08/2011 14:39 Steven Hartland said the following:
> The trimmed-down output, with the 10,000s of ?? lines removed, is here:-
> http://blog.multiplay.co.uk/dropzone/freebsd/panic-2011-08-11-1402.txt
>
> The raw output is here:-
> http://blog.multiplay.co.uk/dropzone/freebsd/panic-full-2011-08-11-1402
- Original Message -
From: "Andriy Gapon"
on 10/08/2011 18:35 Steven Hartland said the following:
Fatal double fault
...
#14 0x803a2cc9 in sched_switch (td=0x0, newtd=0x0, flags=Variable
"flags"
is not available.
)
at /usr/src/sys/kern/sched_ule.c:1852
Previous frame in
- Original Message -
From: "Jeremy Chadwick"
On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
That's not the issue as it's happening across the board over 130 machines :(
Agreed, bad hardware sounds unlikely here. I could believe some strange
incompatibility (e.g. BIOS
2011/8/11 Jeremy Chadwick :
> On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
>> That's not the issue as it's happening across the board over 130 machines :(
>
> Agreed, bad hardware sounds unlikely here. I could believe some strange
> incompatibility (e.g. BIOS quirk or the like[1]) t
On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
> That's not the issue as it's happening across the board over 130 machines :(
Agreed, bad hardware sounds unlikely here. I could believe some strange
incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
en masse ac
That's not the issue as it's happening across the board over 130 machines :(
Regards
Steve
- Original Message -
From: "Attilio Rao"
I'd really point the finger to faulty hw.
Please run all the necessary diagnostic tools for catching it.
Attilio
=
I'd really point the finger to faulty hw.
Please run all the necessary diagnostic tools for catching it.
Attilio
2011/8/11 Andriy Gapon :
> on 10/08/2011 18:35 Steven Hartland said the following:
>> Fatal double fault
>> rip = 0x8052f6f1
>> rsp = 0xff86ce600fb0
>> rbp = 0xff86ce6
on 10/08/2011 18:35 Steven Hartland said the following:
> Fatal double fault
> rip = 0x8052f6f1
> rsp = 0xff86ce600fb0
> rbp = 0xff86ce601210
> cpuid = 0; apic id = 00
> panic: double fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0x803af91e at kdb_backtrace+0x5e
> #1 0x
On Wed, Aug 10, 2011 at 05:26:27PM +0100, Steven Hartland wrote:
> - Original Message - From: "Jeremy Chadwick"
> free...@jdc.parodius.com
>
> >>>In combination with this, we use the following in /etc/rc.conf (the
> >>>dumpdev line is important, else savecore won't pick up anything):
> >>>
- Original Message -
From: "Jeremy Chadwick" free...@jdc.parodius.com
>In combination with this, we use the following in /etc/rc.conf (the
>dumpdev line is important, else savecore won't pick up anything):
>
>dumpdev="auto"
I thought this was meant to be the default from back in the 6.x
On Wed, Aug 10, 2011 at 04:46:17PM +0100, Steven Hartland wrote:
> >On Wed, Aug 10, 2011 at 03:22:52PM +0100, Steven Hartland wrote:
> >>The base stack reported is a double fault with no additional
> >>details and CTRL+ALT+ESC fails to break to the debugger as
> >>does an NMI, even though it at le
- Original Message -
From: "Jeremy Chadwick"
On Wed, Aug 10, 2011 at 03:22:52PM +0100, Steven Hartland wrote:
The base stack reported is a double fault with no additional
details and CTRL+ALT+ESC fails to break to the debugger as
does an NMI, even though it at least tries printing t
- Original Message -
From: "Steven Hartland"
To:
Sent: Wednesday, August 10, 2011 3:22 PM
Subject: debugging frequent kernel panics on 8.2-RELEASE
We're currently experiencing a large number of kernel panics
on FreeBSD 8.2-RELEASE across a large number of machines here.
The base s
on 10/08/2011 17:22 Steven Hartland said the following:
> The kernel is compiled with:-
> options KDB # Kernel debugger related code
> options KDB_TRACE # Print a stack trace for a panic
You also have to provide an actual debugger backend like built-in DDB or a stub
for remot
On Wed, Aug 10, 2011 at 03:22:52PM +0100, Steven Hartland wrote:
> The base stack reported is a double fault with no additional
> details and CTRL+ALT+ESC fails to break to the debugger as
> does an NMI, even though it at least tries printing the
> following many times, some quite jumbled:-
> NMI .