Re: Generating NMI due to WDT expiry

2012-02-11 Thread Andriy Gapon
on 11/02/2012 00:42 Sushanth Rai said the following:
> Basically I would like to force a system panic (and take a kernel dump) when
> the watchdog timer expires. Assuming the timer expired due to some OS bug, a
> kernel memory dump would be very useful. I'm running FreeBSD 7.2 on the Intel
> IbexPeak chipset. According to the specs, the watchdog timer on IbexPeak first
> generates an SMI and then resets the CPU. Since the SMI is handled within the
> BIOS, is there a way to generate an NMI from within the BIOS SMI handler? I see
> that the kernel has support to either enter the debugger or force a panic upon
> receipt of an NMI.
> 
> This is not necessarily a FreeBSD question, but I would like to hear any
> thoughts/pointers.

See this:
http://www.intel.com/content/dam/doc/datasheet/5-chipset-3400-chipset-datasheet.pdf
Search for NMI2SMI_EN.  Maybe it's what you want.


-- 
Andriy Gapon


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-11 Thread Andriy Gapon
on 06/02/2012 09:04 Alexander Motin said the following:
> Hi.
> 
> I've analyzed scheduler behavior and think I have found the problem with HTT.
> SCHED_ULE knows about HTT and when doing load balancing once a second, it does
> the right things. Unluckily, if some other thread gets in the way, a process
> can easily be pushed out to another CPU, where it will stay for another second
> because of CPU affinity, possibly sharing a physical core with something else
> without need.
> 
> I've made a patch, reworking SCHED_ULE affinity code, to fix that:
> http://people.freebsd.org/~mav/sched.htt.patch
> 
> This patch does three things:
>  - Disables the strict affinity optimization when HTT is detected, to let more
> sophisticated code take into account the load of the other logical core(s).
>  - Adds affinity support to the sched_lowest() function to prefer the specified
> (last used) CPU (and the CPU groups it belongs to) in case of equal load. The
> previous code always selected the first valid CPU among equals, which caused
> needless thread migration toward lower-numbered CPUs (see the sketch after
> this list).
>  - If the current CPU group has no CPU where the thread, given its priority,
> can run now, sequentially check the parent CPU groups before doing a global
> search. That should improve affinity for the next cache levels.
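
A rough illustration of the tie-break in the second item above, written as a
stand-alone toy in C; it is not the actual SCHED_ULE code, and the structures
and names are simplified stand-ins:

/*
 * Toy model of "pick the least loaded CPU, but prefer the last-used one
 * on ties".  Illustration only; not the SCHED_ULE implementation.
 */
#include <stdio.h>

#define NCPU 8

struct cpu_info {
	int load;		/* runnable threads on this CPU */
};

/*
 * Return the index of the least loaded CPU.  When several CPUs have the
 * same lowest load, prefer 'prefer' (typically the CPU the thread ran on
 * last) instead of always taking the first match.
 */
static int
pick_lowest(const struct cpu_info *cpus, int ncpu, int prefer)
{
	int best = -1;

	for (int i = 0; i < ncpu; i++) {
		if (best == -1 || cpus[i].load < cpus[best].load)
			best = i;
		else if (cpus[i].load == cpus[best].load && i == prefer)
			best = i;	/* equal load: keep affinity */
	}
	return (best);
}

int
main(void)
{
	struct cpu_info cpus[NCPU] = {
		{ 1 }, { 0 }, { 2 }, { 0 }, { 1 }, { 0 }, { 3 }, { 0 }
	};

	/* The thread last ran on CPU 5; CPUs 1, 3, 5 and 7 are equally idle. */
	printf("picked CPU %d\n", pick_lowest(cpus, NCPU, 5));
	return (0);
}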

Alexander,

I know that you are working on improving this patch and we have already
discussed some ideas via out-of-band channels.

Here are some additional ideas.  They are in part inspired by inspecting
OpenSolaris code.

Let's assume that one of the goals of a scheduler is to maximize system
performance / computational throughput[*].  I think that modern SMP-aware
schedulers try to employ the following two SMP-specific techniques to achieve 
that:
- take advantage of thread-to-cache affinity to minimize "cold cache" time
- distribute the threads over logical CPUs to optimize system resource usage by
minimizing[**] sharing of / contention over the resources, which could be
caches, instruction pipelines (for HTT threads), FPUs (for AMD Bulldozer
"cores"), etc.

1.  Affinity.
It seems that on modern CPUs the caches are either inclusive or some smart "as
if inclusive" caches.  As a result, if two cores have a shared cache at any
level, then it should be relatively cheap to move a thread from one core to the
other.  E.g. if logical CPUs P0 and P1 have private L1 and L2 caches and a
shared L3 cache, then on modern processors it should be much cheaper to move a
thread from P0 to P1 than to some processor P2 that doesn't share the L3 cache.

If this assumption is really true, then we can track only an affinity of a
thread with relation to a top level shared cache.  E.g. if migration within an
L3 cache is cheap, then we don't have any reason to constrain a migration scope
to an L2 cache, let alone L1.
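
If the assumption holds, the affinity bookkeeping collapses to one question: do
the old and the new CPU share the last-level cache?  A minimal sketch of that
test with a toy topology (hypothetical data; the kernel's real topology is
described by struct cpu_group and is richer than this):

/*
 * Toy model: a migration is considered cheap iff the source and the
 * destination CPU share the top-level (last-level) cache, so affinity
 * only needs to be tracked per LLC group rather than per core.
 */
#include <stdbool.h>
#include <stdio.h>

#define NCPU 4

/* llc_id[i] identifies the last-level cache that CPU i sits behind. */
static const int llc_id[NCPU] = { 0, 0, 1, 1 };

static bool
migration_is_cheap(int from, int to)
{
	return (llc_id[from] == llc_id[to]);
}

int
main(void)
{
	/* P0 and P1 share an LLC; P0 and P2 do not. */
	printf("P0 -> P1 cheap: %d\n", migration_is_cheap(0, 1));
	printf("P0 -> P2 cheap: %d\n", migration_is_cheap(0, 2));
	return (0);
}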

2. Balancing.
I think that the current balancing code is pretty good, but can be augmented
with the following:
 A. In the longer term, the SMP topology should include other important shared
resources, not only caches.  We already have this in some form via
CG_FLAG_THREAD, which implies instruction pipeline sharing.

 B. Given the affinity assumptions, sched_pickcpu can pick the best CPU only
among CPUs sharing a top level cache if a thread still has an affinity to it or
among all CPUs otherwise.  This should reduce temporary imbalances.

 C. I think that we should eliminate the bias in the sched_lowest() family of
functions.  I like how your patch started addressing this.  For the cases where
the hint (cg_prefer) can not be reasonably picked it should be a pseudo-random
value.  OpenSolaris does it the following way:
http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;im=10;i=CPU_PSEUDO_RANDOM
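
A sketch of that fallback: when there is no meaningful last-used CPU, fill the
hint with a cheap pseudo-random value so ties no longer collapse onto CPU 0
(the mixing function below is arbitrary and is not meant to mirror the
OpenSolaris one):

/*
 * Illustration of removing the "always take the first equal CPU" bias:
 * with no meaningful last-used CPU to prefer, derive the tie-break hint
 * pseudo-randomly so equal-load ties spread across CPUs.  Any cheap,
 * roughly uniform generator would do; xorshift32 is used here.
 */
#include <stdint.h>
#include <stdio.h>

#define NCPU 8

static uint32_t state = 0x9e3779b9;

static int
pseudo_random_cpu(void)
{
	state ^= state << 13;
	state ^= state >> 17;
	state ^= state << 5;
	return ((int)(state % NCPU));
}

int
main(void)
{
	/* With a constant hint, every tie would resolve to the same CPU. */
	for (int i = 0; i < 4; i++)
		printf("tie-break hint: CPU %d\n", pseudo_random_cpu());
	return (0);
}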

Footnotes:
[*] Goals of a scheduler could be controlled via policies.  E.g. there could be
a policy to reduce power usage.

[**] Given a possibility of different policies a scheduler may want to
concentrate threads.  E.g. if a system has two packages with two cores each and
there are two CPU-hungry threads, then the system may place them both on the
same package to reduce power usage.
Another interesting case is threads that share a VM space or otherwise share
some non-trivial amount of memory.  As you have suggested, it might make sense
to concentrate those threads so that they share a cache.
-- 
Andriy Gapon


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-11 Thread Alexander Motin

On 02/11/12 15:35, Andriy Gapon wrote:

> on 06/02/2012 09:04 Alexander Motin said the following:
>> I've analyzed scheduler behavior and think I have found the problem with HTT.
>> SCHED_ULE knows about HTT and when doing load balancing once a second, it does
>> the right things. Unluckily, if some other thread gets in the way, a process
>> can easily be pushed out to another CPU, where it will stay for another second
>> because of CPU affinity, possibly sharing a physical core with something else
>> without need.
>>
>> I've made a patch, reworking SCHED_ULE affinity code, to fix that:
>> http://people.freebsd.org/~mav/sched.htt.patch
>>
>> This patch does three things:
>>  - Disables the strict affinity optimization when HTT is detected, to let more
>> sophisticated code take into account the load of the other logical core(s).
>>  - Adds affinity support to the sched_lowest() function to prefer the specified
>> (last used) CPU (and the CPU groups it belongs to) in case of equal load. The
>> previous code always selected the first valid CPU among equals, which caused
>> needless thread migration toward lower-numbered CPUs.
>>  - If the current CPU group has no CPU where the thread, given its priority,
>> can run now, sequentially check the parent CPU groups before doing a global
>> search. That should improve affinity for the next cache levels.


> Alexander,
>
> I know that you are working on improving this patch and we have already
> discussed some ideas via out-of-band channels.


I've heavily rewritten the patch already, so at least some of the ideas are
already addressed. :) At this moment I am mostly satisfied with the results,
and after final tests today I'll probably publish a new version.



> Here are some additional ideas.  They are in part inspired by inspecting
> OpenSolaris code.
>
> Let's assume that one of the goals of a scheduler is to maximize system
> performance / computational throughput[*].  I think that modern SMP-aware
> schedulers try to employ the following two SMP-specific techniques to achieve
> that:
> - take advantage of thread-to-cache affinity to minimize "cold cache" time
> - distribute the threads over logical CPUs to optimize system resource usage by
> minimizing[**] sharing of / contention over the resources, which could be
> caches, instruction pipelines (for HTT threads), FPUs (for AMD Bulldozer
> "cores"), etc.
>
> 1.  Affinity.
> It seems that on modern CPUs the caches are either inclusive or some smart "as
> if inclusive" caches.  As a result, if two cores have a shared cache at any
> level, then it should be relatively cheap to move a thread from one core to the
> other.  E.g. if logical CPUs P0 and P1 have private L1 and L2 caches and a
> shared L3 cache, then on modern processors it should be much cheaper to move a
> thread from P0 to P1 than to some processor P2 that doesn't share the L3 cache.


Absolutely true! On smack-mysql indexed-select benchmarks I've found that on an
Atom CPU with two cores and no L3 it is cheaper to move two mysql threads onto
one physical core (shared L2 cache), suffering from SMT, than to bounce data
between the cores. At the same time, on a Core i7 with a shared L3 and also SMT
the results are exactly the opposite.



> If this assumption is really true, then we can track only an affinity of a
> thread with relation to a top level shared cache.  E.g. if migration within an
> L3 cache is cheap, then we don't have any reason to constrain a migration scope
> to an L2 cache, let alone L1.


In the present patch version I've implemented two different thresholds: one for
the last-level cache and one for the rest. That's why I am waiting for your
patch to properly detect cache topologies. :)



> 2. Balancing.
> I think that the current balancing code is pretty good, but can be augmented
> with the following:
>  A. In the longer term, the SMP topology should include other important shared
> resources, not only caches.  We already have this in some form via
> CG_FLAG_THREAD, which implies instruction pipeline sharing.


At this moment I am using different penalty coefficients for SMT and for shared
caches (for unrelated processes sharing is not good). It's no problem to add
more types there. A separate flag for a shared FPU could be used to apply
different penalty coefficients to usual threads and FPU-less kernel threads.
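
For what it's worth, a toy picture of such per-resource penalties; the
coefficients and structures are invented for illustration, and the patch's
actual accounting is certainly different:

/*
 * Toy model of per-resource sharing penalties: when evaluating a
 * candidate CPU for an unrelated thread, weight the load of the CPUs it
 * shares resources with by how much that kind of sharing hurts.
 */
#include <stdio.h>

enum share_type { SHARE_NONE, SHARE_CACHE, SHARE_FPU, SHARE_SMT };

/* Higher value == sharing hurts an unrelated thread more. */
static int
share_penalty(enum share_type how)
{
	switch (how) {
	case SHARE_SMT:		return (3);	/* instruction pipeline (HTT) */
	case SHARE_FPU:		return (2);	/* Bulldozer-style shared FPU */
	case SHARE_CACHE:	return (1);
	default:		return (0);
	}
}

/* Load a new, unrelated thread would "feel" on this CPU. */
static int
effective_load(int own_load, int neighbour_load, enum share_type how)
{
	return (own_load * 4 + neighbour_load * share_penalty(how));
}

int
main(void)
{
	/* The candidate CPU is idle; only its neighbour is busy. */
	printf("busy SMT sibling:     %d\n", effective_load(0, 1, SHARE_SMT));
	printf("busy cache neighbour: %d\n", effective_load(0, 1, SHARE_CACHE));
	return (0);
}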



>  B. Given the affinity assumptions, sched_pickcpu can pick the best CPU only
> among CPUs sharing a top level cache if a thread still has an affinity to it or
> among all CPUs otherwise.  This should reduce temporary imbalances.


I've done it in a more complicated way. I apply cache affinity with weight 2 to
all paths with currently running threads of the same process, and with weight 1
to the previous path where the thread was running. I believe that constant
cache thrashing between two running threads is much worse than a single jump
from one CPU to another on some context switches. Though it could be made
configurable.
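
Roughly, the scoring can be pictured like this (a toy model of the weighting
just described, with invented bookkeeping; it is not the patch itself):

/*
 * Toy model of the affinity weighting: a candidate CPU group gets +2 for
 * every thread of the same process currently running in it and +1 if it
 * is where the thread last ran; the highest score wins.
 */
#include <stdio.h>

#define NGROUP 2

struct group_state {
	int running_siblings;	/* same-process threads running here now */
	int was_last_group;	/* 1 if the thread previously ran here */
};

static int
affinity_score(const struct group_state *g)
{
	return (2 * g->running_siblings + 1 * g->was_last_group);
}

int
main(void)
{
	struct group_state groups[NGROUP] = {
		/* group 0: the thread ran here last, no siblings running */
		{ .running_siblings = 0, .was_last_group = 1 },
		/* group 1: one sibling of the same process is running here */
		{ .running_siblings = 1, .was_last_group = 0 },
	};
	int best = 0;

	for (int i = 1; i < NGROUP; i++)
		if (affinity_score(&groups[i]) > affinity_score(&groups[best]))
			best = i;
	printf("preferred group: %d\n", best);	/* group 1 wins: 2 > 1 */
	return (0);
}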



>  C. I think that we should eliminate the bias in the sched_lowest() family of
> functions.  I like how your patch started addressing this.  For the cases where
> the hint (cg_prefer) can not be reasonably picked it should be a pseudo-random
> value.  OpenSolaris does it the following way:
> http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;im=10;i=CPU_PSEUDO_RANDOM

Re: [RFT][patch] Scheduling for HTT and not only

2012-02-11 Thread Konstantin Belousov
On Sat, Feb 11, 2012 at 04:21:25PM +0200, Alexander Motin wrote:
> At this moment I am using different penalty coefficients for SMT and for shared
> caches (for unrelated processes sharing is not good). It's no problem to add
> more types there. A separate flag for a shared FPU could be used to apply
> different penalty coefficients to usual threads and FPU-less kernel threads.
It is very easy to record the fact of FPU access during the quantum on
the context switch-out. So you can at least distinguish numeric code.
This can be useful for Bulldozer-like machines, if we ever want to
optimize for FPU on them.
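
A sketch of the bookkeeping being suggested, with invented names (the real hook
would live in the MD context-switch path and read the FPU ownership state the
kernel already keeps):

/*
 * Toy model: at switch-out, remember whether the thread touched the FPU
 * during its quantum, so placement can apply an FPU-sharing penalty only
 * to threads that actually do floating-point work.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_thread {
	bool used_fpu_this_quantum;	/* set by the FPU trap/owner logic */
	bool is_numeric;		/* sticky "does FP work" marker */
};

/* Would be called from the context switch-out path. */
static void
switch_out_hook(struct toy_thread *td)
{
	if (td->used_fpu_this_quantum)
		td->is_numeric = true;		/* remember for placement */
	td->used_fpu_this_quantum = false;	/* re-arm for the next quantum */
}

/* Placement then penalizes sharing an FPU only for numeric threads. */
static int
fpu_share_penalty(const struct toy_thread *td)
{
	return (td->is_numeric ? 2 : 0);	/* arbitrary coefficient */
}

int
main(void)
{
	struct toy_thread td = { .used_fpu_this_quantum = true };

	switch_out_hook(&td);
	printf("FPU sharing penalty: %d\n", fpu_share_penalty(&td));
	return (0);
}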




Re: [RFT][patch] Scheduling for HTT and not only

2012-02-11 Thread Andriy Gapon
on 11/02/2012 15:35 Andriy Gapon said the following:
> It seems that on modern CPUs the caches are either inclusive or some smart "as
> if inclusive" caches.  As a result, if two cores have a shared cache at any
> level, then it should be relatively cheap to move a thread from one core to the
> other.  E.g. if logical CPUs P0 and P1 have private L1 and L2 caches and a
> shared L3 cache, then on modern processors it should be much cheaper to move a
> thread from P0 to P1 than to some processor P2 that doesn't share the L3 cache.

Having read this paper
http://www.cs.uwaterloo.ca/~brecht/courses/856/Possible-Readings/multicore/cache-performance-x86-2009.pdf
I think that I have been too optimistic about the smartness of caches in some
processors...

-- 
Andriy Gapon


Re: Generating NMI due to WDT expiry

2012-02-11 Thread Sushanth Rai

I had looked at this. It seems to be doing the opposite of what I want: it
routes an NMI as an SMI, and I need an SMI to trigger an NMI. The watchdog
timer on the 3100 chipset had the ability to send either an NMI or an SMI when
the timer fired for the first time, and I used the NMI to generate a kernel
panic. With the 3400 no longer generating an NMI on WDT expiry, I'm trying to
figure out how I can force a memory dump on watchdog expiry.
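
For reference, the flow being asked for looks roughly like this toy model
(invented names; in FreeBSD the real decision is made in the machine-dependent
NMI trap handler, controlled by the machdep.panic_on_nmi sysctl, and the panic
writes the dump to the device configured with dumpon(8)):

/*
 * Toy model of the desired path: watchdog fires -> NMI reaches the
 * kernel -> panic -> crash dump.  Illustration only.
 */
#include <stdio.h>
#include <stdlib.h>

static int panic_on_nmi = 1;	/* stands in for the sysctl knob */

static void
toy_panic(const char *why)
{
	/* A real panic would write the kernel dump to the dump device here. */
	fprintf(stderr, "panic: %s (dumping...)\n", why);
	abort();
}

/* Pretend this is invoked from the NMI trap vector. */
static void
toy_nmi_handler(void)
{
	if (panic_on_nmi)
		toy_panic("NMI (watchdog expiry)");
	/* otherwise the NMI would just be reported and ignored */
}

int
main(void)
{
	toy_nmi_handler();	/* does not return when panic_on_nmi is set */
	return (0);
}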

Sushanth
 
--- On Sat, 2/11/12, Andriy Gapon  wrote:

> From: Andriy Gapon 
> Subject: Re: Generating NMI due to WDT expiry
> To: "Sushanth Rai" 
> Cc: freebsd-hackers@FreeBSD.org
> Date: Saturday, February 11, 2012, 3:06 AM
> on 11/02/2012 00:42 Sushanth Rai said the following:
> > Basically I would like to force a system panic (and take a kernel dump) when
> > the watchdog timer expires. Assuming the timer expired due to some OS bug, a
> > kernel memory dump would be very useful. I'm running FreeBSD 7.2 on the Intel
> > IbexPeak chipset. According to the specs, the watchdog timer on IbexPeak first
> > generates an SMI and then resets the CPU. Since the SMI is handled within the
> > BIOS, is there a way to generate an NMI from within the BIOS SMI handler? I
> > see that the kernel has support to either enter the debugger or force a panic
> > upon receipt of an NMI.
> > 
> > This is not necessarily a FreeBSD question, but I would like to hear any
> > thoughts/pointers.
> 
> See this:
> http://www.intel.com/content/dam/doc/datasheet/5-chipset-3400-chipset-datasheet.pdf
> Search for NMI2SMI_EN.  Maybe it's what you want.
> 
> 
> -- 
> Andriy Gapon