Re: [perf-discuss] Re: Puzzling scheduler behavior

2005-09-01 Thread jonathan chew

Dave,

It sounds like you have an interesting application.  You might want to 
create a processor set, leave some CPUs outside the psrset for other 
threads to run on, and run your application in a processor set to 
minimize interference from other threads.  As long as there are enough 
CPUs for your application in the psrset, you should see the number of 
migrations go down because there won't be any interference from other 
threads.
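
If it helps to see the same thing programmatically rather than through 
psrset(1M), here is a minimal, illustrative C sketch using the 
pset_create(2), pset_assign(2), and pset_bind(2) calls; the CPU IDs 
(8-11) are placeholders and the sketch binds the calling process rather 
than your application:

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>
#include <stdio.h>

int
main(void)
{
    psetid_t pset;
    processorid_t cpus[] = { 8, 9, 10, 11 };    /* placeholder CPU IDs */
    int i;

    /* Create an empty processor set and move the chosen CPUs into it. */
    if (pset_create(&pset) != 0) {
        perror("pset_create");
        return (1);
    }
    for (i = 0; i < 4; i++) {
        if (pset_assign(pset, cpus[i], NULL) != 0)
            perror("pset_assign");
    }

    /* Bind the calling process (and all of its LWPs) to the new set. */
    if (pset_bind(pset, P_PID, P_MYID, NULL) != 0)
        perror("pset_bind");

    (void) printf("bound to processor set %d\n", (int)pset);
    return (0);
}

The psrset(1M) command does the same thing from the shell and is what I 
would normally reach for first.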


To get a better understanding of the Solaris performance optimizations 
done for NUMA, you might want to check out the overview of Memory 
Placement Optimization (MPO) at:


   http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO.  Binding to a 
processor set containing one CPU set the home lgroup of the thread to 
the lgroup containing that CPU and destroying the psrset just left the 
thread homed there.


Your shared memory is probably spread across the system already because 
the default MPO memory allocation policy for shared memory is to 
allocate the memory from random lgroups across the system.


We have some prototype observability tools which allow you to examine 
the lgroup hierarchy and its contents and observe and/or control how the 
threads and memory are placed among lgroups (see 
http://opensolaris.org/os/community/performance/numa/observability/).  
The source, binaries, and man pages are there.
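
If you would rather poke at the lgroup hierarchy from your own program 
instead of (or in addition to) our tools, here is a rough sketch that 
walks the hierarchy with liblgrp(3LIB) and prints the CPU count and 
installed memory for each lgroup; it sticks to the documented interfaces 
but is only meant as a starting point (compile with -llgrp):

#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>

/* Recursively print an lgroup, then its children. */
static void
print_lgrp(lgrp_cookie_t cookie, lgrp_id_t lgrp, int depth)
{
    int ncpus = lgrp_cpus(cookie, lgrp, NULL, 0, LGRP_CONTENT_DIRECT);
    lgrp_mem_size_t mem = lgrp_mem_size(cookie, lgrp,
        LGRP_MEM_SZ_INSTALLED, LGRP_CONTENT_DIRECT);
    int nkids = lgrp_children(cookie, lgrp, NULL, 0);
    lgrp_id_t *kids;
    int i;

    (void) printf("%*slgroup %d: %d CPUs, %lld bytes of memory\n",
        depth * 2, "", (int)lgrp, ncpus, (long long)mem);

    if (nkids <= 0)
        return;
    kids = malloc(nkids * sizeof (lgrp_id_t));
    nkids = lgrp_children(cookie, lgrp, kids, nkids);
    for (i = 0; i < nkids; i++)
        print_lgrp(cookie, kids[i], depth + 1);
    free(kids);
}

int
main(void)
{
    lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

    if (cookie == LGRP_COOKIE_NONE) {
        perror("lgrp_init");
        return (1);
    }
    print_lgrp(cookie, lgrp_root(cookie), 0);
    (void) lgrp_fini(cookie);
    return (0);
}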




Jonathan


David McDaniel (damcdani) wrote:


 Very, very enlightening, Eric. It's really terrific to have this kind
of channel for dialog.
 The "return to home base" behavior you describe is clearly consistent
with what I see and makes perfect sense.
 Let me follow up with a question. In this application, processes have
not only their "own" memory, i.e., heap, stack, program text and data, etc.,
but they also share a moderately large (~2-5GB today) amount of memory
in the form of mmap'd files. From Sherry Moore's previous posts, I'm
assuming that at startup time that would actually be all allocated in
one board. Since I'm contemplating moving processes onto psrsets off
that board, would it be plausible to assume that I might get slightly
better net throughput if I could somehow spread that across all the
boards? I know it's speculation of the highest order, so maybe my real
question is whether that's even worth testing.
 In any case, I'd love to turn the knob you mention and I'll look on
the performance community page and see what kind of trouble I can get
into. If there are any particular items you think I should check out,
guidance is welcome.
Regards
-d

 


-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of 
Eric C. Saxe

Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior

Hi David,

Since your v1280 system has NUMA characteristics, the bias 
that you see for one of the boards may be a result of the 
kernel trying to run your application's threads "close" to 
where they have allocated their memory. We also generally try 
to keep threads in the same process together, since they 
generally tend to work on the same data. This might explain 
why one of the boards is so much busier than the others. 

So yes, the interesting piece of this seems to be the higher 
than expected run queue wait time (latency) as seen via 
prstat -Lm. Even with the thread-to-board/memory affinity I 
mentioned above, it generally shouldn't be the case that 
threads are willing to hang out on a run queue waiting for a 
CPU on their "home" when that thread *could* actually run 
immediately on a "remote" (off-board) CPU.

Better to run remote than not at all, or so the saying goes :)

In the case where a thread is dispatched remotely because all 
home CPUs are busy, the thread will try to migrate back home 
the next time it comes through the dispatcher and finds it 
can run immediately at home (either because there's an idle 
CPU, or because one of the running threads is lower priority 
than us, and we can preempt it). This migrating around means 
that the thread will tend to spend more time waiting on run 
queues, since it has to either wait for the idle() thread to 
switch off, or for the lower priority thread it's able to 
preempt to surrender the CPU. Either way, the thread 
shouldn't have to wait long to get the CPU, but it will have 
to wait a non-zero amount of time.


What does the prstat -Lm output look like exactly? Is it a 
lot of wait time, or just more than you would expect?


By the way, just to be clear, when I say "board" what I 
should be saying is lgroup (or locality group). This is the 
Solaris abstraction for a set of CPU and memory resources 
that are close to one another. On your system, it turns out 
that the kernel creates an lgroup for each board, and each thread 
is given an affinity for one of the lgroups, such that it 
will try to run on the CPUs (and allocate memory from that 
group

Re: [perf-discuss] Re: Puzzling scheduler behavior

2005-09-01 Thread jonathan chew

David McDaniel (damcdani) wrote:


 Thanks, Jonathan, for the good insights. I'll be digging into the
references you mentioned. Yes, at the end of the day I'm sure binding to
processor sets is part of the plan; having already done so in a rather
rote way I can demonstrate a very dramatic reduction in apparent CPU
utilization, on the order of 25-30%. But before I commit engineers to
casting something in stone I want to make sure I understand the defaults
and the side effects of doing so since it potentially results in
defeating other improvements that Sun has done or will be doing. 
 



Sure.  No problem.  The overview and man pages for our tools are pretty 
short.  The tools are very easy to use and kind of fun to play with.  
I'm going to try to post a good example of how to use them later today.


I think that using a psrset is an interesting experiment to see whether 
interference is a big factor in all the migrations.  It would be nice 
not to have to do that by default though.


It sounds like you already tried this experiment though and noticed a 
big difference.  Did the migrations drop dramatically?  What else is 
running on the system when you don't use a psrset?



Jonathan


-Original Message-
From: jonathan chew [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 01, 2005 11:50 AM

To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

It sounds like you have an interesting application.  You 
might want to create a processor set, leave some CPUs outside 
the psrset for other threads to run on, and run your 
application in a processor set to minimize interference from 
other threads.  As long as there are enough CPUs for your 
application in the psrset, you should see the number of 
migrations go down because there won't be any interference 
from other threads.


To get a better understanding of the Solaris performance 
optimizations done for NUMA, you might want to check out the 
overview of Memory Placement Optimization (MPO) at:


   http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO.  Binding 
to a processor set containing one CPU set the home lgroup of 
the thread to the lgroup containing that CPU and destroying 
the psrset just left the thread homed there.


Your shared memory is probably spread across the system 
already because the default MPO memory allocation policy for 
shared memory is to allocate the memory from random lgroups 
across the system.


We have some prototype observability tools which allow you to 
examine the lgroup hierarchy and its contents and observe 
and/or control how the threads and memory are placed among 
lgroups (see 
http://opensolaris.org/os/community/performance/numa/observability/).  
The source, binaries, and man pages are there.




Jonathan


David McDaniel (damcdani) wrote:

   

Very, very enlightening, Eric. It's really terrific to have this kind
of channel for dialog.
The "return to home base" behavior you describe is clearly consistent
with what I see and makes perfect sense.
Let me follow up with a question. In this application, processes have
not only their "own" memory, i.e., heap, stack, program text and data, etc.,
but they also share a moderately large (~2-5GB today) amount of memory
in the form of mmap'd files. From Sherry Moore's previous posts, I'm
assuming that at startup time that would actually be all allocated in
one board. Since I'm contemplating moving processes onto psrsets off
that board, would it be plausible to assume that I might get slightly
better net throughput if I could somehow spread that across all the
boards? I know it's speculation of the highest order, so maybe my real
question is whether that's even worth testing.
In any case, I'd love to turn the knob you mention and I'll look on
the performance community page and see what kind of trouble I can get
into. If there are any particular items you think I should check out,
guidance is welcome.
Regards
-d


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe

Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior

Hi David,

Since your v1280 system has NUMA characteristics, the bias that you
see for one of the boards may be a result of the kernel trying to run
your application's threads "close" to where they have allocated their
memory. We also generally try to keep threads in the same process
together, since they generally tend to work on the same data. This
might explain why one

Re: [perf-discuss] Re: Puzzling scheduler behavior

2005-09-09 Thread jonathan chew

Dave,

Sorry, I forgot to reply to this sooner.  Yes, I was just curious what 
else was running to see whether we would expect your application to be 
perturbed much.


There could be a load imbalance due to the daemons throwing everything 
off once in a while.  This could be affecting how the threads in your 
application are distributed across the nodes in your NUMA machine.


Each thread is assigned a home locality group upon creation and the 
kernel will tend to run it on CPUs in its home lgroup and allocate its 
memory there to minimize latency and maximize performance by default.  
There is an lgroup corresponding to each of the nodes (boards) in your 
NUMA machine.  The assignment of threads to lgroups is based on lgroup 
load averages, so other threads may cause the lgroup load average to go 
up or down and thus affect how threads are placed among lgroups.


You can use plgrp(1) which is available on our NUMA observability web 
page at 
http://opensolaris.org/os/community/performance/numa/observability to 
see where your application processes/threads are homed.  Then we can see 
whether they are distributed very well.  You can also use plgrp(1) to 
change the home lgroup of a thread, but should be careful because there 
can be side effects as explained in the example referred to below.
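
For anyone who prefers the programmatic route, roughly the same 
observations and changes that plgrp(1) makes can be done with liblgrp; 
the sketch below queries the calling thread's home lgroup and then sets 
a strong affinity for another lgroup (lgroup 2 is just a made-up 
target), which re-homes the thread with the same caveats about side 
effects (compile with -llgrp):

#include <sys/lgrp_user.h>
#include <sys/procset.h>
#include <stdio.h>

int
main(void)
{
    lgrp_id_t target = 2;    /* placeholder target lgroup */

    /* Where is this thread homed right now? */
    (void) printf("home lgroup before: %d\n",
        (int)lgrp_home(P_LWPID, P_MYID));

    /*
     * Setting a strong affinity for another lgroup re-homes the thread
     * there; the side effects mentioned above apply.
     */
    if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) != 0)
        perror("lgrp_affinity_set");

    (void) printf("home lgroup after: %d\n",
        (int)lgrp_home(P_LWPID, P_MYID));
    return (0);
}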


There are man pages, source, and binaries for our tools on the web 
page.  I wrote up a good example of how to use the tools to understand, 
observe, and affect thread and memory placement among lgroups on a NUMA 
machine and posted it on the web page in 
http://opensolaris.org/os/community/performance/example.txt.


You can also try using the lgrp_expand_proc_thresh tunable that Eric 
suggested last week.


Are the migrations that you are seeing when not running a psrset causing 
a performance problem for your application?




Jonathan


David McDaniel (damcdani) wrote:


 When using psrsets, the migrations and involuntary context switches go
essentially to zero. As far as "other stuff", not quite sure what you
mean, but this application runs on a dedicated server so there is no
stuff of a casual nature, however there is a lot of what I'll glom into
the category of "support" tasks, i.e. ntp daemons, nscd flushing caches,
fsflush running around backing up pages, etc. Was that what you meant? 

 


-Original Message-
From: jonathan chew [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 01, 2005 12:45 PM

To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

David McDaniel (damcdani) wrote:

   

Thanks, Jonathan, for the good insights. I'll be digging into the
references you mentioned. Yes, at the end of the day I'm sure binding
to processor sets is part of the plan; having already done so in a
rather rote way I can demonstrate a very dramatic reduction in apparent
CPU utilization, on the order of 25-30%. But before I commit engineers
to casting something in stone I want to make sure I understand the
defaults and the side effects of doing so since it potentially results
in defeating other improvements that Sun has done or will be doing.


 

Sure.  No problem.  The overview and man pages for our tools 
are pretty short.  The tools are very easy to use and kind of 
fun to play with.  
I'm going to try to post a good example of how to use them 
later today.


I think that using a psrset is an interesting experiment to 
see whether interference is a big factor in all the 
migrations.  It would be nice not to have to do that by 
default though.


It sounds like you already tried this experiment though and 
noticed a big difference.  Did the migrations drop 
dramatically?  What else is running on the system when you 
don't use a psrset?



Jonathan

   


-Original Message-
From: jonathan chew [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 01, 2005 11:50 AM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

It sounds like you have an interesting application.  You might want to
create a processor set, leave some CPUs outside the psrset for other
threads to run on, and run your application in a processor set to
minimize interference from other threads.  As long as there are enough
CPUs for your application in the psrset, you should see the number of
migrations go down because there won't be any interference from other
threads.

To get a better understanding of the Solaris performance optimizations
done for NUMA, you might want to check out the overview of Memory
Placement Optimization (MPO) at:


  http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO.  Bind

Re: [perf-discuss] NUMA ptools and ISM segments

2005-09-16 Thread jonathan chew

Marc Rocas wrote:

I've been playing around with the tools on a Stinger box and I think 
they're pretty cool!



I'm glad that you like them.  We like them too and think that they are 
fun to play with besides being useful for observability and 
experimenting with performance.  I hope that our MPO overview, man 
pages, and our examples on how to use the tools were helpful.


The one question I have is whether ISM (Intimate Shared Memory) 
segments are immune to being coerced to relocate via pmadvise(3c)? 
I've tried it without success. A quick look at the seg_spt.c code 
seemed to indicate that when an spt segment is created its lgroup 
policy is set to LGRP_MEM_POLICY_DEFAULT that will result in 
randomized allocation for segments greater than 8MB. I've verified as 
much using the NUMA enhanced pmap(3c) command.



It sounds like Eric Lowe has a theory as to why madvise(MADV_ACCESS_*) 
and pmadvise(1) didn't work for migrating your ISM segment.  Joe and 
Nils are experts on the x86/AMD64 HAT and may be able to comment on 
Eric's theory that the lack of dynamic ISM unmap is preventing page 
migration from working.


I'll see whether I can reproduce the problem.

Which version of Solaris are you using?


I have an ISM segment that gets consumed by HW that is limited to 
32-bit addressing and thus have a need to control the physical range 
that backs the segment. At this point, it would seem that I need to 
allocate the memory (about 300MB) in the kernel and map it back to 
user-land but I would lose the use of 2MB pages since I have not quite 
figured out how to allocate memory using a particular page size. Have 
I misinterpreted the code? Do I have other options?



Bart's suggestion is the simplest (brute force) way.  Eric's suggestion 
sounded a little painful.  I have a couple of other options below, but 
can't think of a nice way to specify that your application needs low 
physical memory.  So, I want to understand what you are doing better to 
see if there is any better way.


Can you please tell me more about your situation and requirements?  What 
is the hardware that needs the 32-bit addressing (framebuffer, device 
needing DMA, etc.)?  Besides needing to be in low physical memory, does 
it need to be wired down and shared?




Jonathan

PS
Assuming that you can change your code, here are a couple of other 
options that are less painful than Eric's suggestion but still aren't 
very elegant because of the low physical memory requirement:


- Use DISM instead of ISM which is Dynamic Intimate Shared Memory and is 
pageable (see SHM_PAGEABLE flag to shmat(2))


OR

- Use  mmap(2) and the MAP_ANON (and MAP_SHARED if you need shared 
memory) flag to allocate (shared) anonymous memory


- Call memcntl(2) with MC_HAT_ADVISE to specify that you want large pages

AND

- Call madvise(MADV_ACCESS_LWP) on your mmap-ed or DISM segment to say 
that the next thread to access it will use it a lot


- Access it from CPU 0 on your Stinger.  I don't like this part because 
this is hardware implementation specific.  It turns out that the low 
physical memory usually lives near CPU 0 on an Opteron box.  You can use 
liblgrp(3LIB) to discover which leaf lgroup contains CPU 0 and 
lgrp_affinity_set(3LGRP) to set a strong affinity for that lgroup (which 
will set the home lgroup for the thread to that lgroup).  Alternatively, 
you can use processor_bind(2) to bind/unbind to CPU 0.


- Use the MC_LOCK flag to memcntl(2) to lock down the memory for your  
segment if you want the physical memory to stay there (until you unlock it)
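
To show how a few of the options above might fit together, here is a 
rough, hedged sketch of the mmap(2)-based route (MAP_ANON plus 
MC_HAT_ADVISE, MADV_ACCESS_LWP, and MC_LOCK); the 300MB size and 2MB 
page size come from this thread, the touching thread is assumed to 
already be homed in the lgroup you want (see the CPU 0 note above), and 
error handling is minimal:

#include <sys/types.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>

#define SEG_SIZE    (300UL * 1024 * 1024)    /* ~300MB, per this thread */
#define LARGE_PAGE  (2UL * 1024 * 1024)      /* 2MB pages on Opteron */

int
main(void)
{
    caddr_t addr;
    struct memcntl_mha mha;

    /* Shared anonymous memory instead of ISM. */
    addr = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_SHARED, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    /* Ask the HAT to use 2MB pages for this mapping. */
    mha.mha_cmd = MHA_MAPSIZE_VA;
    mha.mha_flags = 0;
    mha.mha_pagesize = LARGE_PAGE;
    if (memcntl(addr, SEG_SIZE, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0) != 0)
        perror("memcntl(MC_HAT_ADVISE)");

    /* Next-touch placement: the next thread to touch it will use it a lot. */
    if (madvise(addr, SEG_SIZE, MADV_ACCESS_LWP) != 0)
        perror("madvise(MADV_ACCESS_LWP)");

    /* Touch the memory from the right thread, then lock it down. */
    (void) memset(addr, 0, SEG_SIZE);
    if (memcntl(addr, SEG_SIZE, MC_LOCK, 0, 0, 0) != 0)
        perror("memcntl(MC_LOCK)");

    return (0);
}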


___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] NUMA ptools and ISM segments

2005-09-16 Thread jonathan chew

PPS
I forgot to ask what you did to test whether pmadvise(1) would migrate 
your ISM segment and how you know that it didn't.



jonathan chew wrote:


Marc Rocas wrote:

I've been playing around with the tools on a Stinger box and I think 
they're pretty cool!




I'm glad that you like them.  We like them too and think that they are 
fun to play with besides being useful for observability and 
experimenting with performance.  I hope that our MPO overview, man 
pages, and our examples on how to use the tools were helpful.


The one question I have is whether ISM (Intimate Shared Memory) 
segments are immune to being coerced to relocate via pmadvise(3c)? 
I've tried it without success. A quick look at the seg_spt.c code 
seemed to indicate that when an spt segment is created its lgroup 
policy is set to LGRP_MEM_POLICY_DEFAULT that will result in 
randomized allocation for segments greater than 8MB. I've verified as 
much using the NUMA enhanced pmap(3c) command.




It sounds like Eric Lowe has a theory as to why madvise(MADV_ACCESS_*) 
and pmadvise(1) didn't work for migrating your ISM segment.  Joe and 
Nils are experts on the x86/AMD64 HAT and may be able to comment on 
Eric's theory that the lack of dynamic ISM unmap is preventing page 
migration from working.


I'll see whether I can reproduce the problem.

Which version of Solaris are you using?


I have an ISM segment that gets consumed by HW that is limited to 
32-bit addressing and thus have a need to control the physical range 
that backs the segment. At this point, it would seem that I need to 
allocate the memory (about 300MB) in the kernel and map it back to 
user-land but I would lose the use of 2MB pages since I have not 
quite figured out how to allocate memory using a particular page 
size. Have I misinterpreted the code? Do I have other options?




Bart's suggestion is the simplest (brute force) way.  Eric's 
suggestion sounded a little painful.  I have a couple of other options 
below, but can't think of a nice way to specify that your application 
needs low physical memory.  So, I want to understand what you are 
doing better to see if there is any better way.


Can you please tell me more about your situation and requirements?  
What is the hardware that needs the 32-bit addressing (framebuffer, 
device needing DMA, etc.)?  Besides needing to be in low physical 
memory, does it need to be wired down and shared?




Jonathan

PS
Assuming that you can change your code, here are a couple of other 
options that are less painful than Eric's suggestion but still aren't 
very elegant because of the low physical memory requirement:


- Use DISM instead of ISM which is Dynamic Intimate Shared Memory and 
is pageable (see SHM_PAGEABLE flag to shmat(2))


OR

- Use  mmap(2) and the MAP_ANON (and MAP_SHARED if you need shared 
memory) flag to allocate (shared) anonymous memory


- Call memcntl(2) with MC_HAT_ADVISE to specify that you want large pages

AND

- Call madvise(MADV_ACCESS_LWP) on your mmap-ed or DISM segment to say 
that the next thread to access it will use it a lot


- Access it from CPU 0 on your Stinger.  I don't like this part 
because this is hardware implementation specific.  It turns out that 
the low physical memory usually lives near CPU 0 on an Opteron box.  
You can use liblgrp(3LIB) to discover which leaf lgroup contains CPU 0 
and lgrp_affinity_set(3LGRP) to set a strong affinity for that lgroup 
(which will set the home lgroup for the thread to that lgroup).  
Alternatively, you can use processor_bind(2) to bind/unbind to CPU 0.


- Use the MC_LOCK flag to memcntl(2) to lock down the memory for your  
segment if you want the physical memory to stay there (until you 
unlock it)


___
perf-discuss mailing list
perf-discuss@opensolaris.org



___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] Re: Puzzling scheduler behavior

2005-09-16 Thread jonathan chew

David McDaniel (damcdani) wrote:


 Thanks for the feedback, Jonathan. I've got it on my todo list to get
those tools and go spelunking a bit. I can't really say that we have a
performance problem, it's more along the lines of me trying to use the
greatly improved observability tools in Solaris to get a better
understanding of things. In any case, it's pretty much relegated to a
science project right now because we can't ship anything that's not part
of some "official" distribution.
 



Ok.  The tools are pretty easy to use.  If you have any questions, we 
would be happy to help and welcome any feedback on the tools or 
documentation.


When you say that you can't ship anything that's not part of some 
"official" distribution, are you referring to our tools or your software?


I am suggesting using our tools to understand the behavior of your 
application and its interaction with the operating system better and 
determine whether there is a problem or not.  If there is a problem in 
the OS, we can try to fix the default behavior.


As Sasha pointed out, it is our intention to ship our observability 
tools, but we wanted to let the OpenSolaris community try them first to 
see whether they are useful.


Last but not least, we can try running your application if you want.



Jonathan


-----Original Message-
From: jonathan chew [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 09, 2005 6:08 PM

To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

Sorry, I forgot to reply to this sooner.  Yes, I was just 
curious what else was running to see whether we would expect 
your application to be perturbed much.


There could be a load imbalance due to the daemons throwing 
everything off once in awhile.  This could be affecting how 
the threads in your application are distributed across the 
nodes in your NUMA machine.


Each thread is assigned a home locality group upon creation 
and the kernel will tend to run it on CPUs in its home lgroup 
and allocate its memory there to minimize latency and 
maximize performance by default.  
There is an lgroup corresponding to each of the nodes 
(boards) in your NUMA machine.  The assignment of threads to 
lgroups is based on lgroup load averages, so other threads 
may cause the lgroup load average to go up or down and thus 
affect how threads are placed among lgroups.


You can use plgrp(1) which is available on our NUMA 
observability web page at 
http://opensolaris.org/os/community/performance/numa/observability 
to see where your application processes/threads are 
homed.  Then we can see whether they are distributed very 
well.  You can also use plgrp(1) to change the home lgroup of 
a thread, but should be careful because there can be side 
effects as explained in the example referred to below.


There are man pages, source, and binaries for our tools on 
the web page.  I wrote up a good example of how to use the 
tools to understand, observe, and affect thread and memory 
placement among lgroups on a NUMA machine and posted it on 
the web page in 
http://opensolaris.org/os/community/performance/example.txt.


You can also try using the lgrp_expand_proc_thresh tunable 
that Eric suggested last week.


Are the migrations that you are seeing when not running a 
psrset causing a performance problem for your application?




Jonathan


David McDaniel (damcdani) wrote:

   

When using psrsets, the migrations and involuntary context switches go
essentially to zero. As far as "other stuff", not quite sure what you
mean, but this application runs on a dedicated server so there is no
stuff of a casual nature, however there is a lot of what I'll glom into
the category of "support" tasks, i.e. ntp daemons, nscd flushing caches,
fsflush running around backing up pages, etc. Was that what you meant? 
   




 


-Original Message-
From: jonathan chew [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 01, 2005 12:45 PM

To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

David McDaniel (damcdani) wrote:

  

   

Thanks, Jonathan, for the good insights. I'll be digging into the
references you mentioned. Yes, at the end of the day I'm sure binding
to processor sets is part of the plan; having already done so in a
rather rote way I can demonstrate a very dramatic reduction in apparent
CPU utilization, on the order of 25-30%. But before I commit engineers
to casting something in stone I want to make sure I understand the
defaults and the side effects of doing so since it potentially results
in defeating other improvements that Sun has don

Re: [perf-discuss] NUMA ptools and ISM segments

2005-09-19 Thread jonathan chew

Joe Bonasera wrote:


jonathan chew wrote:




It sounds like Eric Lowe has a theory as to why 
madvise(MADV_ACCESS_*) and pmadvise(1) didn't work for migrating your 
ISM segment.  Joe and Nils are experts on the x86/AMD64 HAT and may 
be able to comment on Eric's theory that the lack of dynamic ISM 
unmap is preventing page migration from working.





Ok.. I've missed the start of this thread, having been distracted by
a million other things just now..

Can someone update me on what the problem is and what Eric's theory was?



I've attached Eric Lowe's email.  The relevant part was the following:

segspt_shmadvise() implements the NUMA migration policies for all shared memory
types so next touch (MADV_ACCESS_LWP) should work.  Unfortunately, the x64
HAT layer does not implement dynamic ISM unmap.  Since the NUMA migration
code is driven by minor faults (the definition of next touch depends on that)
I suspect that is why madvise() does not work for ISM on that machine.



Jonathan
--- Begin Message ---
On Thu, Sep 15, 2005 at 11:55:21PM -0400, Marc Rocas wrote:
| 
| The one question I have is whether ISM (Intimate Shared Memory) segments are 
| immune to being coerced to relocate via pmadvise(3c)? I've tried it without 
| success. A quick look at the seg_spt.c code seemed to indicate that when an 
| spt segment is created its lgroup policy is set to LGRP_MEM_POLICY_DEFAULT 
| that will result in randomized allocation for segments greater than 8MB. 
| I've verified as much using the NUMA enhanced pmap(3c) command.

You are correct about the policy.  Since shared memory is usually, uh, shared :)
we spread it around to prevent hot-spotting.

segspt_shmadvise() implements the NUMA migration policies for all shared memory
types so next touch (MADV_ACCESS_LWP) should work.  Unfortunately, the x64
HAT layer does not implement dynamic ISM unmap.  Since the NUMA migration
code is driven by minor faults (the definition of next touch depends on that)
I suspect that is why madvise() does not work for ISM on that machine.

| I have an ISM segment that gets consumed by HW that is limited to 32-bit 
| addressing and thus have a need to control the physical range that backs the 
| segment. At this point, it would seem that I need to allocate the memory 
| (about 300MB) in the kernel and map it back to user-land but I would lose 
| the use of 2MB pages since I have not quite figured out how to allocate 
| memory using a particular page size. Have I misinterpreted the code? Do I 
| have other options?

The twisted maze of code for ISM allocates its memory (ironically, since it's
always locked and hence doesn't need swap) through anon. There is no way
to restrict the segment to a specific PA range or otherwise impact the
allocation path until the pages are already (F_SOFTLOCK) faulted in.  The
NUMA code might help in some cases because the first lgrp happens to fall
under 4G :) so you can probably hack your way through this at the application
layer by changing the shared memory random threshold to ULONG_MAX in 
/etc/system,
binding your application thread to a CPU in the correct lgroup which has memory
only below 4G, and then doing the shmat() from there.  It's a ginormous hack,
but it will get you the results you want. :)
 
Once sysV shared memory is redesigned to not rely on the anon layer doing this
sort of thing in the kernel should become a lot easier.

Being able to specify where user memory ends up in PA space to avoid copying
it on I/O is an RFE we simply have never thought about before.  The lgroup
code is careful to avoid specifics like memory addresses, since lgrp-physical
mappings are very machine specific, so making this sort of thing work would
require adding a whole new set of segment advice ops which are used by the
physical memory allocator itself; page_create_va() takes the segment as one
of its arguments, so if we stuffed the segment PA range advice into the segment
we could dig it up down there and request memory in the "correct" range from
freelists.

-- 
Eric Lowe   Solaris Kernel Development  Austin, Texas
Sun Microsystems.  We make the net work.x64155/+1(512)401-1155
___
perf-discuss mailing list
perf-discuss@opensolaris.org
--- End Message ---
___
perf-discuss mailing list
perf-discuss@opensolaris.org

Re: [perf-discuss] NUMA ptools and ISM segments

2005-09-19 Thread jonathan chew

Joe Bonasera wrote:




The x86 HAT (D)ISM code is tricky. Here's why:

pagesize is 4K, a large page is 2Meg (usually).

So a pagetable covers either 512 4K pages aligned at 2Meg
or 512 2Meg pages aligned at 1Gig.

To share a page table, the (D)ISM segment has to be
either a multiple of 512 4K pages aligned to a 2Meg
boundary or a multiple of 512 2Meg pages aligned to
a 1Gig boundary.

When those alignment / size restrictions aren't met, then
we can't really share the pagetables. So when you attach
to the (d)ISM segment, we copy the values needed for the
mappings into process local pagetables - ie. not really shared.



Thanks for pointing out the alignment and size restrictions for (D)ISM 
segments.  If you specify 0 for the address to shmat(2) and use a 
multiple of 2 meg or 1 gig for the size, should you get a segment that 
is properly aligned?
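
Just so we are looking at the same thing, here is a minimal sketch of 
creating a DISM segment with SHM_PAGEABLE and a size that is a multiple 
of 2MB (the 64MB size is an arbitrary placeholder); whether shmat(2) 
then returns a suitably aligned address is exactly the question above:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int
main(void)
{
    size_t size = 64UL * 1024 * 1024;    /* a multiple of 2MB */
    int shmid;
    void *addr;

    shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid == -1) {
        perror("shmget");
        return (1);
    }

    /* SHM_PAGEABLE requests DISM rather than ISM. */
    addr = shmat(shmid, NULL, SHM_PAGEABLE);
    if (addr == (void *)-1) {
        perror("shmat");
        return (1);
    }
    (void) printf("DISM segment attached at %p\n", addr);
    return (0);
}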




I suspect the minor fault stuff doesn't work right, because to
unload the mappings you'd have to hat_unshare() in all processes
that have the DISM mapped in order to migrate it.



Is hat_pageunload() sufficient?



Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] NUMA ptools and ISM segments

2005-09-26 Thread jonathan chew

Marc Rocas wrote:


All,

I ended up creating a DISM segment, migrating the creating thread to 
the lgrp containing cpu0, madvising, touching the memory, and 
MC_LOCK'ing per Jonathan's suggestions and it all worked out. The 
memory was exclusively allocated from lgrp 1 (cpu0) using 2M pages per 
the output of 'pmap -Ls $pid'.


To not completely hard-code cpu0, I allocated a 32MB ISM segment and 
walked it page by page looking for a PA that is within the first 4GB, 
keeping track of which lgrp it comes from to generate a list of potential 
lgrps to use, and picked the one that has the most free memory to 
allocate from. Once I've collected the information I destroy the ISM 
segment. In my case, it turns out to be cpu0 as it is an Opteron box 
and as Jonathan pointed out previously the lower PA range is 
associated with cpu0. The discovery algorithm is not perfect as it 
depends on ISM segments being randomized but hopefully that's one 
aspect of the implementation that will not change anytime soon.
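
As an illustration of the kind of page walk described above, here is a 
hedged sketch that uses meminfo(2) to map each page of a probe segment 
to its physical address and lgroup; it uses a plain shared anonymous 
mapping as a stand-in for the ISM segment Marc describes, and it treats 
the validity bits loosely (see meminfo(2) for the exact layout):

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Report pages of [base, base+len) whose physical address is below 4GB. */
static void
scan_segment(caddr_t base, size_t len, size_t pagesize)
{
    uint_t reqs[2] = { MEMINFO_VPHYSICAL, MEMINFO_VLGRP };
    uint64_t inaddr, out[2];
    uint_t valid;
    size_t off;

    for (off = 0; off < len; off += pagesize) {
        inaddr = (uint64_t)(uintptr_t)(base + off);
        if (meminfo(&inaddr, 1, reqs, 2, out, &valid) != 0) {
            perror("meminfo");
            return;
        }
        if (valid != 0 && out[0] < (4ULL << 30))
            (void) printf("va %p -> pa 0x%llx, lgroup %llu (below 4GB)\n",
                (void *)(base + off), (unsigned long long)out[0],
                (unsigned long long)out[1]);
    }
}

int
main(void)
{
    size_t len = 32UL * 1024 * 1024;    /* 32MB probe segment, as above */
    caddr_t base = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_SHARED, -1, 0);

    if (base == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    (void) memset(base, 0, len);    /* touch so physical pages exist */
    scan_segment(base, len, (size_t)sysconf(_SC_PAGESIZE));
    return (0);
}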



There may be a slightly better way to allocate memory from the lgroup 
containing the least significant physical memory using 
lgrp_affinity_set(3LGRP), meminfo(2), and madvise(MADV_ACCESS_LWP).


Unfortunately, none of these ways are guaranteed to always allocate 
memory in the least significant 4 gigabytes of physical memory.  For 
example, if the node with the least significant physical memory contains 
more than 4 gigabytes of RAM, physical memory may come from this node, 
but be above the least significant 4 gigabytes.  :-(


I spoke to one of our I/O guys to see whether there is a prescribed way 
to allocate physical memory in the least significant 4 gig for DMA from 
userland.  Solaris doesn't provide an existing way.  The philosophy is 
that users shouldn't have to care and that the I/O buffer will be copied 
to/from the least significant physical memory inside the kernel if the 
device can only reach that far to DMA.  I think that you may be able to 
write a driver that can allocate physical memory in the desired range 
and allow access to it with mmap(2) or ioctl(2).


I'd like to understand more about your situation to get a better idea of 
the constraints and whether there is a viable solution or the need for one.


What is your device?  Can you afford to let the kernel copy the user's 
buffer into another buffer which is below 4 gig to DMA to your device?  
If not, would it make sense to write your own driver?



By the way, the lgrp API and meminfo(2) and the information provided 
by everyone made it pretty easy. Thanks a bunch. Once we start playing 
around with our performance suites I may have further feedback on the 
tools and or questions.



I'm glad that you found the lgroup APIs and meminfo(2) easy to use.

Please let us know whether you have any more questions or feedback.



Jonathan



On 9/16/05, *Marc Rocas* <[EMAIL PROTECTED] 
<mailto:[EMAIL PROTECTED]>> wrote:


Eric, Bart, and Jonathan,

Thanks for your quick replies. See my comments below:

On 9/16/05, *jonathan chew* < [EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]>> wrote:

Marc Rocas wrote:

I've been playing around with the tools on a Stinger box and I think
they're pretty cool!


I'm glad that you like them.  We like them too and think that they are
fun to play with besides being useful for observability and
experimenting with performance.  I hope that our MPO overview, man
pages, and our examples on how to use the tools were helpful.


All of the above were pretty useful in getting up to speed and I
will use them to do performance experiments. I simply need to
bring up our system in the Stinger box in order to run our perf
suites.


The one question I have is whether ISM (Intimate Shared Memory)
segments are immune to being coerced to relocate via pmadvise(3c)?
I've tried it without success. A quick look at the seg_spt.c code
seemed to indicate that when an spt segment is created its lgroup
policy is set to LGRP_MEM_POLICY_DEFAULT that will result in
randomized allocation for segments greater than 8MB. I've verified as
much using the NUMA enhanced pmap(3c) command.


It sounds like Eric Lowe has a theory as to why madvise(MADV_ACCESS_*)
and pmadvise(1) didn't work for migrating your ISM segment.  Joe and
Nils are experts on the x86/AMD64 HAT and may be able to comment on
Eric's theory that the lack of dynamic ISM unmap is preventing page
migration from working.

I'll see whether I can reproduce the problem.

Which version of Solaris are you using?


Solaris 10 GA with no kernel patches applied.


I have an ISM segment that gets consumed by HW that is limited to
32-bit addressing and thus have a need

Re: [perf-discuss] NUMA ptools and ISM segments

2005-10-04 Thread Jonathan Chew
Marc Rocas wrote on 09/27/05 21:27:
> 
> On 9/26/05, *jonathan chew* <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
> 
> 
> There may be a slightly better way to allocate memory from the lgroup
> containing the least significant physical memory using
> lgrp_affinity_set(3LGRP), meminfo(2), and madvise(MADV_ACCESS_LWP).
> 
> Unfortunately, none of these ways are guaranteed to always allocate
> memory in the least significant 4 gigabytes of physical memory.  For
> example, if the node with the least significant physical memory contains
> more than 4 gigabytes of RAM, physical memory may come from this node,
> but be above the least significant 4 gigabytes.  :-(
> 
> 
> Fortunately,  we control the HW our systems are deployed on and can
> enforce a 4GB RAM limit if required.

Ok.  That definitely makes things easier.


> I spoke to one of our I/O guys to see whether there is a prescribed way
> to allocate physical memory in the least significant 4 gig for DMA from
> userland.  Solaris doesn't provide an existing way.  The philosophy is
> that users shouldn't have to care and that the I/O buffer will be copied
> to/from the least significant physical memory inside the kernel if the
> device can only reach that far to DMA.  I think that you may be able to
> write a driver that can allocate physical memory in the desired range
> and allow access to it with mmap(2) or ioctl(2).
> 
> 
> We already have such a driver but have not found a way to force the use
> of 2M pages!  Is there  a new DDI interface to request large page size?

I'm not sure, but can try to find out.

What does the driver use to allocate the memory (eg. ddi_dma_mem_alloc(9F))?
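
For context, if the driver allocates its buffer through the DDI DMA 
routines, the usual way to keep allocations below 4GB is through the DMA 
attributes passed when the handle is allocated; the fragment below is 
only a hedged illustration of that one aspect (the non-address fields 
are generic placeholders), not a claim about how your driver is actually 
written, and it says nothing about page size:

#include <sys/ddi.h>
#include <sys/sunddi.h>

/* Restrict DMA-able memory to physical addresses below 4GB. */
static ddi_dma_attr_t low4g_dma_attr = {
    DMA_ATTR_V0,              /* dma_attr_version */
    0x0000000000000000ULL,    /* dma_attr_addr_lo */
    0x00000000FFFFFFFFULL,    /* dma_attr_addr_hi: cap at 4GB - 1 */
    0x00000000FFFFFFFFULL,    /* dma_attr_count_max */
    0x1000,                   /* dma_attr_align (placeholder) */
    1,                        /* dma_attr_burstsizes (placeholder) */
    1,                        /* dma_attr_minxfer */
    0x00000000FFFFFFFFULL,    /* dma_attr_maxxfer */
    0x00000000FFFFFFFFULL,    /* dma_attr_seg */
    1,                        /* dma_attr_sgllen */
    1,                        /* dma_attr_granular */
    0                         /* dma_attr_flags */
};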


> I'd like to understand more about your situation to get a better idea of
> the constraints and whether there is a viable solution or the need
> for one.
> 
> What is your device?  Can you afford to let the kernel copy the user's
> buffer into another buffer which is below 4 gig to DMA to your device?
> If not, would it make sense to write your own driver?
> 
> 
> Not really.  We buffer data up and need to have it DMA in real-time to
> our device which further processes it and passes on the processed data to
> another machine. The window of time once we have committed to delivering
> the data is strictly enforced and failure to do so effectively shuts
> down the other system. Way back in SunOS 5.4, we went as far as writing
> a pseudo driver to SOFTLOCK memory as we found that it was not enough to
> mlock() memory since we took page faults on the corresponding TTEs. As I
> noted previously in the beginning of this thread, we use our own version
> of physio() that assumes properly wired down memory and thus differs
> from the stock version in that it does not bother with the locking logic
> at all.


Ok.  I see.


> By writing our own device driver, do you mean one to export 4GB PA range
> to user-land or our own segment driver?

I mean whether your driver can allocate the DMA buffer in the right
place and provide access to it through mmap(2) or ioctl(2).  It sounds
like you have a driver like that already, but want the buffer to be on
large pages in the user application.

Is that right?



Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


[perf-discuss] MPO design review

2006-06-28 Thread Jonathan . Chew
I have to give a presentation to update our Platform Software 
Architecture Review Committee (PSARC) on MPO and thought that this might 
be a good time to let others in the OpenSolaris (performance) community 
comment too.


The presentation is based on the MPO overview in:

   http://www.opensolaris.org/os/community/performance/mpo_overview.pdf

If you have seen that already, this presentation has an issues and 
improvements slide for lgroups, initial placement, scheduling, VM, APIs, 
and tools and then has some new slides on the issues affecting all of 
MPO and on future work.  This *new* presentation is posted at:


   http://www.opensolaris.org/os/community/performance/numa/mpo_update.pdf
or
   http://www.opensolaris.org/os/community/performance/numa/mpo_update.sxi

NOTE: It contains some pictures that older PDF readers may not display 
correctly, so I have included the StarOffice version which does display 
correctly just in case.



I welcome any questions, comments, or feedback regarding the past, 
present, or future of MPO and its design and implementation.




Jonathan
___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] lgrp kstats

2006-12-11 Thread Jonathan . Chew

Henrik Loef wrote:


Hi,

I'm trying to repeat some old experiments I did a couple of months ago on 
an SF15K machine. We used kstats (module lgrp: "pages migrated to", "pages 
migrated from") to measure the number of pages migrated using madvise(). 
The domain we used for our previous experiments has been shut down, so now 
we are using another domain where these kstats seem to be deactivated. 


Does anybody know for which releases of Solaris these counters are enabled?

Note: we were involved in the beta program so the release we used might have 
been a beta.
 



I think that the lgroup kstats for pages migrated were integrated into 
Solaris with the changes for madvise(MADV_ACCESS_*) in Solaris 9 Update 2.



Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] Project sponsorship request: Tesla

2007-06-05 Thread Jonathan Chew

Eric Saxe wrote:


I'd like the ask the OpenSolaris performance community for sponsorship 
of Project Tesla.


http://www.opensolaris.org/os/project/tesla

The Tesla project seeks to provide OpenSolaris with a platform 
independent power management policy architecture,
bringing power awareness to various kernel subsystems (including the 
dispatcher) that interact with power manageable system resources. I 
believe the performance community should have an interest in this 
project, since one of the project's primary goals is implementation of 
a "default" policy that seeks to deliver maximum system performance 
while consuming no more power than is necessary to do so.


Yes, I think that we should definitely sponsor this.


Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] Project proposal: "Solaris Enhancements for AMD-based Platforms"

2007-10-30 Thread Jonathan Chew
Ostrovsky, Boris wrote:
> I would like to propose creation of a new project titled "Solaris
> Enhancements for AMD-based Platforms".
>
> The project will address various features that are specific to platforms
> based on AMD processors, such as
>   - IOMMU support
>   - NUMA topology, particularly how it affects IO performance
>   - Observability (performance counters, instruction-based
> sampling)
>   - Power management
>   - RAS features
>   - New instruction support, new CPUID features
>
> Since what's described above covers fairly diverse range of subjects,
> the proposed project will serve as an umbrella for sub-projects, each of
> them covering a particular area related to improving Solaris behavior on
> systems built around AMD processors (as well as chipsets and graphics
> components)
>
> I think this project would be of interest to a number of OpenSolaris
> communities but I am asking the performance community for sponsorship as
> it appears to be the most relevant. 
>   

+1 from me too.


Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] NUMA and interconnect transfers

2007-12-11 Thread Jonathan Chew
Rafael Vanoni wrote:
> Hey everyone
>
> Is the kernel aware of the status of the interconnect between different 
> NUMA nodes ?
>   

No, not currently.  It just assumes that there is some interconnect 
between the nodes and may know the latency between them when the system 
is not loaded.


> For instance, when an app is transferring data from one node to another, 
> saturating the interconnect between these two. Are we aware of this ?
>   

No.


> I haven't seen any code that collects telemetry from the interconnect 
> hardware. It sounds like something that we should be looking at.
>   

There may be hardware performance counters that provide some 
observability for this.  I'm not sure though and it definitely would 
depend on the processor and platform.

This is definitely something that I have been thinking about recently.  
I do think that it may be useful since we want to be able to use 
machines efficiently (eg. for power, etc.) and provide good performance 
by default even as the hardware is becoming more 
sophisticated/complicated with CMT and NUMA.


Jonathan


___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] NUMA and interconnect transfers

2007-12-11 Thread Jonathan Chew
Rayson Ho wrote:
> On Dec 11, 2007 5:53 PM, Jonathan Chew <[EMAIL PROTECTED]> wrote:
>   
>> No, not currently.  It just assumes that there is some interconnect
>> between the nodes and may know the latency between them when the system
>> is not loaded.
>> 
>
> Jonathan, I just found out that you are the author of the MPO (Memory
> Placement Optimization) presentation:
>
> http://www.opensolaris.org/os/community/performance/mpo_overview.pdf
>   

Yes.  If you're interested, there is also an update to that presentation at:

http://www.opensolaris.org/os/community/performance/numa/mpo_update.pdf



>> There may be hardware performance counters that provide some
>> observability for this.  I'm not sure though and it definitely would
>> depend on the processor and platform.
>> 
>
> I was looking at the same thing too, but from an application
> developer's perspective. I then talked to my AMD friend, and he told
> me that CodeAnalyst provides a way to measure the HT bandwidth:
>
>  "HyperTransport link x transmit bandwidth" -- HT links 0, 1, and
> 2  (i.e., whether they're used for cache-coherence inter-processor
> traffic, or used for non-coherent I/O)
>
> And I read this document a while ago:
> "Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA
> Multiprocessor Systems Application Note"
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40555.pdf
>   

Thanks for the pointers.  I had the funny feeling that Opteron would 
have counters for this, but I'm not so sure about SPARC and what will be 
available on Intel's Nehalem processors with QuickPath.


Jonathan
___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] NUMA and interconnect transfers

2007-12-12 Thread Jonathan Chew
Rayson Ho wrote:
> On Dec 11, 2007 9:33 PM, Jonathan Chew <[EMAIL PROTECTED]> wrote:
>   
>> Yes.  If you're interested, there is also an update to that presentation at
>> 
>
> I think I've read that one too, and the chapter in the Solaris
> Internals book... all 3 are interesting :-D
>
>
>   
>> Thanks for the pointers.  I had the funny feeling that Opteron would
>> have counters for this, but I'm not so sure about SPARC and what will be
>> available on Intel's Nehalem processors with QuickPath.
>> 
>
> Don't know about what QuickPath will have (hmm... "QuickPath" seems to
> be new to me, as I was so used to the old "CSI" name). However, as it
> will support Nehalem as well as Tukwila, it may be interesting to see
> what is currently available for Itanium-2:
>   

Yes, that seems likely.  I believe that Nehalem may very well inherit 
some CPU hardware performance counters, etc. from existing Intel processors.


>  ... we use the PMU to capture two different types of memory access
> data -- long latency loads and data translation lookaside buffer
> (DTLB) misses -- section 3: Profile Generation
>
> "Hardware Profile-guided Automatic Page Placement for ccNUMA Systems"
> http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/ppopp06.pdf
>
> And the Gelato ICE presentation:
> http://www.ice.gelato.org/apr07/pres_pdf/gelato_ICE07apr_pageplacement_mueller_ncsu.pdf
>   

Thanks for the pointer to the paper and presentation.  I only glanced 
through them, but they seem very similiar in spirit to page migration 
work that was done in the 1990's by some friends at Rochester, Stanford, 
and SGI.  In fact, I thought that IRIX had some sort of automatic page 
migration which may have used hardware performance counters to help 
decide what to migrate where.


Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] Contributor nomination for Ashok Raj

2008-02-26 Thread Jonathan Chew
Sherry Moore wrote:
> I'd like to nominate Ashok Raj for a contributor grant in our community.
>
> Ashok has made (and continues to make) tremendous contributions to the
> perf community sponsored OpenSolaris Intel-platform Project.  He not
> only provides code drops for new features and performance improvements,
> but also participates in design reviews and code reviews.  The projects
> and RFEs he contributed to include (but not limited to)
> - Intel Microcode Update Support
> - monitor/mwait implementation
> - CPUID support for Penryn and Nehalem
> and many more.
>
> He is an active participant on various OpenSolaris discussion lists,
> and has become a core member of the Intel-platform project team.
> Recognition for his contributions is also long overdue. :)
>   

+1


Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] Contributor nomination for Robert Kasten

2008-02-26 Thread Jonathan Chew
Bill Holler wrote:
> I'd like to nominate Robert Kasten for a contributor grant in our community.
>
> Robert has made (and continues to make) tremendous contributions to the
> perf community sponsored OpenSolaris Intel-platform Project.
> He provides code drops for performance improvements and participates in
> design reviews and code reviews.  Robert also mentors other engineers
> working on Intel/Sun performance projects.  The projects and RFEs he
> contributed to include (but not limited to)
> - libc
> - kernel copy primitives
> - CPUID support and bug fixes for Core 2 Duo, Penryn and Nehalem 
> and many more.
>
> He is an active participant on various OpenSolaris discussion lists,
> and has become a core member of the Intel-platform project team.
> Recognition for his contributions is also long overdue.  :) 
>   

+1


Jonathan


___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] core contributer nomination for Sherry Moore

2008-02-26 Thread Jonathan Chew
Eric Saxe wrote:
> I'd like to nominate Sherry Moore for a core contributer grant in our 
> community.
>
> Sherry is the tech lead for the community sponsored "Enable/Enhance 
> Solaris support for Intel Platform" project, and is a Senior Staff 
> Engineer in the Solaris Kernel group. Her contributions to Solaris / 
> OpenSolaris performance have been many, but to list a few:
>
> - Reducing Cache Pollution with Non-temporal Access, which improved 
> write performance by 80-120% on x86 platforms.
> - Enabled ON compilation with Sun Studio 10 which improved build time.
> - Fast Reboot (ongoing) From "reboot" command to banner in 5 seconds.
>
> Other notable accomplishments:
>
> - Intel Microcode Update Support
> - "save-args" on amd64, so that we can have the stack arguments 
> available via the debugger
> - Worked on the port of Solaris to amd64
> - Worked extensively on Solaris Dynamic Reconfiguration (DR) support
>   

+1


Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] contributer nomination for Aubrey Li

2008-02-26 Thread Jonathan Chew
Eric Saxe wrote:
> I'd like to nominate Aubrey Li for a contributer grant in our community.
>
> Aubrey has made (and continues to make) tremendous contributions to the 
> perf community sponsored OpenSolaris Tesla project though his code and 
> design contributions to the OpenSolaris PowerTop port, deep C-state 
> support, and more. He is an active participant on various OpenSolaris 
> discussion lists, and has become a core member of the Tesla project 
> team. Recognition for his contributions is long overdue. :)
>   

+1


Jonathan

___
perf-discuss mailing list
perf-discuss@opensolaris.org


Re: [perf-discuss] Core Contributer nominations for Jim Mauro, Adrian Cockcroft

2008-02-27 Thread Jonathan Chew
Eric Saxe wrote:
> Great suggestions. Richard is already a core contributer in the community.
> I would like to nominate Jim Mauro and Adrian Cockcroft for core 
> contributer grants.
>
> Both Adrian and Jim have long standing track records of great 
> contributions in the area of Solaris performance, and in fact have 
> written several books about it. :)
>   

+1


Jonathan

>
> Brandorr wrote:
>   
>> I am not a contributor to perf, but what about:
>>
>> Adrian
>> Jim Mauro
>> Richard McDougal
>>
>> for starters.
>>
>> -Brian
>>
>> P.S. - Please forgive spelling.
>>
>> On Tue, Feb 26, 2008 at 6:46 PM, Eric Saxe <[EMAIL PROTECTED]> wrote:
>>   
>> 
>>> Greetings,
>>>
>>>  You may have noticed that other communities have been (and are) defining
>>>  roles and grants as outlined in section 3.3 of the OpenSolaris
>>>  constitution, and the time has arrived for us (the performance
>>>  community) to do the same. This URL:
>>>
>>>  http://www.opensolaris.org/os/community/performance/roles
>>>
>>>  ...provides some background, and outlines a nomination and voting
>>>  process for contributer and core contributer grants in our community.
>>>
>>>  Unfortunately, there simply isn't enough time for any new Core
>>>  Contributers grants to take effect (in terms of voting) for the upcoming
>>>  OpenSolaris election, but I believe it's still worth kicking off this
>>>  process now so that deserving folks in our community can be recognized
>>>  for their ongoing contributions to OpenSolaris performance.
>>>
>>>  So with that, i'd like to open up nominations
>>>
>>>  Thanks,
>>>  -Eric
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>  ___
>>>  perf-discuss mailing list
>>>  perf-discuss@opensolaris.org
>>>
>>> 
>>>   
>>
>>   
>> 
>
> ___
> perf-discuss mailing list
> perf-discuss@opensolaris.org
>
>   

___
perf-discuss mailing list
perf-discuss@opensolaris.org


[perf-discuss] CMT project

2008-08-15 Thread Jonathan Chew
I would like to get sponsorship from the OpenSolaris performance 
community to host a CMT project which will focus on observability, 
performance enhancements, and potentially more in OpenSolaris for Chip 
Multi-Threaded (CMT) processors (including SMT, CMP, etc.).

Specifically, the project will try to do the following in OpenSolaris:

- Further develop a processor group abstraction for capturing the CMT 
processor sharing relationships of performance relevant hardware 
components (eg. execution pipeline, cache, etc.)

- Create an interface for determining which CPUs share what performance 
relevant hardware and the characteristics of these performance relevant 
hardware components

- Add more performance optimizations to Solaris for CMT (eg. scheduling, 
I/O, etc.)

- Improve load balancing for maximizing performance and potentially 
minimizing power consumption

- Create APIs to facilitate performance optimizations for CMT

- Make changes needed to make all of the above work well with virtualization

- Improve upon the existing Solaris CMT enhancements

- Add support for new CMT hardware as needed

- Address any OpenSolaris CMT issues as they arise


In the process of doing all of this, I'm hoping that the project will 
facilitate collaboration in this area as well as a better understanding 
and appreciation of  CMT and OpenSolaris.


Jonathan




[perf-discuss] NUMA project

2008-08-15 Thread Jonathan Chew
I would like to get sponsorship from the OpenSolaris performance 
community to host a NUMA project.

The "Memory Placement Optimization" feature in Solaris has been around 
since Solaris 9 and has had web pages in the OpenSolaris performance 
community since it started before OpenSolaris projects existed formally.

I would like to formalize its existence as a project in the performance 
community since there is more work to be done for NUMA.  Specifically, 
the project will try to do the following in OpenSolaris:

- Make MPO aware of I/O device locality

- Add observability of *kernel* thread and memory placement

- Optimize kernel thread and memory placement to improve performance in 
general and for NUMA I/O

- Add dynamic lgroup load balancing to improve performance and 
potentially minimize power consumption

- Enhance MPO to work well with virtualization and vice versa

- Add support for new NUMA machines

- Improve the existing framework as needed

- Address any OpenSolaris NUMA issues as they arise


Also, formalizing NUMA as an OpenSolaris project will enable the project 
to have space on opensolaris.org for sharing code, ideas, etc. which I'm 
hoping will facilitate collaboration in this area as well as a better 
understanding and appreciation of NUMA and OpenSolaris.



Re: [perf-discuss] CMT and NUMA proposals

2008-08-15 Thread Jonathan Chew

Thanks for the pointers.

Here are my revised proposals.


Jonathan


[EMAIL PROTECTED] wrote:

Hi Jonathan,
I'm in favor of both of these proposals.  However, I think they're
incomplete, at least according to the project instantiation guidelines.

Would you amend these proposals to include the participants for each
project, information about each project's mailing list, and the
consolidation that the projects are targeting?

The project instantiation policy has additional recommendations that may be of
interest here too:

http://opensolaris.org/os/community/ogb/policies/project-instantiation.txt

Thanks,

-j


  


I would like to get sponsorship from the OpenSolaris performance community to 
host a CMT project which will focus on observability, performance enhancements, 
and potentially more in OpenSolaris for Chip Multi-Threaded (CMT) processors 
(including SMT, CMP, etc.).  This project will target the ON consolidation 
although it may affect others.

Specifically, the project will try to do the following in OpenSolaris:

- Further develop a processor group abstraction for capturing the CMT processor 
sharing relationships of performance relevant hardware components (eg. execution 
pipeline, cache, etc.)

- Create an interface for determining which CPUs share what performance 
relevant hardware and the characteristics of these performance relevant 
hardware components

- Add more performance optimizations to Solaris for CMT (eg. scheduling, I/O, 
etc.)

- Improve load balancing for maximizing performance and potentially minimizing 
power consumption

- Create APIs to facilitate performance optimizations for CMT

- Make changes needed to make all of the above work well with virtualization

- Improve upon the existing Solaris CMT enhancements

- Add support for new CMT hardware as needed

- Address any OpenSolaris CMT issues as they arise


In the process of doing all of this, I'm hoping that the project will 
facilitate collaboration in this area as well as a better understanding and 
appreciation of  CMT and OpenSolaris.

The initial project team members are:

- [EMAIL PROTECTED]

- [EMAIL PROTECTED] (project leader)

- [EMAIL PROTECTED]

- [EMAIL PROTECTED]

- [EMAIL PROTECTED]


I would like to get sponsorship from the OpenSolaris performance community to 
host a NUMA project.  This project will target the Operating System/Networking 
(ON) consolidation although it may affect others.

The "Memory Placement Optimization" feature in Solaris has been around since 
Solaris 9 and has had web pages in the OpenSolaris performance community since 
it started before OpenSolaris projects existed formally (see 
http://opensolaris.org/os/community/performance/numa/ for more info).

I would like to formalize its existence as a project in the performance community 
since there is more work to be done for NUMA.  Specifically, the project will 
try to do the following in OpenSolaris:

- Make MPO aware of I/O device locality

- Add observability of *kernel* thread and memory placement

- Optimize kernel thread and memory placement to improve performance in general 
and for NUMA I/O

- Add dynamic lgroup load balancing to improve performance and potentially 
minimize power consumption

- Enhance MPO to work well with virtualization and vice versa

- Add support for new NUMA machines

- Improve the existing framework as needed

- Address any OpenSolaris NUMA issues as they arise


Also, formalizing NUMA as an OpenSolaris project will enable the project to 
have space on opensolaris.org for sharing code, ideas, etc. which I'm hoping 
will facilitate collaboration in this area as well as a better understanding 
and appreciation of NUMA and OpenSolaris.

The initial project team members are:

- [EMAIL PROTECTED]

- [EMAIL PROTECTED] (project leader)

- [EMAIL PROTECTED]

- [EMAIL PROTECTED]

- [EMAIL PROTECTED]

- [EMAIL PROTECTED]

Re: [perf-discuss] CMT and NUMA proposals

2008-08-15 Thread Jonathan Chew
[EMAIL PROTECTED] wrote:
> A plus one to both.
>
> I realize it may not be feasible now, but long term it would be great if
> we could move the sun-internal e-mail lists that exist around these
> projects into the open.  If we get enough projects using perf-discuss
> as a default mailing list it's going to be confusing for everyone.
>   

It is the *goal* for us to get our internal Sun project aliases as well 
as our development out into opensolaris.org.  We can definitely explore 
making separate OpenSolaris email aliases for CMT and NUMA once the 
projects are approved.


> Thanks for taking the time to get these projects proposed and into
> OpenSolaris.
>   

Sure.  No problem.  It is long overdue.



Jonathan



Re: [perf-discuss] CMT project

2008-08-18 Thread Jonathan Chew
Elad Lahav wrote:
> Hi Jonathan,
>
> I am currently looking into different affinity policies on the Niagara 
> architecture, including the application threads and interrupt handlers. 

We are interested in affinity scheduling too.


> An important aspect of this work is the hierarchy of processing units 
> (chips, cores, threads) with respect to cache sharing, resource 
> utilisation, etc. I believe this is quite relevant to what you have in mind.

I'm not sure whether you know much about what we have available in 
OpenSolaris already, but we have a "Processor Group (PG)" abstraction 
inside the kernel for keeping track of which CPUs share performance 
relevant hardware components (eg. execution pipeline, cache, etc.) now.  
The Processor Groups are organized into a hierarchy (tree) where each 
leaf corresponds to each CPU (eg. strand) and its ancestors are groups 
of processors ordered from the ones that share the most with the CPU 
(eg. CPUs that share execution pipeline with the leaf CPU) to the ones 
that share the least (eg. all the CPUs in the machine).
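
(The PG hierarchy itself is a kernel-internal structure, but for anyone following along, here is a rough sketch of walking the analogous user-visible lgroup hierarchy with liblgrp(3LIB). It only illustrates the general locality tree-walk pattern, not the PG interfaces being discussed; compile with "cc walk_lgrps.c -llgrp".)

/*
 * Sketch: recursively print the lgroup hierarchy and how many CPUs
 * each lgroup contains (directly or below it).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/lgrp_user.h>

static void
walk(lgrp_cookie_t cookie, lgrp_id_t lgrp, int depth)
{
	int nchildren, ncpus, i;
	lgrp_id_t *children;

	/* Passing NULL/0 just asks for the count. */
	ncpus = lgrp_cpus(cookie, lgrp, NULL, 0, LGRP_CONTENT_HIERARCHY);
	(void) printf("%*slgroup %d: %d CPU(s)\n", depth * 2, "",
	    (int)lgrp, ncpus);

	nchildren = lgrp_children(cookie, lgrp, NULL, 0);
	if (nchildren <= 0)
		return;

	children = malloc(nchildren * sizeof (lgrp_id_t));
	if (children == NULL)
		return;
	nchildren = lgrp_children(cookie, lgrp, children, nchildren);
	for (i = 0; i < nchildren; i++)
		walk(cookie, children[i], depth + 1);
	free(children);
}

int
main(void)
{
	/* LGRP_VIEW_OS shows the whole machine, not just the caller's pset. */
	lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

	if (cookie == LGRP_COOKIE_NONE) {
		perror("lgrp_init");
		return (1);
	}
	walk(cookie, lgrp_root(cookie), 0);
	(void) lgrp_fini(cookie);
	return (0);
}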


> I'll be more than willing to co-operate with you, or anyone else in 
> the OpenSolaris community, on this subject. 

That sounds great!

Can you please tell me more about your work or point me at any relevant 
papers or docs?



Jonathan




Re: [perf-discuss] Q: Rehome thread to new lgroup?

2008-12-10 Thread Jonathan Chew
Manfred Mücke wrote:
> Hi,
>
> I want to enforce migration of thread+allocated memory to another lgroup but 
> failed for some threads.
>   

Ok.  You can give MADV_ACCESS_LWP as the advice to madvise(3C) for the 
memory that you want to migrate.  This advice tells the Solaris kernel 
that the next thread to touch the memory will use it a lot and the 
kernel will migrate the memory near the thread that next touches that 
memory (ie. in or near the thread's home lgroup).

If you don't want to change your application to do this, you can use 
pmadvise(1) instead, specifying the same advice for your specified 
virtual memory.
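
For reference, here is a minimal sketch of the next-touch approach described above, using madvise(3C) with MADV_ACCESS_LWP on an anonymous mapping. The mapping, its size, and the memset() touch are made up purely for illustration:

/*
 * Next-touch sketch: tell the kernel that the next thread to touch this
 * range will use it heavily (MADV_ACCESS_LWP), then touch it from the
 * thread that should end up with the memory in/near its home lgroup.
 */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 64 * 1024 * 1024;	/* 64MB example buffer */
	caddr_t buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return (1);
	}

	/* Ask for next-touch placement before the "owning" thread touches it. */
	if (madvise(buf, len, MADV_ACCESS_LWP) != 0)
		perror("madvise");

	/*
	 * The thread that should own the memory now touches it; the kernel
	 * may then place/migrate the pages near that thread's home lgroup.
	 */
	(void) memset(buf, 0, len);
	return (0);
}

As noted above, pmadvise(1) can apply the same advice to an existing process without any code change.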


> I understand from the "Memory and Thread Placement Optimization Developer's 
> Guide", that there is no hard memory affinity in Solaris, but only the 
> possibility to define some level of affinity to the home lgroup. Therefore, 
> prior to any kind of memory migration it seems necessary to move the home 
> lgroup to the lgroup where the memory should migrate to. The lgrp_home(3LGRP) 
> man page defines the  "home lgroup" of a  thread as the "lgroup  with  the 
> strongest  affinity  that  the thread can run on". 
>
> A sequence for memory migration was already lined out in 
> http://www.opensolaris.org/jive/thread.jspa?messageID=14792:
>
> - Change the thread in question's home lgroup to the CPU where you want the 
> memory allocated,
> - use 'pmadvise -o heap=lwp_access' on the process
> - the memory should now get allocated in the newly homed lgrp (check with 
> 'pmap -L' and lgrpinfo (if the allocation is large enough to notice)).
> - rehome the lgrp to another CPU or just bind it.
>
> I tried, but for some threads, I got surprising results:
>
>   
>> ./plgrp -a all 27215
>> 
>      PID/LWPID    HOME    AFFINITY
>      27215/12        2    5/strong,0-4,6-24/none
>
> If the home lgroup is defined as the lgroup with the strongest affinity, 
> isn't the output above somewhat contradictory?
>   

Yes, this means that the home lgroup of the specified thread is 2 even 
though it has a strong affinity for lgroup 5.


>> ./plgrp -F -H 5 27215
>> 
>      PID/LWPID    HOME
>      27215/12     => 2
>
> This thread seems to be resistant to plgrp's attempts to assign a new home 
> lgroup (option -H) to it.
>
> This is a test program. For some other threads, I was able to freely rehome 
> them to any lgroup in the range 1..8 (the leaf lgroups). Has anyone some 
> further suggestions on what could prevent reassigning the home lgroup?
>   

If the thread is bound to a CPU that isn't in lgroup 5 or bound to a 
processor set that does not overlap lgroup 5, then the thread cannot be 
rehomed to lgroup 5 even though it may have a strong affinity for lgroup 5.

You should be able to use pbind(1M) and psrset(1M) to see whether your 
thread is bound.
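
If you would rather do this from the program instead of with plgrp(1), here is a rough sketch of what I understand to be the equivalent liblgrp call. The target lgroup ID (5) is just borrowed from your output above, and as noted, a CPU or processor set binding that excludes that lgroup can still prevent the rehoming. Link with -llgrp.

/*
 * Sketch: ask for a strong affinity to an lgroup from within the thread
 * itself, which (when the thread isn't bound elsewhere) rehomes it, then
 * read back the home lgroup.  The lgroup ID 5 is only an example.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int
main(void)
{
	lgrp_id_t target = 5;	/* example lgroup from the plgrp output above */

	/* Strong affinity asks the kernel to make 'target' the home lgroup. */
	if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) != 0)
		perror("lgrp_affinity_set");

	/*
	 * If the thread is bound (pbind/psrset) to CPUs outside the target
	 * lgroup, the home may still come back unchanged.
	 */
	(void) printf("home lgroup is now %d\n",
	    (int)lgrp_home(P_LWPID, P_MYID));
	return (0);
}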


> My system is a Sun Fire X4600 with eight dual-core Opterons. Lgroups 1..8 are 
> identical (i.e. same amount of memory and CPU resources per lgroup), except 
> for the CPU IDs:
>   
>> lgrpinfo.pl -a
>> 
> [..]
> lgroup 1 (leaf):
> Children: none, Parent: 9
> CPUs: 0 1
> Memory: installed 3584 Mb, allocated 985 Mb, free 2599 Mb
> Lgroup resources: 1 (CPU); 1 (memory)
> Load: 0.00723
> Latency: 51
> [..]
>   

If you're still having problems after trying what I suggested above, you 
should include *all* of the output from "lgrpinfo -Ta" so we can see 
where lgroup 5 is in the topology and what it contains.



Jonathan



Re: [perf-discuss] capacity planning tools - status

2009-01-20 Thread Jonathan Chew
Stefan Parvu wrote:
>
>> For papers and books on this, start with the
>> Teamquest site and view their webinar
>> http://teamquest.com/resources/webinars/display/30/index.htm
>> and if you don't mind math, follow up with Neil
>> Gunther's site, http://www.perfdynamics.com/
>> 
>
> Right. I have been in the GCaP class taught by Neil Gunther, 
> and my questions come partly from the feedback from the class 
> and partly from my side:
>
>  - corestat integrated into Solaris. Customers have asked about 
> this as well, and why it is not integrated into (Open)Solaris.
>   

corestat isn't integrated into Solaris because it hasn't been 
productized and has some issues (eg. must be run as root, prevents 
anyone else from using the CPU performance counters while it is running, 
only works for Niagara processors, may not be sufficient for doing 
capacity planning and performance analysis, etc.).  I am investigating 
making something better.



Jonathan





Re: [perf-discuss] Expiring Core Contributor Grants

2009-02-05 Thread Jonathan Chew
Eric Saxe wrote:
> johan...@sun.com wrote:
>   
>> I'm going to begin this process by voting in favor of renewing akolb,
>> barts, esaxe, johansen, jjc, mpogue, and dp.  If andrei or rmc are still
>> interested in participating, I would support renewing their grants.  I
>> simply haven't seen any traffic from them on our lists lately.  It's
>> possible this is simply an oversight on my part, and if so I apologize.
>>   
>> 
> +1 as well for akolb, barts, esaxe, johansen, jjc, mpogue, and dp. I 
> don't know if andrei or rmc are still interested in participating. If 
> they are, I would support renewal as well.
>   

+1. Me too.


Jonathan



Re: [perf-discuss] Contributor grant nomination for Chad Mynhier

2009-02-06 Thread Jonathan Chew
+1


Jonathan

Eric Saxe wrote:
> I'd like to put forth a contributor grant nomination for Chad Mynhier, 
> for his recent contributions in the area of performance observability, 
> including contributing fixes for:
>
> PSARC 2007/598 ptime(1) Improvements
> 6234106 ptime should report microseconds or nanoseconds instead of just 
> milliseconds
> 4532599 ptime should report resource usage statistics from /proc for running 
> processes
> 6749441 intrstat(1M) shows zeroed values after suspend/resume
> 6786517 /usr/demo/dtrace/qtime.d tracks enqueue/dequeue times incorrectly
> 6738982 Representative thread after DTrace stop() action is incorrect
> 6737974 DTrace probe module globbing doesn't work
> 6730287 tst/common/printf/tst.str.d needs to be updated for ZFS boot
>
>
> He quite clearly has been making significant contributions, and IMHO 
> recognition for his contributions is well deserved.
> (Thank you Chad)
>
> -Eric


Re: [perf-discuss] Expiring Core Contributor Grants

2009-02-12 Thread Jonathan Chew

Eric Saxe wrote:
> Eric Saxe wrote:
>> Also, in the past I've acted as de facto facilitator for this 
>> community...but we've never had a vote around that (at least not one 
>> that I remember).
>>
>> Are you interested Krister? If so, I nominate you. :)
>
> Krister has accepted my nomination for community facilitator. All 
> those in favor?...


+1


Jonathan



Re: [perf-discuss] V890 with US-IV+ : which core should interrupts be delivered to?

2009-06-19 Thread Jonathan Chew

johan...@sun.com wrote:
> On Fri, Jun 19, 2009 at 05:52:02PM +0200, Nils Goroll wrote:
>> I am trying to reduce context switches on a V890 with 16 cores by 
>> delegating one core to interrupt handling (by setting all others to 
>> nointr using psradm).
>
> The best way to do this is to create processor sets, with the set that
> handles OS tasks containing processors that handle interrupts and the
> other set containing the nointr processors.  If you're really
> concerned about context switching, you might also benefit from processor
> binding, which only lets a process run on a particular cpu.
>
> Lately, the scheduler guys have made a bunch of improvements that allow
> processes to be more sticky, though.  (Less likely to migrate from one
> CPU to another, but that is different from a context switch.)  What
> version of the OS are you running?
>
>> Does the hardware design of this machine imply that (a) particular 
>> core(s) is best suited for this task?
>
> Almost any multiprocessor machine that has NUMA characteristics will
> benefit from having the interrupts placed on the CPU that's closest to
> the hardware that connects to the bus that the interrupts are coming
> from.  From a discussion long ago, I remember the v890 has some NUMA
> characteristics, but I don't honestly know if we treat it like a NUMA
> machine.  Jonathan Chew, who's lurking on the list somewhere, might have
> more details.


Lurking?!  Who me?  I'm innocent.  ;-)

I think that Paul already replied and explained that the V890 doesn't 
have NUMA I/O, so it doesn't matter which CPU is used for the interrupt 
with respect to the device for NUMA.  However, it may matter where you 
assign the interrupt for NUMA memory locality and CMT cache sharing 
(since each UltraSPARC IV+ processor chip has two single stranded cores 
that share cache)


If there are (user or kernel) threads that are involved in doing I/O 
to/from the device, they may benefit from sharing (or not sharing) cache 
and/or local memory with the interrupt thread and CPU where the 
interrupt is assigned.  For example, the Crossbow poll thread will 
benefit from being bound to the same CPU as the interrupt or the other 
CPU sharing cache with the interrupt thread/CPU, and the Crossbow soft 
ring threads will at least benefit from sharing local memory with the 
interrupt thread/CPU if you're doing network I/O on OpenSolaris.  You 
also should be careful what threads are placed on the CPU sharing cache 
with the interrupt CPU and thread because the threads may interfere with 
each other's cache utilization.


Your user application threads may benefit from running close or far away 
from the interrupt thread/CPU depending on what the threads do.


Thus, how you place your interrupt and your threads and whether you use 
processor sets, processor binding, etc. will depend on your workload and 
what you are trying to optimize.
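
To make the processor set suggestion above concrete, here is a rough sketch using pset_create(2), pset_assign(2), and the binding calls. The CPU IDs are made up for illustration, pset_create(2) needs the appropriate privilege, and marking CPUs nointr is still done separately with psradm as you are already doing.

/*
 * Sketch: create a processor set for the application, assign a couple of
 * example CPUs to it, and bind this process to the set (optionally pinning
 * it further to a single CPU in the set).
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/processor.h>
#include <sys/pset.h>

int
main(void)
{
	psetid_t pset;
	processorid_t cpus[] = { 2, 3 };	/* example CPU IDs */
	int i;

	if (pset_create(&pset) != 0) {		/* needs privilege */
		perror("pset_create");
		return (1);
	}
	for (i = 0; i < (int)(sizeof (cpus) / sizeof (cpus[0])); i++) {
		if (pset_assign(pset, cpus[i], NULL) != 0)
			perror("pset_assign");
	}

	/* Run this process only on the new set. */
	if (pset_bind(pset, P_PID, getpid(), NULL) != 0)
		perror("pset_bind");

	/* Optionally restrict it further to one CPU within the set. */
	if (processor_bind(P_PID, getpid(), cpus[0], NULL) != 0)
		perror("processor_bind");

	return (0);
}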


Hope that helps.



Jonathan

PS
Now I'll go back to lurking/sleeping.  ;-)



Re: [perf-discuss] madvise() and "heap" memory

2009-07-09 Thread Jonathan Chew

johan...@sun.com wrote:
> On Thu, Jul 09, 2009 at 12:18:17PM -0500, Bob Friesenhahn wrote:
>> Do madvise() options like MADV_ACCESS_LWP and MADV_ACCESS_MANY work on 
>> memory allocated via malloc()?  If MADV_ACCESS_LWP is specified and 
>> malloc() hands out heap memory which has been used before (e.g. by some 
>> other LWP), is the memory data in the range moved (physically copied and 
>> remapped to the same address via the VM subsystem) closest to the CPU core 
>> which next touches it?
>
> Currently, segvn is the only segment driver with support for madvise
> operations;


This part isn't exactly right.  madvise(3C) is also supported by the 
segspt virtual memory segment driver for Intimate Shared Memory (ISM) 
along with segvn.
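
To illustrate that point, here is a rough sketch that applies the advice to an ISM segment (backed by segspt) and to ordinary heap memory. The sizes are arbitrary, and note that, as I recall, madvise(3C) expects page-aligned addresses, which is why the heap allocation below uses memalign(3C).

/*
 * Sketch: madvise(3C) on an ISM shared memory segment (segspt) and on a
 * page-aligned heap allocation (segvn).  Error handling is minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 16 * 1024 * 1024;		/* example: 16MB */
	long pagesize = sysconf(_SC_PAGESIZE);
	int id;
	caddr_t ism, heap;

	/* ISM: SHM_SHARE_MMU at attach time gives a segspt-backed segment. */
	id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	if (id != -1) {
		ism = shmat(id, NULL, SHM_SHARE_MMU);
		if (ism != (caddr_t)-1 &&
		    madvise(ism, len, MADV_ACCESS_LWP) != 0)
			perror("madvise(ISM)");
	}

	/*
	 * For heap memory, use a page-aligned allocation (or round the
	 * range to page boundaries) before applying the advice.
	 */
	heap = memalign((size_t)pagesize, len);
	if (heap != NULL && madvise(heap, len, MADV_ACCESS_MANY) != 0)
		perror("madvise(heap)");

	return (0);
}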




Jonathan




Re: [perf-discuss] NUMAtop for OpenSolaris

2010-01-13 Thread Jonathan Chew
There has been a lot of discussion on this since it was proposed last 
month.  I want to know what is currently being proposed given the 
lengthy discussion.


Can someone please summarize what the current proposal is now?



Jonathan


Li, Aubrey wrote:
> johansen wrote:
>> On Tue, Jan 12, 2010 at 02:20:02PM +0800, zhihui Chen wrote:
>>> Application can be categorized into CPU-sensitive, Memory-sensitive,
>>> IO-sensitive.
>>
>> My concern here is that unless the customer knows how to determine
>> whether his application is CPU, memory, or IO sensitive it's going to be
>> hard to use the tools well.
>
> "sysload" in NUMAtop can tell the customer if the app is cpu sensitive.
> "Last Level Cache Miss per Instruction" will be added into NUMAtop to 
> determine if the app is memory sensitive.
>
>>> When the CPU triggers an LLC miss, the data can be gotten from local
>>> memory, cache, or memory in a remote node. Generally, the latency for
>>> local memory will be close to the latency for remote cache, while the
>>> latency for remote memory should be much higher.
>>
>> This isn't universally true.  On some SPARC platforms, it actually takes
>> longer to read a line out of a remote CPU's cache than it does to access
>> the memory on a remote system board.  On a large system, many CPUs may
>> have this address in their cache, and they all need to know that it has
>> become owned by the reading CPU.  If you're going to make this tool work
>> on SPARC, it won't always be safe to make this assumption.
>>
>> -j
>
> Thanks for pointing this issue out. We are not SPARC experts and I think
> SPARC NUMAtop design is not in our phase I design, :)
> We hope SPARC experts like you or other experts can take SPARC into 
> account and extend this tool onto the SPARC platform.
>
>> On systems where some remote memory accesses take longer than others,
>> this could be especially useful.  Instead of just reporting the number
>> of remote accesses, it would be useful to report the amount of time the
>> application spent accessing that memory.  Then it's possible for the
>> user to figure out what kind of performance win they might achieve by
>> making the memory accesses local.
>
> As for the metric of NUMAtop, memory access latency is a good idea.
> But the absolute amount is not a good indicator for NUMAtop. This amount
> will be different on different platforms; a specific number is
> good on one platform while it's bad on another one. It's hard to tell the
> customer what data is good. So we will introduce a ratio into NUMAtop:
>
> "LLC Latency ratio" = 
> "the actual memory access latency" / "calibrated local memory access latency"
>
> We assume a different node hop has a different memory access latency; a
> longer-distance node hop has a longer memory access latency. This ratio
> will be near 1 if most of the memory accesses of the application are to
> local memory.
>
> So as a conclusion, here we propose the metrics of NUMAtop:
> 1) sysload  -  cpu sensitive
> 2) LLC Miss per Instruction - memory sensitive
> 3) LLC Latency ratio - memory locality
> 4) the percent of the number of LMA/RMA accesses / total memory accesses
> - 4.1) LMA/(total memory access)%
> - 4.2) RMA/(total memory access)%
> - ...
>
> 4.2) could be separated into different % onto different NUMA hops.
> These parameters are not platform specific and are probably common enough
> to extend to the SPARC platform.
>
> Looking forward to your thoughts.
>
> BTW: Do we still need one more +1 vote for the NUMAtop project?
>
> Thanks,
> -Aubrey


Re: [perf-discuss] NUMAtop for OpenSolaris

2010-01-19 Thread Jonathan Chew

Li, Aubrey wrote:
> Hi Jonathan,
>
> Nice to see you have interest.
>
> We are discussing the metrics of NUMAtop, and so far the proposal is
> that the following parameters will be reported by NUMAtop as the metrics:
>
> 1) sysload  -  cpu sensitive
> 2) LLC Miss per Instruction - memory sensitive
> 3) LLC Latency ratio - memory locality
> 4) the percent of the number of LMA/RMA accesses / total memory accesses
> - 4.1) LMA/(total memory access)%
> - 4.2) RMA/(total memory access)%
>
> 4.2) could be separated into different % onto different NUMA node hops.
> These parameters are not platform specific and are probably common
> enough to extend to the SPARC platform.


Thanks for summarizing the metrics.  However, I wanted to see a summary 
of the overall NUMAtop proposal given the feedback that you have gotten, 
so I can understand what the project is proposing to do now that you 
have gotten feedback.  Then I can decide whether I have anything to add 
and whether I want to approve it as is or not.


From the email thread so far, it looks as though Krish gave a very 
brief description of the project, Jin Yao explained some phases for the 
project, and you have listed some proposed metrics for the tool.


Have any of these changed given the feedback that you have gotten?  
Can you please summarize your latest project proposal including the 
description, phases, metrics, and anything else that is useful for 
understanding what the project is proposing to do?




Jonathan


Re: [perf-discuss] NUMAtop for OpenSolaris

2010-02-23 Thread Jonathan Chew

Li, Aubrey wrote:
> Hi Jonathan,
>
> Do you have any comments about this proposal?
>
> Thanks,
> -Aubrey
>
> Li, Aubrey wrote:
>> Jonathan Chew wrote:
>>> Thanks for summarizing the metrics.  However, I wanted to see a summary
>>> of the overall NUMAtop proposal given the feedback that you have gotten,
>>> so I can understand what the project is proposing to do now that you
>>> have gotten feedback.  Then I can decide whether I have anything to add
>>> and whether I want to approve it as is or not.
>>>
>>> From the email thread so far, it looks as though Krish gave a very
>>> brief description of the project, Jin Yao explained some phases for the
>>> project, and you have listed some proposed metrics for the tool.
>>>
>>> Have any of these changed given the feedback that you have gotten?
>>> Can you please summarize your latest project proposal including the
>>> description, phases, metrics, and anything else that is useful for
>>> understanding what the project is proposing to do?
>>>
>>> Jonathan
>>
>> NUMAtop focuses on NUMA-related characteristics; it's a tool to help
>> developers identify memory locality in NUMA systems. The tool is
>> top-like: it shows the top N processes in the system and their memory
>> locality, with the processes that have the worst memory locality at the
>> top of the list. It can attach to a process to show the threads' memory
>> locality in the same top style as well.
>>
>> The information NUMAtop reports is collected from memory-related
>> hardware counters and the libcpc DTrace provider. Some of these counters
>> are already supported in kcpc and libcpc, while some of them are not.
>> Intel Nehalem-based and next-generation platforms provide a memory load
>> latency event, which is an important approach for NUMAtop and needs a
>> PEBS framework Solaris implementation.
>>
>> The following proposed metrics will be one part of our phase I job.
>> Applications can be classified into CPU-sensitive, Memory-sensitive, and
>> IO-sensitive. An IO-sensitive application can be identified by low CPU
>> utilization. A memory-sensitive application should be a CPU-sensitive
>> application with high CPU utilization.



Can you please explain what you mean by CPU, memory, and I/O sensitive?  
What do these have to do with memory locality?




>> So we have the following metrics:
>>
>> 1) sysload  -  cpu sensitive



What do you mean by "sysload"?



>> 2) LLC Miss per Instruction - memory sensitive



So, is a memory-sensitive thread one that has low or high LLC miss per 
instruction?




>> After we figure out that the application is memory-sensitive, we'll check
>> memory locality metrics to see what the cause of the performance
>> regression is.



How will you do that?  Do you mean that you will try to use the four 
metrics that you have listed here to determine the cause?




>> 3) LLC Latency Ratio (Average Latency for LLC Miss / Local Memory Access
>> Latency)



Will the latency for each LLC miss be measured then?  Is the local 
memory latency the *ideal* local memory latency when the system is 
unloaded or the *current* local memory latency which may be higher than 
the ideal because of load?




>> 4) Source distribution for LLC miss:
>>  - 4.1) LMA/(Total LLC Miss Retired)%
>>  - 4.2) RMA/(Total LLC Miss Retired)%



Will these ratios be given for each NUMA node, the whole system, or both?



>> Here, 4.2) could be separated into different % onto different NUMA node
>> hops.



Do you mean that the total RMA will be broken down into percentage of 
remote memory accesses to each NUMA node from a given NUMA node?




>> NUMAtop should have a useful report to show how effectively the
>> application is using local memory.


I think that someone already pointed out that you don't seem to mention 
anything about where the thread runs as part of your proposal even 
though that is pretty important in figuring out how effectively a thread 
is using local memory.  The thread won't use local memory very 
effectively if it never runs on CPUs where its local memory lives.


Also, the memory allocation policy may matter too.  For example, a 
thread may access remote memory a lot if it is accessing shared memory 
because the default memory allocation policy for shared memory is to 
spread it out by allocating it randomly across lgroups.
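
To make that concrete, here is a rough sketch of how a thread (or a tool like NUMAtop) could compare its home lgroup with the lgroup backing one of its pages, using lgrp_home(3LGRP) and meminfo(2). The single malloc'd buffer is purely illustrative; a real tool would sample many addresses. Link with -llgrp.

/*
 * Sketch: query the lgroup that owns the physical page behind a virtual
 * address (MEMINFO_VLGRP) and compare it with the calling thread's home.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/mman.h>
#include <sys/lgrp_user.h>

int
main(void)
{
	char *buf = malloc(4096);
	uint64_t inaddr, outdata;
	uint_t info[1] = { MEMINFO_VLGRP };	/* which lgroup owns the page */
	uint_t validity[1];

	if (buf == NULL)
		return (1);
	(void) memset(buf, 0, 4096);		/* make sure the page is backed */

	inaddr = (uint64_t)(uintptr_t)buf;
	if (meminfo(&inaddr, 1, info, 1, &outdata, validity) != 0) {
		perror("meminfo");
		return (1);
	}

	(void) printf("thread home lgroup: %d\n",
	    (int)lgrp_home(P_LWPID, P_MYID));
	/* Per meminfo(2), bit 1 of validity indicates info[0] is valid. */
	if (validity[0] & 0x2)
		(void) printf("page's lgroup:      %llu\n",
		    (unsigned long long)outdata);
	return (0);
}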




>> We need the PEBS framework to implement the metrics of NUMAtop.
>> We need the MPO sponsor and libcpc DTrace provider sponsor to figure out
>> where it is not effective and why.



Ok.



>> A better memory placement strategy suggestion is also a valuable goal of
>> NUMAtop.



How are you proposing to do that?



Jonathan
