RE: [perf-discuss] Comments from NUMA observability tools

Philip Beevers Wed, 30 Nov 2005 14:40:57 -0800

> More comments for pmadvise:

The product I work on is already shipping something which does the
equivalent of a small subset of madvise, so it would be generally useful to
us.


I have to echo the other positive comments about the lgrp tools. They're
shedding light on a whole area which has previously been difficult to
appreciate. The ideal would be for the tools to give some idea of the latency
penalty of accessing pages across groups [is the group configuration just
statically configured somewhere? If so, could this info be a part of that?],
and the logical next step is the amount of time my app is spending waiting
to access those pages - but obviously that's a way off.

And one specific point...

>  - "plgrp -G <pid>" returns a number plus an extra blank line.  Why the
> blank line?  bug? feature?
>  - Can you add a flag to up the verbosity?  Getting just the lgroup ID
> back is handy,

I sort-of agree; the first one of these tools I ran was plgrp, and whilst
it's my fault for not reading the docs, I initially wondered just what that
number was (if you're new to this stuff you could conceivably think it's a
CPU id or something). Maybe just a couple of words like "Home lgrp: " (sorry
if the terminology's a bit off) in front of it would make it clearer. If the
terseness is intentional (I can see it would make scripting easier) feel
free to ignore!

Getting a bit radical for a minute - with (for example) psrset I can get
hold of _all_ processor set bindings using psrset -q. I think it would be
useful to be able to get home lgrp information across a number of processes;
it's probably not the right place for it, but I'm imagining something like
being able to specify the home lgrp as an output column on ps(1).
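
As a rough illustration of the idea above, something like the following
shell sketch could collect home lgroups for a set of processes with the
tools as they stand. It assumes the prototype behaviour quoted in this
thread, where "plgrp -G <pid>" prints just the home lgroup ID. The function
name home_lgrps and the PLGRP/PGREP override variables are made up for this
example and are not part of the tools.

```shell
#!/bin/sh
# Sketch only: print "PID HOME_LGRP" for every process matching a pattern.
# Assumes "plgrp -G <pid>" prints only the home lgroup ID (possibly followed
# by a blank line), as shown in the quoted message.
# PLGRP and PGREP are hypothetical hooks so the commands can be overridden.
PLGRP=${PLGRP:-plgrp}
PGREP=${PGREP:-pgrep}

home_lgrps() {
    pattern=$1
    echo "PID HOME_LGRP"
    for pid in $("$PGREP" "$pattern"); do
        # Take the first non-blank output line; ignore the stray blank line.
        lgrp=$("$PLGRP" -G "$pid" 2>/dev/null | awk 'NF { print $1; exit }')
        # Print "-" when no lgroup ID could be read for this pid.
        echo "$pid ${lgrp:--}"
    done
}
```

Running "home_lgrps lat" would then print one line per matching process,
which is about as close to an extra ps(1) column as a script can get.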


> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Alexander Kolbasov
> Sent: 15 November 2005 21:16
> To: perf-discuss@opensolaris.org
> Subject: [perf-discuss] Comments from NUMA observability tools
> 
> 
> We received some internal comments for the NUMA observability 
> tools. Here they are with my comments separated with // lines:
> 
>  - "plgrp -G <pid>" returns a number plus an extra blank line.  Why the
> blank line?  bug? feature?
> 
> // Bug
> 
>  - Can you add a flag to up the verbosity?  Getting just the lgroup ID
> back is handy, but it might be nice to get some more info, like what
> CPU(s) that lgroup is associated with, what other lgroups are available,
> how much memory each lgroup has access to, etc.  Can that be done?
> 
> // This is delegated to the lgrpinfo utility, which prints all the
> // information about lgroups. The plgrp utility specifically deals with
> // process and thread lgroup placement.
> 
>  - How do I figure out how many lgroups are on the machine?  Can we make
> plgrp report that by default instead of the "-h" help output?  Seems
> like a reasonable place to find this info.
> 
> // This is done by lgrpinfo
> 
>  - Can you change the lgroup of <pid> if pbind is used on the same 
> <pid>?  It does not appear that way to me in my testing.  
> 
> // It is not possible. Implementation-wise, plgrp sets thread affinity to
> // the target lgroup which, in the absence of processor sets and processor
> // bindings, also changes the "home" lgroup. Processor set and processor
> // binding assignments are considered first, so while plgrp can
> // successfully set thread affinities, it can't actually set the home in
> // such cases. The specified affinities will come into play when a thread
> // is unbound or moved outside a processor set.
> 
> IMPORTANT: Side bar conversation... It would be very handy to be able to
> bind a process to a particular CPU and then have all of its memory be
> local to a different CPU, for some of the testing I am doing on G4.  Is
> there a way in S10 that I can do this?  numactl in linux can do that.  I
> could really use it on Solaris, right now...
> 
> // There is no way of doing exactly this.
> 
> - Apparently, you can tell plgrp to set a pid to an lgrp value and it
> does not complain if it fails...
> 
> // This is a bug that should be fixed.
> 
> # pgrep lat
> 14894
> 14866
> # ./plgrp -G 14866
> 1
> 
> # ./plgrp -S 2 14866
> # ./plgrp -G 14866
> 1
> 
> It should give you some sort of indication that something was not done,
> and hopefully a little bit on why.  Right?
> 
> // Right
> 
> In fact, you can try some ridiculous values without a peep out of it 
> (unless I miss why 10 or 20 make sense), and nothing changes.
> 
> # pgrep lat_mem
> 14821
> 14711
> # ./plgrp -S 4 14711
> # ./plgrp -S 10 14711
> # ./plgrp -S 20 14711
> # ./plgrp -G 14711
> 1
> 
> # ./plgrp -S 20 14711
> # ./plgrp -G 14711
> 1
> 
> // This is a bug.
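
Until that bug is fixed, a wrapper can catch the silent failure by reading
the home lgroup back after setting it. This is only a sketch: it assumes
the prototype "-S <lgrp> <pid>" and "-G <pid>" syntax shown in the
transcripts above, and the function name set_home_checked and the PLGRP
override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: set a home lgroup, then read it back and complain on mismatch.
# Assumes the prototype syntax from this thread: "plgrp -S <lgrp> <pid>" to
# set and "plgrp -G <pid>" to print the home lgroup ID.
# PLGRP is a hypothetical hook so the command can be overridden.
PLGRP=${PLGRP:-plgrp}

set_home_checked() {
    want=$1
    pid=$2
    "$PLGRP" -S "$want" "$pid" || return 1
    # First non-blank line of the -G output is taken as the home lgroup ID.
    got=$("$PLGRP" -G "$pid" | awk 'NF { print $1; exit }')
    if [ "$got" != "$want" ]; then
        echo "pid $pid is homed on lgroup ${got:-?}, wanted $want" >&2
        return 1
    fi
}
```

With this, "set_home_checked 20 14711" would return non-zero and print a
message in the case shown above, instead of saying nothing.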
> 
> Here is another bunch of comments:
> 
> I've been applying the tools to a few problems and benchmarks 
> I've been involved with over the last few days and I intend 
> to keep on applying them to what I can from now on. Overall, 
> I think that they are excellent tools and exactly what we 
> need. I'll follow up with more details, thoughts and results 
> in the next few days.
> 
> Just having the ability to observe lgroup  topology and usage 
> is a huge step forward and opens up a whole area of 
> investigations that were, up to now, fairly closed off.
> 
> >>IMPORTANT: Side bar conversation... It would be very handy to be able
> >>to bind a process to a particular CPU and then have all of its memory be
> >>local to a different CPU, for some of the testing I am doing on G4.  Is
> >>there a way in S10 that I can do this?  numactl in linux can do that.  I
> >>could really use it on Solaris, right now...
> >
> >This is interesting. Can you explain why you would like such
> >functionality?
> 
> I've been doing exactly this today with some experimentation
> with the STREAM benchmark. I think this would work (heap example):
> 
> - Change the home lgroup of the thread in question to that of the CPU
>   where you want the memory allocated,
> - use 'pmadvise -o heap=access_lwp' on the process
> - the memory should now get allocated in the newly homed lgrp
>   (check with 'pmap -L' and lgrpinfo (if the allocation is large enough
>   to notice)).
> - rehome the thread to another CPU's lgroup or just bind it.
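
The four steps above can be sketched as a small script. It uses only the
commands and flags mentioned in this thread (the prototype "plgrp -S", and
the advice name spelled access_lwp as in the pmadvise examples later in
this message). The function name place_heap and the RUN variable are
hypothetical; by default the script only prints the commands rather than
running them.

```shell
#!/bin/sh
# Sketch only: place a process's heap on one lgroup, then run it on another.
# RUN is a hypothetical hook: the default "echo" prints the commands (dry
# run); set RUN to an empty string to execute them where the tools exist.
RUN=${RUN-echo}

place_heap() {
    pid=$1        # target process
    mem_lgrp=$2   # lgroup that should hold the heap
    run_lgrp=$3   # lgroup to move the process to afterwards
    # 1. Home the process on the lgroup where the memory should live.
    $RUN plgrp -S "$mem_lgrp" "$pid"
    # 2. Advise that the heap be placed near the LWP that accesses it.
    $RUN pmadvise -o heap=access_lwp "$pid"
    # 3. Inspect lgroup placement per mapping; presumably only meaningful
    #    once the process has actually touched its heap.
    $RUN pmap -L "$pid"
    # 4. Rehome the process to the lgroup it should run on.
    $RUN plgrp -S "$run_lgrp" "$pid"
}
```

For example, "place_heap 14866 2 1" prints the four commands for pid 14866
in order, which makes it easy to review them before running anything.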
> 
> This is how I was doing it and there may well be other/better
> ways - I'm all ears. However, this is all a bit unwieldy and
> I would very much like to be able to say that thread X's heap
> should be allocated from lgroup Y, much like the
> 'migrate range' option of SGI's dplace(1) command language.
> Note that I've never used dplace(1) but I do work with
> ex-SGI'ers who speak well of it. It certainly looks to be powerful stuff.
> 
> There was also a question:
> 
> I grabbed the ptools-bin-0.1.2.tar.gz off of opensolaris.org, and
> lgrpinfo was not in it.  I had seen it before.  Do you plan to include
> it in that tar.gz?
> 
> // It is distributed separately from
> // http://www.opensolaris.org/os/community/performance/numa/observability/perllgrp/
> // or via CPAN at http://search.cpan.org/dist/Solaris-Lgrp/

> More comments for pmadvise:
>
> It would be nice to apply memory placement advice to a process from the
> start. There is a nice DTrace example using system() calls to apply
> pmadvise to a just-started process.
>
> This is fine but a bit messy. A nice option would be to make pmadvise a
> libproc consumer and have it exec the target program. That way we could
> possibly do things such as:
>
> pmadvise -o heap=access_lwp '/path/to/command -flags'
>
> Then again, how about having a control file along the lines of the way we
> do mpss? In it we could specify a policy to apply to a range of processes
> and apply it via a preloader, e.g.:
>
> oracle*:heap=access_lwp,stack=access_lwp
>
> // Please see madv.so.1(1) for this kind of functionality

__
Compiled by Alex Kolbasov


_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
