Joe Bonasera wrote:
jonathan chew wrote:
It sounds like Eric Lowe has a theory as to why
madvise(MADV_ACCESS_*) and pmadvise(1) didn't work for migrating your
ISM segment. Joe and Nils are experts on the x86/AMD64 HAT and may
be able to comment on Eric's theory that the lack of dynamic ISM
unmap is preventing page migration from working.
Ok.. I've missed the start of this thread, having been distracted by
a million other things just now..
Can someone bring me up to speed on the problem and on what Eric's theory was?
I've attached Eric Lowe's email. The relevant part was the following:
segspt_shmadvise() implements the NUMA migration policies for all shared memory
types, so next touch (MADV_ACCESS_LWP) should work. Unfortunately, the x64
HAT layer does not implement dynamic ISM unmap. Since the NUMA migration
code is driven by minor faults (the definition of next touch depends on that),
I suspect that is why madvise() does not work for ISM on that machine.
Jonathan
--- Begin Message ---
On Thu, Sep 15, 2005 at 11:55:21PM -0400, Marc Rocas wrote:
|
| The one question I have is whether ISM (Intimate Shared Memory) segments are
| immune to being coerced to relocate via pmadvise(1)? I've tried it without
| success. A quick look at the seg_spt.c code seemed to indicate that when an
| spt segment is created, its lgroup policy is set to LGRP_MEM_POLICY_DEFAULT,
| which will result in randomized allocation for segments greater than 8MB.
| I've verified as much using the NUMA-enhanced pmap(1) command.
You are correct about the policy. Since shared memory is usually, uh, shared :)
we spread it around to prevent hot-spotting.
segspt_shmadvise() implements the NUMA migration policies for all shared memory
types, so next touch (MADV_ACCESS_LWP) should work. Unfortunately, the x64
HAT layer does not implement dynamic ISM unmap. Since the NUMA migration
code is driven by minor faults (the definition of next touch depends on that),
I suspect that is why madvise() does not work for ISM on that machine.
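For concreteness, a minimal sketch of the call sequence being discussed: create
an ISM segment, then request next-touch placement with madvise(MADV_ACCESS_LWP).
SEG_KEY and SEG_SIZE are arbitrary placeholders, and pmadvise(1) would apply the
same advice from outside the process.

    /*
     * Minimal sketch, not from the original thread.  SEG_KEY and SEG_SIZE
     * are placeholders; error handling is kept to a minimum.
     */
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    #define SEG_KEY  0x1234                  /* placeholder key */
    #define SEG_SIZE (64UL * 1024 * 1024)    /* 64MB, > 8MB random threshold */

    int
    main(void)
    {
            int id = shmget(SEG_KEY, SEG_SIZE, IPC_CREAT | 0600);
            if (id == -1) {
                    perror("shmget");
                    return (1);
            }

            /* SHM_SHARE_MMU asks for ISM (locked pages, shared page tables). */
            void *addr = shmat(id, NULL, SHM_SHARE_MMU);
            if (addr == (void *)-1) {
                    perror("shmat");
                    return (1);
            }

            /*
             * Request that pages migrate to be near the next LWP that touches
             * them.  Per the explanation above, on x64 this may have no visible
             * effect for ISM: without dynamic ISM unmap there are no minor
             * faults to drive the migration.
             */
            if (madvise((caddr_t)addr, SEG_SIZE, MADV_ACCESS_LWP) == -1)
                    perror("madvise(MADV_ACCESS_LWP)");

            /* Touch the pages from the thread that should own them. */
            memset(addr, 0, SEG_SIZE);

            return (0);
    }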
| I have an ISM segment that gets consumed by HW that is limited to 32-bit
| addressing, and thus I need to control the physical range that backs the
| segment. At this point, it would seem that I need to allocate the memory
| (about 300MB) in the kernel and map it back to user-land, but I would lose
| the use of 2MB pages since I have not quite figured out how to allocate
| memory using a particular page size. Have I misinterpreted the code? Do I
| have other options?
The twisted maze of code for ISM allocates its memory (ironically, since it's
always locked and hence doesn't need swap) through anon. There is no way
to restrict the segment to a specific PA range or otherwise impact the
allocation path until the pages are already (F_SOFTLOCK) faulted in. The
NUMA code might help in some cases because the first lgrp happens to fall
under 4G :) so you can probably hack your way through this at the application
layer by changing the shared memory random threshold to ULONG_MAX in /etc/system,
binding your application thread to a CPU in the correct lgroup (one whose memory
is entirely below 4G), and then doing the shmat() from there. It's a ginormous
hack, but it will get you the results you want. :)
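A sketch of that hack, under two stated assumptions: the shared memory random
threshold has already been raised in /etc/system (the tunable I believe is meant
here is lgrp_shm_random_thresh), and LOW_MEM_CPU is a placeholder for a CPU id
known ahead of time (e.g. via the liblgrp interfaces) to sit in the lgroup whose
memory is entirely below 4G.

    /*
     * Sketch of the application-level hack described above.  Assumes
     * /etc/system already raises the shared memory random threshold, e.g.
     * (the tunable name is my assumption of what is meant above):
     *     set lgrp_shm_random_thresh=0xffffffffffffffff
     * LOW_MEM_CPU, SEG_KEY and SEG_SIZE are placeholders.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    #define LOW_MEM_CPU 0                     /* placeholder CPU id */
    #define SEG_KEY     0x1234                /* placeholder key */
    #define SEG_SIZE    (300UL * 1024 * 1024)

    int
    main(void)
    {
            /* Bind this LWP to a CPU whose home lgroup has memory only below 4G. */
            if (processor_bind(P_LWPID, P_MYID, LOW_MEM_CPU, NULL) == -1) {
                    perror("processor_bind");
                    return (1);
            }

            /*
             * With random placement effectively disabled, the anon pages backing
             * the ISM segment should come from the bound thread's home lgroup,
             * i.e. below 4G on this particular machine.
             */
            int id = shmget(SEG_KEY, SEG_SIZE, IPC_CREAT | 0600);
            if (id == -1) {
                    perror("shmget");
                    return (1);
            }
            void *addr = shmat(id, NULL, SHM_SHARE_MMU);
            if (addr == (void *)-1) {
                    perror("shmat");
                    return (1);
            }

            /* ... hand addr to the 32-bit DMA-limited device from here ... */
            (void) addr;
            return (0);
    }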
Once SysV shared memory is redesigned to not rely on the anon layer, doing this
sort of thing in the kernel should become a lot easier.
Being able to specify where user memory ends up in PA space to avoid copying
it on I/O is an RFE we simply have never thought about before. The lgroup
code is careful to avoid specifics like memory addresses, since lgrp-to-physical
mappings are very machine-specific, so making this sort of thing work would
require adding a whole new set of segment advice ops that are used by the
physical memory allocator itself; page_create_va() takes the segment as one
of its arguments, so if we stuffed the PA range advice into the segment,
we could dig it up down there and request memory in the "correct" range from
the freelists.
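To make that proposal concrete, here is a purely hypothetical illustration;
none of these names exist in Solaris today, they only show where such PA-range
advice could be stored and where page_create_va() could consult it.

    /*
     * Hypothetical only -- seg_pa_advice_t and seg_get_pa_advice() do not
     * exist; they merely illustrate the shape of the RFE described above.
     */
    #include <sys/types.h>

    /* PA-range advice that could be hung off a segment via new advice ops. */
    typedef struct seg_pa_advice {
            uint64_t pa_low;     /* lowest acceptable physical address  */
            uint64_t pa_high;    /* highest acceptable physical address */
    } seg_pa_advice_t;

    /*
     * page_create_va() already receives the segment, so the allocator could
     * do something like:
     *
     *     seg_pa_advice_t *adv = seg_get_pa_advice(seg);      (hypothetical)
     *     if (adv != NULL)
     *             limit the freelist search to [adv->pa_low, adv->pa_high];
     */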
--
Eric Lowe Solaris Kernel Development Austin, Texas
Sun Microsystems. We make the net work. x64155/+1(512)401-1155
--- End Message ---
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org