Marc Rocas wrote:

All,

I ended up creating a DISM segment, migrating the creating thread to the lgrp containing cpu0, madvising, touching the memory, and MC_LOCK'ing it, per Jonathan's suggestions, and it all worked out. The memory was allocated exclusively from lgrp 1 (cpu0) using 2M pages, per the output of 'pmap -Ls $pid'.
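
In outline, the sequence looks roughly like the following sketch (not my exact code; the function name and size are just for the example, error handling is trimmed, and it assumes the calling thread has already been homed on the lgroup containing cpu0 via lgrp_affinity_set(3LGRP)):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <string.h>

    #define SEG_SIZE (300UL * 1024 * 1024)      /* illustrative size */

    /*
     * Rough sketch of the DISM flow described above: create the segment,
     * attach it pageable, advise next-touch placement, touch every page
     * from the (already re-homed) thread, then lock the pages down.
     */
    char *
    alloc_dism_low(void)
    {
        int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
        char *addr = shmat(id, NULL, SHM_PAGEABLE);    /* DISM, not ISM */

        /* Place pages in the home lgroup of the next thread to touch them. */
        (void) madvise(addr, SEG_SIZE, MADV_ACCESS_LWP);

        /* Touch every page so physical memory is actually allocated now. */
        (void) memset(addr, 0, SEG_SIZE);

        /* Lock the pages so the placement sticks. */
        (void) memcntl(addr, SEG_SIZE, MC_LOCK, 0, 0, 0);

        return (addr);
    }

After the touch-and-lock, 'pmap -Ls $pid' is what confirms both the lgroup and the 2M page size.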

To avoid completely hard-coding cpu0, I allocate a 32MB ISM segment and walk it page by page looking for physical addresses within the first 4GB, keeping track of which lgrp each such page comes from to build a list of candidate lgrps; I then pick the candidate with the most free memory to allocate from. Once the information is collected I destroy the ISM segment. In my case the winner turns out to be cpu0, since this is an Opteron box and, as Jonathan pointed out previously, the lower PA range is associated with cpu0. The discovery algorithm is not perfect, since it depends on ISM segment placement being randomized, but hopefully that is one aspect of the implementation that will not change anytime soon.
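
The scan itself is just meminfo(2) over the segment's pages, along these lines (simplified sketch, not my actual code; scan_segment and MAX_LGRPS are names I made up for the example):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <inttypes.h>
    #include <unistd.h>

    #define MAX_LGRPS 64    /* arbitrary bound for the example */

    /*
     * Walk an attached segment page by page and count, per lgroup,
     * how many pages are backed by physical memory below 4GB.
     */
    void
    scan_segment(char *addr, size_t len, size_t low_pages[MAX_LGRPS])
    {
        const uint_t req[] = { MEMINFO_VPHYSICAL, MEMINFO_VLGRP };
        long pgsz = sysconf(_SC_PAGESIZE);
        size_t off;

        for (off = 0; off < len; off += pgsz) {
            uint64_t in = (uint64_t)(uintptr_t)(addr + off);
            uint64_t out[2];
            uint_t valid;

            if (meminfo(&in, 1, req, 2, out, &valid) == -1)
                continue;
            /* bit 0: address valid; bits 1-2: the two info requests */
            if ((valid & 0x7) != 0x7)
                continue;
            if (out[0] < (1ULL << 32) && out[1] < MAX_LGRPS)
                low_pages[out[1]]++;    /* below-4GB page in this lgrp */
        }
    }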


There may be a slightly better way to allocate memory from the lgroup containing the least significant physical memory using lgrp_affinity_set(3LGRP), meminfo(2), and madvise(MADV_ACCESS_LWP).
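
Roughly, the homing step would look something like this (untested sketch; place_on_lgrp is just an illustrative name, and the target lgroup id would come from whatever discovery step you use):

    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <sys/mman.h>
    #include <string.h>

    /*
     * Home the calling LWP on 'target', ask for next-touch placement of
     * [addr, addr+len), and touch the pages from the re-homed thread.
     * Link with -llgrp.
     */
    void
    place_on_lgrp(lgrp_id_t target, char *addr, size_t len)
    {
        (void) lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG);
        (void) madvise(addr, len, MADV_ACCESS_LWP);
        (void) memset(addr, 0, len);
    }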

Unfortunately, none of these ways are guaranteed to always allocate memory in the least significant 4 gigabytes of physical memory. For example, if the node with the least significant physical memory contains more than 4 gigabytes of RAM, physical memory may come from this node, but be above the least significant 4 gigabytes. :-(

I spoke to one of our I/O guys to see whether there is a prescribed way to allocate physical memory in the least significant 4 gig for DMA from userland. Solaris doesn't provide one. The philosophy is that users shouldn't have to care, and that the I/O buffer will be copied to/from the least significant physical memory inside the kernel if the device can only DMA that far. I think that you may be able to write a driver that allocates physical memory in the desired range and exposes it with mmap(2) or ioctl(2).
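
If you did go the driver route, the 32-bit reachability would be expressed through the DMA attributes the driver hands to the DDI, something along these lines (sketch only; the name and values are illustrative):

    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    /*
     * Illustrative DMA attributes limiting allocations to the low 4GB.
     * Memory allocated against these via ddi_dma_mem_alloc(9F) could
     * then be exported to userland through the driver's mmap/devmap
     * entry points.
     */
    static ddi_dma_attr_t low4g_dma_attr = {
        DMA_ATTR_V0,            /* dma_attr_version */
        0x0000000000000000ULL,  /* dma_attr_addr_lo */
        0x00000000FFFFFFFFULL,  /* dma_attr_addr_hi: stay below 4GB */
        0x00000000FFFFFFFFULL,  /* dma_attr_count_max */
        1,                      /* dma_attr_align */
        1,                      /* dma_attr_burstsizes */
        1,                      /* dma_attr_minxfer */
        0x00000000FFFFFFFFULL,  /* dma_attr_maxxfer */
        0x00000000FFFFFFFFULL,  /* dma_attr_seg */
        1,                      /* dma_attr_sgllen */
        1,                      /* dma_attr_granular */
        0                       /* dma_attr_flags */
    };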

I'd like to understand more about your situation to get a better idea of the constraints and whether there is a viable solution or the need for one.

What is your device? Can you afford to let the kernel copy the user's buffer into another buffer which is below 4 gig to DMA to your device? If not, would it make sense to write your own driver?


By the way, the lgrp API, meminfo(2), and the information provided by everyone made it pretty easy. Thanks a bunch. Once we start playing around with our performance suites I may have further feedback on the tools and/or questions.


I'm glad that you found the lgroup APIs and meminfo(2) easy to use.

Please let us know whether you have any more questions or feedback.



Jonathan


On 9/16/05, Marc Rocas <[EMAIL PROTECTED]> wrote:

    Eric, Bart, and Jonathan,

    Thanks for your quick replies. See my comments below:

    On 9/16/05, jonathan chew <[EMAIL PROTECTED]> wrote:

        Marc Rocas wrote:

I've been playing around with the tools on a Stinger box and I think
they're pretty cool!


        I'm glad that you like them.  We like them too and think that
        they are fun to play with besides being useful for observability
        and experimenting with performance.  I hope that our MPO
        overview, man pages, and our examples on how to use the tools
        were helpful.


    All of the above were pretty useful in getting up to speed and I
    will use them to do performance experiments. I simply need to bring
    up our system on the Stinger box in order to run our perf suites.

The one question I have is whether ISM (Intimate Shared Memory)
segments are immune to being coerced to relocate via pmadvise(1)?
I've tried it without success. A quick look at the seg_spt.c code
seemed to indicate that when an spt segment is created, its lgroup
policy is set to LGRP_MEM_POLICY_DEFAULT, which results in randomized
allocation for segments greater than 8MB. I've verified as much using
the NUMA-enhanced pmap(1) command.


        It sounds like Eric Lowe has a theory as to why
        madvise(MADV_ACCESS_*) and pmadvise(1) didn't work for migrating
        your ISM segment.  Joe and Nils are experts on the x86/AMD64 HAT
        and may be able to comment on Eric's theory that the lack of
        dynamic ISM unmap is preventing page migration from working.

        I'll see whether I can reproduce the problem.

        Which version of Solaris are you using?


    Solaris 10 GA with no kernel patches applied.

I have an ISM segment that gets consumed by HW that is limited to
32-bit addressing, and thus I need to control the physical range that
backs the segment. At this point, it would seem that I need to
allocate the memory (about 300MB) in the kernel and map it back to
user-land, but I would lose the use of 2MB pages since I have not
quite figured out how to allocate memory using a particular page
size. Have I misinterpreted the code? Do I have other options?


        Bart's suggestion is the simplest (brute force) way.  Eric's
        suggestion sounded a little painful.  I have a couple of other
        options below, but can't think of a nice way to specify that
        your application needs low physical memory.  So, I want to
        understand what you are doing better to see if there is any
        better way.

        Can you please tell me more about your situation and
        requirements?  What is the hardware that needs the 32-bit
        addressing (framebuffer, device needing DMA, etc.)?  Besides
        needing to be in low physical memory, does it need to be wired
        down and shared?


    A device needing DMA. It needs to be wired down as well as shared.
    We're running fine with Bart's suggestion but still need a way to
    park the segment in the lower 4GB PA range since we actually want
    to run experiments with a minimum of 8GB of RAM.

        Jonathan

        PS
        Assuming that you can change your code, here are a couple of
        other options that are less painful than Eric's suggestion but
        still aren't very elegant because of the low physical memory
        requirement:

        - Use DISM instead of ISM, which is Dynamic Intimate Shared
        Memory and is pageable (see the SHM_PAGEABLE flag to shmat(2))

        OR

        - Use mmap(2) and the MAP_ANON (and MAP_SHARED if you need
        shared memory) flag to allocate (shared) anonymous memory

        - Call memcntl(2) with MC_HAT_ADVISE to specify that you want
        large pages

        AND

        - Call madvise(MADV_ACCESS_LWP) on your mmap-ed or DISM segment
        to say that the next thread to access it will use it a lot

        - Access it from CPU 0 on your Stinger.  I don't like this part
        because this is hardware implementation specific.  It turns out
        that the low physical memory usually lives near CPU 0 on an
        Opteron box.  You can use liblgrp(3LIB) to discover which leaf
        lgroup contains CPU 0 and lgrp_affinity_set(3LGRP) to set a
        strong affinity for that lgroup (which will set the home lgroup
        for the thread to that lgroup).  Alternatively, you can use
        processor_bind(2) to bind/unbind to CPU 0.

        - Use the MC_LOCK flag to memcntl(2) to lock down the memory
        for your segment if you want the physical memory to stay there
        (until you unlock it)
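
        Putting the mmap-based variant together, the whole sequence
        might look roughly like this (untested sketch; the function
        name and size are illustrative, error handling is omitted, and
        the 2MB page size is just assumed to be available):

        #include <sys/types.h>
        #include <sys/mman.h>
        #include <sys/processor.h>
        #include <sys/procset.h>
        #include <string.h>

        #define SEG_SIZE (300UL * 1024 * 1024)  /* illustrative size */
        #define PGSZ_2M  (2UL * 1024 * 1024)

        void *
        alloc_low_candidate(void)
        {
            struct memcntl_mha mha;
            char *addr;

            /* MAP_ALIGN uses 'addr' as the requested alignment (2MB). */
            addr = mmap((void *)PGSZ_2M, SEG_SIZE, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED | MAP_ALIGN, -1, 0);

            /* Ask the HAT to back the range with 2MB pages. */
            mha.mha_cmd = MHA_MAPSIZE_VA;
            mha.mha_flags = 0;
            mha.mha_pagesize = PGSZ_2M;
            (void) memcntl(addr, SEG_SIZE, MC_HAT_ADVISE,
                (caddr_t)&mha, 0, 0);

            /* Bind to CPU 0 so this thread is homed in its lgroup. */
            (void) processor_bind(P_LWPID, P_MYID, 0, NULL);

            /* Next-touch placement, then touch and lock the pages. */
            (void) madvise(addr, SEG_SIZE, MADV_ACCESS_LWP);
            (void) memset(addr, 0, SEG_SIZE);
            (void) memcntl(addr, SEG_SIZE, MC_LOCK, 0, 0, 0);

            /* Unbind once the memory is placed and locked. */
            (void) processor_bind(P_LWPID, P_MYID, PBIND_NONE, NULL);

            return (addr);
        }

        The lgrp_affinity_set(3LGRP) route would simply replace the
        processor_bind(2) calls.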

    I will try using DISM, madvise, and processor_bind as that will
    allow me to avoid changing the value of lgrp_shm_random_thresh.

    Lastly, I tried relocating the memory by following the
    instructions in Alexander Kolbasov's blog (Memory Placement): I
    wrote a simple app that attaches to the existing segment with a
    few sleep() calls to give me time to type the following commands:

    # pmap -Ls $pid | fgrep "ism shmid=0x0"
    E9E00000    2048K    2M  rwxsR    1   [ ism shmid=0x0 ]
    ...
    # plgrp -S 1 $pid
    # pmadvise -o E9E00000=access_lwp $pid

    An initial sleep of 3 minutes gives me time to run the above
    commands; the app then does a few writes to the segment and sleeps
    for another minute so that I can invoke pmap -Ls and verify
    whether the segment migrated or not.

    I'll try the experiments on Monday when I'm back at the office.

    Again, thanks for all the help.

    --Marc



