Marc Rocas wrote:

All,

I ended up creating a DISM segment, migrating the creating thread to the lgrp containing cpu0, madvising, touching the memory, and MC_LOCK'ing it, per Jonathan's suggestions, and it all worked out. The memory was allocated exclusively from lgrp 1 (cpu0) using 2M pages, per the output of 'pmap -Ls $pid'.
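
In outline, the sequence looks roughly like the following sketch (not my exact code; the function name and size are just for the example, error handling is trimmed, and it assumes the calling thread has already been homed on the lgroup containing cpu0 via lgrp_affinity_set(3LGRP)):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <string.h>

    #define SEG_SIZE (300UL * 1024 * 1024)      /* illustrative size */

    /*
     * Rough sketch of the DISM flow described above: create the segment,
     * attach it pageable, advise next-touch placement, touch every page
     * from the (already re-homed) thread, then lock the pages down.
     */
    char *
    alloc_dism_low(void)
    {
        int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
        char *addr = shmat(id, NULL, SHM_PAGEABLE);    /* DISM, not ISM */

        /* Place pages in the home lgroup of the next thread to touch them. */
        (void) madvise(addr, SEG_SIZE, MADV_ACCESS_LWP);

        /* Touch every page so physical memory is actually allocated now. */
        (void) memset(addr, 0, SEG_SIZE);

        /* Lock the pages so the placement sticks. */
        (void) memcntl(addr, SEG_SIZE, MC_LOCK, 0, 0, 0);

        return (addr);
    }

After the touch-and-lock, 'pmap -Ls $pid' is what confirms both the lgroup and the 2M page size.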

To avoid completely hard-coding cpu0, I allocate a 32MB ISM segment and walk it page by page looking for physical addresses within the first 4GB, keeping track of which lgrp each such page comes from to build a list of candidate lgrps; I then pick the candidate with the most free memory to allocate from. Once the information is collected I destroy the ISM segment. In my case the winner turns out to be cpu0, since this is an Opteron box and, as Jonathan pointed out previously, the lower PA range is associated with cpu0. The discovery algorithm is not perfect, since it depends on ISM segment placement being randomized, but hopefully that is one aspect of the implementation that will not change anytime soon.
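
The scan itself is just meminfo(2) over the segment's pages, along these lines (simplified sketch, not my actual code; scan_segment and MAX_LGRPS are names I made up for the example):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <inttypes.h>
    #include <unistd.h>

    #define MAX_LGRPS 64    /* arbitrary bound for the example */

    /*
     * Walk an attached segment page by page and count, per lgroup,
     * how many pages are backed by physical memory below 4GB.
     */
    void
    scan_segment(char *addr, size_t len, size_t low_pages[MAX_LGRPS])
    {
        const uint_t req[] = { MEMINFO_VPHYSICAL, MEMINFO_VLGRP };
        long pgsz = sysconf(_SC_PAGESIZE);
        size_t off;

        for (off = 0; off < len; off += pgsz) {
            uint64_t in = (uint64_t)(uintptr_t)(addr + off);
            uint64_t out[2];
            uint_t valid;

            if (meminfo(&in, 1, req, 2, out, &valid) == -1)
                continue;
            /* bit 0: address valid; bits 1-2: the two info requests */
            if ((valid & 0x7) != 0x7)
                continue;
            if (out[0] < (1ULL << 32) && out[1] < MAX_LGRPS)
                low_pages[out[1]]++;    /* below-4GB page in this lgrp */
        }
    }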


There may be a slightly better way to allocate memory from the lgroup containing the least significant physical memory using lgrp_affinity_set(3LGRP), meminfo(2), and madvise(MADV_ACCESS_LWP).
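
Roughly, the homing step would look something like this (untested sketch; place_on_lgrp is just an illustrative name, and the target lgroup id would come from whatever discovery step you use):

    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <sys/mman.h>
    #include <string.h>

    /*
     * Home the calling LWP on 'target', ask for next-touch placement of
     * [addr, addr+len), and touch the pages from the re-homed thread.
     * Link with -llgrp.
     */
    void
    place_on_lgrp(lgrp_id_t target, char *addr, size_t len)
    {
        (void) lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG);
        (void) madvise(addr, len, MADV_ACCESS_LWP);
        (void) memset(addr, 0, len);
    }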

Unfortunately, none of these ways are guaranteed to always allocate memory in the least significant 4 gigabytes of physical memory. For example, if the node with the least significant physical memory contains more than 4 gigabytes of RAM, physical memory may come from this node, but be above the least significant 4 gigabytes. :-(

I spoke to one of our I/O guys to see whether there is a prescribed way to allocate physical memory in the least significant 4 gig for DMA from userland. Solaris doesn't provide one. The philosophy is that users shouldn't have to care, and that the I/O buffer will be copied to/from the least significant physical memory inside the kernel if the device can only DMA that far. I think that you may be able to write a driver that allocates physical memory in the desired range and exposes it with mmap(2) or ioctl(2).
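
If you did go the driver route, the 32-bit reachability would be expressed through the DMA attributes the driver hands to the DDI, something along these lines (sketch only; the name and values are illustrative):

    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    /*
     * Illustrative DMA attributes limiting allocations to the low 4GB.
     * Memory allocated against these via ddi_dma_mem_alloc(9F) could
     * then be exported to userland through the driver's mmap/devmap
     * entry points.
     */
    static ddi_dma_attr_t low4g_dma_attr = {
        DMA_ATTR_V0,            /* dma_attr_version */
        0x0000000000000000ULL,  /* dma_attr_addr_lo */
        0x00000000FFFFFFFFULL,  /* dma_attr_addr_hi: stay below 4GB */
        0x00000000FFFFFFFFULL,  /* dma_attr_count_max */
        1,                      /* dma_attr_align */
        1,                      /* dma_attr_burstsizes */
        1,                      /* dma_attr_minxfer */
        0x00000000FFFFFFFFULL,  /* dma_attr_maxxfer */
        0x00000000FFFFFFFFULL,  /* dma_attr_seg */
        1,                      /* dma_attr_sgllen */
        1,                      /* dma_attr_granular */
        0                       /* dma_attr_flags */
    };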

I'd like to understand more about your situation to get a better idea of the constraints and whether there is a viable solution or the need for one.

What is your device? Can you afford to let the kernel copy the user's buffer into another buffer which is below 4 gig to DMA to your device? If not, would it make sense to write your own driver?


By the way, the lgrp API, meminfo(2), and the information provided by everyone made it pretty easy. Thanks a bunch. Once we start playing around with our performance suites I may have further feedback on the tools and/or questions.


I'm glad that you found the lgroup APIs and meminfo(2) easy to use.

Please let us know whether you have any more questions or feedback.



Jonathan


On 9/16/05, Marc Rocas <[EMAIL PROTECTED]> wrote:

    Eric, Bart, and Jonathan,

    Thanks for your quick replies. See my comments below:

    On 9/16/05, jonathan chew <[EMAIL PROTECTED]> wrote:

        Marc Rocas wrote:

I've been playing around with the tools on a Stinger box and I think
they're pretty cool!


        I'm glad that you like them.  We like them too and think that
        they are fun to play with besides being useful for observability
        and experimenting with performance.  I hope that our MPO
        overview, man pages, and our examples on how to use the tools
        were helpful.


    All of the above were pretty useful in getting up to speed and I
    will use them to do performance experiments. I simply need to bring
    up our system on the Stinger box in order to run our perf suites.

The one question I have is whether ISM (Intimate Shared Memory)
segments are immune to being coerced to relocate via pmadvise(1)?
I've tried it without success. A quick look at the seg_spt.c code
seemed to indicate that when an spt segment is created, its lgroup
policy is set to LGRP_MEM_POLICY_DEFAULT, which results in randomized
allocation for segments greater than 8MB. I've verified as much using
the NUMA-enhanced pmap(1) command.


        It sounds like Eric Lowe has a theory as to why
        madvise(MADV_ACCESS_*) and pmadvise(1) didn't work for migrating
        your ISM segment.  Joe and Nils are experts on the x86/AMD64 HAT
        and may be able to comment on Eric's theory that the lack of
        dynamic ISM unmap is preventing page migration from working.

        I'll see whether I can reproduce the problem.

        Which version of Solaris are you using?


    Solaris 10 GA with no kernel patches applied.

I have an ISM segment that gets consumed by HW that is limited to
32-bit addressing, and thus I need to control the physical range that
backs the segment. At this point, it would seem that I need to
allocate the memory (about 300MB) in the kernel and map it back to
user-land, but I would lose the use of 2MB pages since I have not
quite figured out how to allocate memory using a particular page
size. Have I misinterpreted the code? Do I have other options?


        Bart's suggestion is the simplest (brute force) way.  Eric's
        suggestion sounded a little painful.  I have a couple of other
        options below, but can't think of a nice way to specify that
        your application needs low physical memory.  So, I want to
        understand what you are doing better to see if there is any
        better way.

        Can you please tell me more about your situation and
        requirements?  What is the hardware that needs the 32-bit
        addressing (framebuffer, device needing DMA, etc.)?  Besides
        needing to be in low physical memory, does it need to be wired
        down and shared?


    A device needing DMA. It needs to be wired down as well as shared.
    We're running fine with Bart's suggestion but still need a way to
    park the segment in the lower 4GB PA range since we actually want
    to run experiments with a minimum of 8GB of RAM.

        Jonathan

        PS
        Assuming that you can change your code, here are a couple of
        other options that are less painful than Eric's suggestion but
        still aren't very elegant because of the low physical memory
        requirement:

        - Use DISM instead of ISM, which is Dynamic Intimate Shared
        Memory and is pageable (see the SHM_PAGEABLE flag to shmat(2))

        OR

        - Use mmap(2) and the MAP_ANON (and MAP_SHARED if you need
        shared memory) flag to allocate (shared) anonymous memory

        - Call memcntl(2) with MC_HAT_ADVISE to specify that you want
        large pages

        AND

        - Call madvise(MADV_ACCESS_LWP) on your mmap-ed or DISM segment
        to say that the next thread to access it will use it a lot

        - Access it from CPU 0 on your Stinger.  I don't like this part
        because this is hardware implementation specific.  It turns out
        that the low physical memory usually lives near CPU 0 on an
        Opteron box.  You can use liblgrp(3LIB) to discover which leaf
        lgroup contains CPU 0 and lgrp_affinity_set(3LGRP) to set a
        strong affinity for that lgroup (which will set the home lgroup
        for the thread to that lgroup).  Alternatively, you can use
        processor_bind(2) to bind/unbind to CPU 0.

        - Use the MC_LOCK flag to memcntl(2) to lock down the memory
        for your segment if you want the physical memory to stay there
        (until you unlock it)
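
        Putting the mmap-based variant together, the whole sequence
        might look roughly like this (untested sketch; the function
        name and size are illustrative, error handling is omitted, and
        the 2MB page size is just assumed to be available):

        #include <sys/types.h>
        #include <sys/mman.h>
        #include <sys/processor.h>
        #include <sys/procset.h>
        #include <string.h>

        #define SEG_SIZE (300UL * 1024 * 1024)  /* illustrative size */
        #define PGSZ_2M  (2UL * 1024 * 1024)

        void *
        alloc_low_candidate(void)
        {
            struct memcntl_mha mha;
            char *addr;

            /* MAP_ALIGN uses 'addr' as the requested alignment (2MB). */
            addr = mmap((void *)PGSZ_2M, SEG_SIZE, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED | MAP_ALIGN, -1, 0);

            /* Ask the HAT to back the range with 2MB pages. */
            mha.mha_cmd = MHA_MAPSIZE_VA;
            mha.mha_flags = 0;
            mha.mha_pagesize = PGSZ_2M;
            (void) memcntl(addr, SEG_SIZE, MC_HAT_ADVISE,
                (caddr_t)&mha, 0, 0);

            /* Bind to CPU 0 so this thread is homed in its lgroup. */
            (void) processor_bind(P_LWPID, P_MYID, 0, NULL);

            /* Next-touch placement, then touch and lock the pages. */
            (void) madvise(addr, SEG_SIZE, MADV_ACCESS_LWP);
            (void) memset(addr, 0, SEG_SIZE);
            (void) memcntl(addr, SEG_SIZE, MC_LOCK, 0, 0, 0);

            /* Unbind once the memory is placed and locked. */
            (void) processor_bind(P_LWPID, P_MYID, PBIND_NONE, NULL);

            return (addr);
        }

        The lgrp_affinity_set(3LGRP) route would simply replace the
        processor_bind(2) calls.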

    I will try using DISM, madvise, and processor_bind as that will
    allow me to avoid changing the value of lgrp_shm_random_thresh.

    Lastly, I tried relocating the memory by following the
    instructions in Alexander Kolbasov's blog (Memory Placement): I
    wrote a simple app that attaches to the existing segment with a
    few sleep() calls to give me time to type the following commands:

    # pmap -Ls $pid | fgrep "ism shmid=0x0"
    E9E00000    2048K    2M  rwxsR    1   [ ism shmid=0x0 ]
    ...
    # plgrp -S 1 $pid
    # pmadvise -o E9E00000=access_lwp $pid

    An initial sleep of 3 minutes gives me time to run the above
    commands; the app then does a few writes to the segment and sleeps
    for another minute so that I can invoke pmap -Ls and verify
    whether the segment migrated or not.

    I'll try the experiments on Monday when I'm back at the office.

    Again, thanks for all the help.

    --Marc



