Re: [perf-discuss] Re: Puzzling scheduler behavior
Dave,

It sounds like you have an interesting application. You might want to create a processor set, leave some CPUs outside the psrset for other threads to run on, and run your application in a processor set to minimize interference from other threads. As long as there are enough CPUs for your application in the psrset, you should see the number of migrations go down because there won't be any interference from other threads.

To get a better understanding of the Solaris performance optimizations done for NUMA, you might want to check out the overview of Memory Placement Optimization (MPO) at: http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO. Binding to a processor set containing one CPU set the home lgroup of the thread to the lgroup containing that CPU, and destroying the psrset just left the thread homed there. Your shared memory is probably spread across the system already because the default MPO memory allocation policy for shared memory is to allocate the memory from random lgroups across the system.

We have some prototype observability tools which allow you to examine the lgroup hierarchy and its contents and observe and/or control how the threads and memory are placed among lgroups (see http://opensolaris.org/os/community/performance/numa/observability/). The source, binaries, and man pages are there.

Jonathan

David McDaniel (damcdani) wrote: Very, very enlightening, Eric. It's really terrific to have this kind of channel for dialog. The "return to home base" behavior you describe is clearly consistent with what I see and makes perfect sense. Let me follow up with a question. In this application, processes have not only their "own" memory, i.e., heap, stack, program text and data, etc., but they also share a moderately large (~2-5GB today) amount of memory in the form of mmap'd files. From Sherry Moore's previous posts, I'm assuming that at startup time that would actually be all allocated in one board. Since I'm contemplating moving processes onto psrsets off that board, would it be plausible to assume that I might get slightly better net throughput if I could somehow spread that across all the boards? I know it's speculation of the highest order, so maybe my real question is whether that's even worth testing. In any case, I'd love to turn the knob you mention and I'll look on the performance community page and see what kind of trouble I can get into. If there are any particular items you think I should check out, guidance is welcome. Regards -d

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe Sent: Thursday, September 01, 2005 1:48 AM To: perf-discuss@opensolaris.org Subject: [perf-discuss] Re: Puzzling scheduler behavior

Hi David, Since your v1280 system has NUMA characteristics, the bias that you see for one of the boards may be a result of the kernel trying to run your application's threads "close" to where they have allocated their memory. We also generally try to keep threads in the same process together, since they generally tend to work on the same data. This might explain why one of the boards is so much busier than the others. So yes, the interesting piece of this seems to be the higher than expected run queue wait time (latency) as seen via prstat -Lm.

Even with the thread-to-board/memory affinity I mentioned above, it generally shouldn't be the case that threads are willing to hang out on a run queue waiting for a CPU in their "home" lgroup when they *could* actually run immediately on a "remote" (off-board) CPU. Better to run remotely than not at all, or so the saying goes :) In the case where a thread is dispatched remotely because all home CPUs are busy, the thread will try to migrate back home the next time it comes through the dispatcher and finds it can run immediately at home (either because there's an idle CPU, or because one of the running threads is lower priority than us, and we can preempt it). This migrating around means that the thread will tend to spend more time waiting on run queues, since it has to either wait for the idle() thread to switch off, or for the lower priority thread it's able to preempt to surrender the CPU. Either way, the thread shouldn't have to wait long to get the CPU, but it will have to wait a non-zero amount of time. What does the prstat -Lm output look like exactly? Is it a lot of wait time, or just more than you would expect?

By the way, just to be clear, when I say "board" what I should be saying is lgroup (or locality group). This is the Solaris abstraction for a set of CPU and memory resources that are close to one another. On your system, it turns out that the kernel creates an lgroup for each board, and each thread is given an affinity for one of the lgroups, such that it will try to run on the CPUs (and allocate memory) from that group.
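To make the memory-spreading idea above concrete, here is a hedged sketch of how an application can ask for an mmap'd file's memory to be spread across lgroups using madvise(3C) with MADV_ACCESS_MANY, the advice that requests the random placement policy normally applied to large shared memory. The file path and mapping are only illustrative, and whether spreading actually improves throughput on the v1280 would have to be measured.

    /* Sketch: map a data file and advise the kernel to spread its memory
     * across lgroups (MADV_ACCESS_MANY).  The path is only illustrative.
     */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *path = "/data/shared.dat";  /* hypothetical file */
            struct stat st;
            caddr_t addr;
            int fd;

            if ((fd = open(path, O_RDWR)) < 0) {
                    perror("open");
                    return (1);
            }
            if (fstat(fd, &st) != 0) {
                    perror("fstat");
                    return (1);
            }
            addr = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }
            /* Treat the whole mapping as accessed by many threads so its
             * physical memory is spread among lgroups. */
            if (madvise(addr, (size_t)st.st_size, MADV_ACCESS_MANY) != 0)
                    perror("madvise(MADV_ACCESS_MANY)");
            /* ... use the mapping here ... */
            (void) munmap(addr, (size_t)st.st_size);
            (void) close(fd);
            return (0);
    }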
Re: [perf-discuss] Re: Puzzling scheduler behavior
David McDaniel (damcdani) wrote: Thanks, Jonathan, for the good insights. I'll be digging into the references you mentioned. Yes, at the end of the day I'm sure binding to processor sets is part of the plan; having already done so in a rather rote way I can demonstrate a very dramatic reduction in apparent cpu utilization, on the order of 25-30%. But before I commit engineers to casting something in stone I want to make sure I understand the defaults and the side effects of doing so since it potentially results in defeating other improvements that Sun has done or will be doing.

Sure. No problem. The overview and man pages for our tools are pretty short. The tools are very easy to use and kind of fun to play with. I'm going to try to post a good example of how to use them later today.

I think that using a psrset is an interesting experiment to see whether interference is a big factor in all the migrations. It would be nice not to have to do that by default though. It sounds like you already tried this experiment though and noticed a big difference. Did the migrations drop dramatically? What else is running on the system when you don't use a psrset?

Jonathan
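To make the processor-set plan concrete, here is a hedged sketch of carving out a psrset programmatically with pset_create(2), pset_assign(2), and pset_bind(2) rather than the psrset(1M) command. The CPU ids are placeholders, and the calls require sufficient privileges.

    /* Sketch: create a processor set, move a few CPUs into it, and bind
     * the calling process to it.  CPU ids 4-7 are only an example; pick
     * CPUs that leave enough outside the set for other work.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/pset.h>
    #include <stdio.h>

    int
    main(void)
    {
            psetid_t pset;
            processorid_t cpus[] = { 4, 5, 6, 7 };  /* placeholder CPU ids */
            int i;

            if (pset_create(&pset) != 0) {          /* needs privileges */
                    perror("pset_create");
                    return (1);
            }
            for (i = 0; i < 4; i++) {
                    if (pset_assign(pset, cpus[i], NULL) != 0)
                            perror("pset_assign");
            }
            /* Bind this process (and all of its LWPs) to the new set. */
            if (pset_bind(pset, P_PID, P_MYID, NULL) != 0) {
                    perror("pset_bind");
                    return (1);
            }
            /* ... run the application workload here ... */
            return (0);
    }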
Re: [perf-discuss] Re: Puzzling scheduler behavior
Dave,

Sorry, I forgot to reply to this sooner. Yes, I was just curious what else was running to see whether we would expect your application to be perturbed much. There could be a load imbalance due to the daemons throwing everything off once in a while. This could be affecting how the threads in your application are distributed across the nodes in your NUMA machine.

Each thread is assigned a home locality group upon creation and the kernel will tend to run it on CPUs in its home lgroup and allocate its memory there to minimize latency and maximize performance by default. There is an lgroup corresponding to each of the nodes (boards) in your NUMA machine. The assignment of threads to lgroups is based on lgroup load averages, so other threads may cause the lgroup load average to go up or down and thus affect how threads are placed among lgroups.

You can use plgrp(1), which is available on our NUMA observability web page at http://opensolaris.org/os/community/performance/numa/observability, to see where your application processes/threads are homed. Then we can see whether they are distributed very well. You can also use plgrp(1) to change the home lgroup of a thread, but should be careful because there can be side effects as explained in the example referred to below. There are man pages, source, and binaries for our tools on the web page. I wrote up a good example of how to use the tools to understand, observe, and affect thread and memory placement among lgroups on a NUMA machine and posted it on the web page in http://opensolaris.org/os/community/performance/example.txt.

You can also try using the lgrp_expand_proc_thresh tunable that Eric suggested last week. Are the migrations that you are seeing when not running a psrset causing a performance problem for your application?

Jonathan

David McDaniel (damcdani) wrote: When using psrsets, the migrations and involuntary context switches go essentially to zero. As far as "other stuff", not quite sure what you mean, but this application runs on a dedicated server so there is no stuff of a casual nature; however, there is a lot of what I'll glom into the category of "support" tasks, i.e., ntp daemons, nscd flushing caches, fsflush running around backing up pages, etc. Was that what you meant?
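Along the same lines as the plgrp(1) suggestion, here is a small hedged example that uses liblgrp(3LIB) directly to print the calling thread's home lgroup and walk the lgroup hierarchy, listing the CPUs each lgroup directly contains. Only documented calls are used (lgrp_init, lgrp_home, lgrp_root, lgrp_children, lgrp_cpus); compile with -llgrp.

    /* Sketch: print this thread's home lgroup and the CPUs directly
     * contained in each lgroup of the hierarchy.
     */
    #include <sys/types.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void
    print_lgrp(lgrp_cookie_t cookie, lgrp_id_t lgrp)
    {
            int ncpus, nchildren, i;
            processorid_t *cpus;
            lgrp_id_t *children;

            ncpus = lgrp_cpus(cookie, lgrp, NULL, 0, LGRP_CONTENT_DIRECT);
            printf("lgroup %d: %d CPUs directly contained\n", (int)lgrp, ncpus);
            if (ncpus > 0) {
                    cpus = malloc(ncpus * sizeof (processorid_t));
                    (void) lgrp_cpus(cookie, lgrp, cpus, ncpus,
                        LGRP_CONTENT_DIRECT);
                    for (i = 0; i < ncpus; i++)
                            printf("\tCPU %d\n", (int)cpus[i]);
                    free(cpus);
            }
            nchildren = lgrp_children(cookie, lgrp, NULL, 0);
            if (nchildren > 0) {
                    children = malloc(nchildren * sizeof (lgrp_id_t));
                    (void) lgrp_children(cookie, lgrp, children, nchildren);
                    for (i = 0; i < nchildren; i++)
                            print_lgrp(cookie, children[i]);
                    free(children);
            }
    }

    int
    main(void)
    {
            lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

            if (cookie == LGRP_COOKIE_NONE) {
                    perror("lgrp_init");
                    return (1);
            }
            printf("home lgroup of this thread: %d\n",
                (int)lgrp_home(P_LWPID, P_MYID));
            print_lgrp(cookie, lgrp_root(cookie));
            (void) lgrp_fini(cookie);
            return (0);
    }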
Re: [perf-discuss] NUMA ptools and ISM segments
Marc Rocas wrote: I've been playing around with the tools on a Stinger box and I think they're pretty cool!

I'm glad that you like them. We like them too and think that they are fun to play with besides being useful for observability and experimenting with performance. I hope that our MPO overview, man pages, and our examples on how to use the tools were helpful.

The one question I have is whether ISM (Intimate Shared Memory) segments are immune to being coerced to relocate via pmadvise(1)? I've tried it without success. A quick look at the seg_spt.c code seemed to indicate that when an spt segment is created its lgroup policy is set to LGRP_MEM_POLICY_DEFAULT that will result in randomized allocation for segments greater than 8MB. I've verified as much using the NUMA enhanced pmap(1) command.

It sounds like Eric Lowe has a theory as to why madvise(MADV_ACCESS_*) and pmadvise(1) didn't work for migrating your ISM segment. Joe and Nils are experts on the x86/AMD64 HAT and may be able to comment on Eric's theory that the lack of dynamic ISM unmap is preventing page migration from working. I'll see whether I can reproduce the problem. Which version of Solaris are you using?

I have an ISM segment that gets consumed by HW that is limited to 32-bit addressing and thus have a need to control the physical range that backs the segment. At this point, it would seem that I need to allocate the memory (about 300MB) in the kernel and map it back to user-land, but I would lose the use of 2MB pages since I have not quite figured out how to allocate memory using a particular page size. Have I misinterpreted the code? Do I have other options?

Bart's suggestion is the simplest (brute force) way. Eric's suggestion sounded a little painful. I have a couple of other options below, but can't think of a nice way to specify that your application needs low physical memory. So, I want to understand what you are doing better to see if there is any better way. Can you please tell me more about your situation and requirements? What is the hardware that needs the 32-bit addressing (framebuffer, device needing DMA, etc.)? Besides needing to be in low physical memory, does it need to be wired down and shared?

Jonathan

PS Assuming that you can change your code, here are a couple of other options that are less painful than Eric's suggestion but still aren't very elegant because of the low physical memory requirement:

- Use DISM instead of ISM, which is Dynamic Intimate Shared Memory and is pageable (see the SHM_PAGEABLE flag to shmat(2)), OR
- Use mmap(2) and the MAP_ANON (and MAP_SHARED if you need shared memory) flag to allocate (shared) anonymous memory
- Call memcntl(2) with MC_HAT_ADVISE to specify that you want large pages, AND
- Call madvise(MADV_ACCESS_LWP) on your mmap-ed or DISM segment to say that the next thread to access it will use it a lot
- Access it from CPU 0 on your Stinger. I don't like this part because this is hardware implementation specific. It turns out that the low physical memory usually lives near CPU 0 on an Opteron box. You can use liblgrp(3LIB) to discover which leaf lgroup contains CPU 0 and lgrp_affinity_set(3LGRP) to set a strong affinity for that lgroup (which will set the home lgroup for the thread to that lgroup). Alternatively, you can use processor_bind(2) to bind/unbind to CPU 0.
- Use the MC_LOCK flag to memcntl(2) to lock down the memory for your segment if you want the physical memory to stay there (until you unlock it)

___ perf-discuss mailing list perf-discuss@opensolaris.org
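To make the DISM option above concrete, here is a hedged sketch that strings the suggested calls together: home the calling thread in a chosen lgroup with lgrp_affinity_set(3LGRP), create a pageable DISM segment, ask for 2MB pages with memcntl(2) MC_HAT_ADVISE, apply next-touch advice with madvise(MADV_ACCESS_LWP), touch the memory, and lock it down with MC_LOCK. The segment size, target lgroup id, and page size are placeholders, error handling is abbreviated, and locking memory requires privilege. Compile with -llgrp.

    /* Sketch: DISM segment with large pages, next-touch placement, and
     * locking.  Sizes, the target lgroup id, and the 2MB page size are
     * illustrative only.
     */
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>
    #include <string.h>
    #include <stdio.h>

    int
    main(void)
    {
            size_t size = 300UL * 1024 * 1024;      /* ~300MB, as in the thread */
            lgrp_id_t target = 1;                   /* placeholder lgroup id */
            struct memcntl_mha mha;
            caddr_t addr;
            int id;

            /* Home this thread in the lgroup believed to hold low memory. */
            if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) != 0)
                    perror("lgrp_affinity_set");

            /* Create a pageable (DISM) segment. */
            if ((id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600)) == -1) {
                    perror("shmget");
                    return (1);
            }
            if ((addr = shmat(id, NULL, SHM_PAGEABLE)) == (caddr_t)-1) {
                    perror("shmat");
                    return (1);
            }

            /* Ask for 2MB pages for this mapping. */
            mha.mha_cmd = MHA_MAPSIZE_VA;
            mha.mha_flags = 0;
            mha.mha_pagesize = 2 * 1024 * 1024;
            if (memcntl(addr, size, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0) != 0)
                    perror("memcntl(MC_HAT_ADVISE)");

            /* Next-touch: the next thread to touch it should get it locally. */
            if (madvise(addr, size, MADV_ACCESS_LWP) != 0)
                    perror("madvise(MADV_ACCESS_LWP)");

            memset(addr, 0, size);                  /* touch from the homed thread */

            /* Pin the physical memory where it was just allocated. */
            if (memcntl(addr, size, MC_LOCK, 0, 0, 0) != 0)
                    perror("memcntl(MC_LOCK)");
            return (0);
    }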
Re: [perf-discuss] NUMA ptools and ISM segments
PPS I forgot to ask what you did to test whether pmadvise(1) would migrate your ISM segment and how you know that it didn't.

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Re: Puzzling scheduler behavior
David McDaniel (damcdani) wrote: Thanks for the feedback, Jonathan. I've got it on my todo list to get those tools and go spelunking a bit. I can't really say that we have a performance problem; it's more along the lines of me trying to use the greatly improved observability tools in Solaris to get a better understanding of things. In any case, it's pretty much relegated to a science project right now because we can't ship anything that's not part of some "official" distribution.

Ok. The tools are pretty easy to use. If you have any questions, we would be happy to help and welcome any feedback on the tools or documentation.

When you say that you can't ship anything that's not part of some "official" distribution, are you referring to our tools or your software? I am suggesting using our tools to understand the behavior of your application and its interaction with the operating system better and determine whether there is a problem or not. If there is a problem in the OS, we can try to fix the default behavior. As Sasha pointed out, it is our intention to ship our observability tools, but we wanted to let the OpenSolaris community try them first to see whether they are useful.

Last but not least, we can try running your application if you want.

Jonathan
Re: [perf-discuss] NUMA ptools and ISM segments
Joe Bonasera wrote: jonathan chew wrote: It sounds like Eric Lowe has a theory as to why madvise(MADV_ACCESS_*) and pmadvise(1) didn't work for migrating your ISM segment. Joe and Nils are experts on the x86/AMD64 HAT and may be able to comment on Eric's theory that the lack of dynamic ISM unmap is preventing page migration from working.

Ok.. I've missed the start of this thread, having been distracted by a million other things just now.. Can someone recap the problem and what Eric's theory was?

I've attached Eric Lowe's email. The relevant part was the following:

segspt_shmadvise() implements the NUMA migration policies for all shared memory types so next touch (MADV_ACCESS_LWP) should work. Unfortunately, the x64 HAT layer does not implement dynamic ISM unmap. Since the NUMA migration code is driven by minor faults (the definition of next touch depends on that) I suspect that is why madvise() does not work for ISM on that machine.

Jonathan

--- Begin Message ---

On Thu, Sep 15, 2005 at 11:55:21PM -0400, Marc Rocas wrote: | | The one question I have is whether ISM (Intimate Shared Memory) segments are | immune to being coerced to relocate via pmadvise(1)? I've tried it without | success. A quick look at the seg_spt.c code seemed to indicate that when an | spt segment is created its lgroup policy is set to LGRP_MEM_POLICY_DEFAULT | that will result in randomized allocation for segments greater than 8MB. | I've verified as much using the NUMA enhanced pmap(1) command.

You are correct about the policy. Since shared memory is usually, uh, shared :) we spread it around to prevent hot-spotting. segspt_shmadvise() implements the NUMA migration policies for all shared memory types so next touch (MADV_ACCESS_LWP) should work. Unfortunately, the x64 HAT layer does not implement dynamic ISM unmap. Since the NUMA migration code is driven by minor faults (the definition of next touch depends on that) I suspect that is why madvise() does not work for ISM on that machine.

| I have an ISM segment that gets consumed by HW that is limited to 32-bit | addressing and thus have a need to control the physical range that backs the | segment. At this point, it would seem that I need to allocate the memory | (about 300MB) in the kernel and map it back to user-land but I would lose | the use of 2MB pages since I have not quite figured out how to allocate | memory using a particular page size. Have I misinterpreted the code? Do I | have other options?

The twisted maze of code for ISM allocates its memory (ironically, since it's always locked and hence doesn't need swap) through anon. There is no way to restrict the segment to a specific PA range or otherwise impact the allocation path until the pages are already (F_SOFTLOCK) faulted in. The NUMA code might help in some cases because the first lgrp happens to fall under 4G :) so you can probably hack your way through this at the application layer by changing the shared memory random threshold to ULONG_MAX in /etc/system, binding your application thread to a CPU in the correct lgroup which has memory only below 4G, and then doing the shmat() from there. It's a ginormous hack, but it will get you the results you want. :) Once sysV shared memory is redesigned to not rely on the anon layer, doing this sort of thing in the kernel should become a lot easier. Being able to specify where user memory ends up in PA space to avoid copying it on I/O is an RFE we simply have never thought about before.
The lgroup code is careful to avoid specifics like memory addresses, since lgrp-physical mappings are very machine specific, so making this sort of thing work would require adding a whole new set of segment advice ops which are used by the physical memory allocator itself; page_create_va() takes the segment as one of its arguments, so if we stuffed the segment PA range advice into the segment we could dig it up down there and request memory in the "correct" range from freelists.

-- Eric Lowe Solaris Kernel Development Austin, Texas Sun Microsystems. We make the net work. x64155 / +1 (512) 401-1155

___ perf-discuss mailing list perf-discuss@opensolaris.org --- End Message --- ___ perf-discuss mailing list perf-discuss@opensolaris.org
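A hedged sketch of the application-level hack Eric describes (after raising the shared memory random threshold in /etc/system) might look like the following: bind the attaching LWP to a CPU in the lgroup whose memory is below 4G and do the shmat(2) from there. The CPU id and segment size are placeholders, and SHM_SHARE_MMU is the flag that requests ISM.

    /* Sketch: bind to a CPU in the lgroup with low physical memory and
     * attach the ISM segment from there so its pages come from that node.
     * CPU id 0 and the segment size are placeholders.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int
    main(void)
    {
            processorid_t low_cpu = 0;              /* CPU in the <4G lgroup */
            size_t size = 300UL * 1024 * 1024;
            void *addr;
            int id;

            /* Bind this LWP so the ISM pages are allocated "here". */
            if (processor_bind(P_LWPID, P_MYID, low_cpu, NULL) != 0) {
                    perror("processor_bind");
                    return (1);
            }
            if ((id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600)) == -1) {
                    perror("shmget");
                    return (1);
            }
            if ((addr = shmat(id, NULL, SHM_SHARE_MMU)) == (void *)-1) {
                    perror("shmat");
                    return (1);
            }
            printf("ISM segment attached at %p\n", addr);
            /* Optionally unbind once the segment is faulted in and locked. */
            (void) processor_bind(P_LWPID, P_MYID, PBIND_NONE, NULL);
            return (0);
    }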
Re: [perf-discuss] NUMA ptools and ISM segments
Joe Bonasera wrote: The x86 HAT (D)ISM code is tricky. Here's why: pagesize is 4K, a large page is 2Meg (usually). So a pagetable covers either 512 4K pages aligned at 2M or 512 2M pages aligned at 1Gig. To share a page table, the (D)ISM segment has to be either a multiple of 512 4K pages aligned to a 2Meg boundary or a multiple of 512 2Meg pages aligned to a 1Gig boundary. When those alignment / size restrictions aren't met, then we can't really share the pagetables. So when you attach to the (D)ISM segment, we copy the values needed for the mappings into process local pagetables, i.e., not really shared.

Thanks for pointing out the alignment and size restrictions for (D)ISM segments. If you specify 0 for the address to shmat(2) and use a multiple of 2 meg or 1 gig for the size, should you get a segment that is properly aligned?

I suspect the minor fault stuff doesn't work right, because to unload the mappings you'd have to hat_unshare() in all processes that have the DISM mapped in order to migrate it.

Is hat_pageunload() sufficient?

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
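As a quick empirical check on the alignment question, a small hedged sketch like the one below creates a segment whose size is a multiple of 2MB, lets shmat(2) pick the address, and reports whether the returned address is 2MB aligned. This only shows the user-visible alignment; whether the pagetables end up shared is still up to the HAT.

    /* Sketch: check whether a 2MB-multiple (D)ISM segment attached at a
     * kernel-chosen address comes back 2MB aligned.
     */
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            size_t twomeg = 2UL * 1024 * 1024;
            size_t size = 64 * twomeg;              /* multiple of 2MB */
            void *addr;
            int id;

            if ((id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600)) == -1) {
                    perror("shmget");
                    return (1);
            }
            /* SHM_PAGEABLE gives DISM; use SHM_SHARE_MMU for ISM instead. */
            if ((addr = shmat(id, NULL, SHM_PAGEABLE)) == (void *)-1) {
                    perror("shmat");
                    return (1);
            }
            printf("attached at %p, 2MB aligned: %s\n", addr,
                ((uintptr_t)addr % twomeg) == 0 ? "yes" : "no");
            return (0);
    }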
Re: [perf-discuss] NUMA ptools and ISM segments
Marc Rocas wrote: All, I ended up creating a DISM segment, migrating the creating thread to the lgrp containing cpu0, madvising, touching the memory, and MC_LOCK'ing per Jonathan's suggestions and it all worked out. The memory was exclusively allocated from lgrp 1 (cpu0) using 2M pages per the output of 'pmap -Ls $pid'. To not completely hard-code cpu0, I allocated a 32MB ISM segment and walked it page by page looking for PA that is within the first 4GB, keeping track of which lgrp it comes from, to generate a list of potential lgrps to use and pick the one that has the most free memory to allocate from. Once I've collected the information I destroy the ISM segment. In my case, it turns out to be cpu0 as it is an Opteron box and, as Jonathan pointed out previously, the lower PA range is associated with cpu0. The discovery algorithm is not perfect as it depends on ISM segments being randomized, but hopefully that's one aspect of the implementation that will not change anytime soon.

There may be a slightly better way to allocate memory from the lgroup containing the least significant physical memory using lgrp_affinity_set(3LGRP), meminfo(2), and madvise(MADV_ACCESS_LWP). Unfortunately, none of these ways are guaranteed to always allocate memory in the least significant 4 gigabytes of physical memory. For example, if the node with the least significant physical memory contains more than 4 gigabytes of RAM, physical memory may come from this node, but be above the least significant 4 gigabytes. :-(

I spoke to one of our I/O guys to see whether there is a prescribed way to allocate physical memory in the least significant 4 gig for DMA from userland. Solaris doesn't provide an existing way. The philosophy is that users shouldn't have to care and that the I/O buffer will be copied to/from the least significant physical memory inside the kernel if the device can only reach that far to DMA. I think that you may be able to write a driver that can allocate physical memory in the desired range and allow access to it with mmap(2) or ioctl(2).

I'd like to understand more about your situation to get a better idea of the constraints and whether there is a viable solution or the need for one. What is your device? Can you afford to let the kernel copy the user's buffer into another buffer which is below 4 gig to DMA to your device? If not, would it make sense to write your own driver?

By the way, the lgrp API and meminfo(2) and the information provided by everyone made it pretty easy. Thanks a bunch. Once we start playing around with our performance suites I may have further feedback on the tools and/or questions.

I'm glad that you found the lgroup APIs and meminfo(2) easy to use. Please let us know whether you have any more questions or feedback.

Jonathan

On 9/16/05, Marc Rocas <[EMAIL PROTECTED]> wrote: Eric, Bart, and Jonathan, Thanks for your quick replies. See my comments below:

On 9/16/05, jonathan chew <[EMAIL PROTECTED]> wrote: Marc Rocas wrote: I've been playing around with the tools on a Stinger box and I think they're pretty cool! I'm glad that you like them. We like them too and think that they are fun to play with besides being useful for observability and experimenting with performance. I hope that our MPO overview, man pages, and our examples on how to use the tools were helpful.

All of the above were pretty useful in getting up to speed and I will use them to do performance experiments. I simply need to bring up our system in the stinger box in order to run our perf suites.

Which version of Solaris are you using? Solaris 10 GA with no kernel patches applied.
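Marc's discovery pass can be approximated with meminfo(2) roughly as follows. This is a hedged sketch: it maps and touches an anonymous scratch region (rather than a throwaway ISM segment), asks for each page's physical address and lgroup, and counts how many pages per lgroup fall below 4GB. The scratch size, the lgroup bound, and the validity-bit handling are assumptions based on the meminfo(2) documentation.

    /* Sketch: for each page of a touched scratch mapping, look up its
     * physical address and lgroup with meminfo(2) and count pages whose
     * PA is below 4GB, per lgroup.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    #define SCRATCH (32UL * 1024 * 1024)            /* 32MB scratch region */
    #define MAXLGRP 64                              /* assumed upper bound */

    int
    main(void)
    {
            size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
            char *base = mmap(NULL, SCRATCH, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANON, -1, 0);
            uint64_t low_pages[MAXLGRP] = { 0 };
            size_t off;
            int i;

            if (base == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }
            memset(base, 0, SCRATCH);               /* fault the pages in */

            for (off = 0; off < SCRATCH; off += pgsz) {
                    uint64_t inaddr = (uintptr_t)(base + off);
                    uint_t info_req[2] = { MEMINFO_VPHYSICAL, MEMINFO_VLGRP };
                    uint64_t outdata[2];
                    uint_t validity;

                    if (meminfo(&inaddr, 1, info_req, 2, outdata, &validity) != 0)
                            continue;
                    /* bit 0: address valid; bits 1-2: each requested item valid */
                    if ((validity & 7) != 7)
                            continue;
                    if (outdata[0] < (1ULL << 32) && outdata[1] < MAXLGRP)
                            low_pages[outdata[1]]++;
            }
            for (i = 0; i < MAXLGRP; i++)
                    if (low_pages[i] != 0)
                            printf("lgroup %d: %llu pages below 4GB\n",
                                i, (unsigned long long)low_pages[i]);
            (void) munmap(base, SCRATCH);
            return (0);
    }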
Re: [perf-discuss] NUMA ptools and ISM segments
Marc Rocas wrote on 09/27/05 21:27:

> On 9/26/05, jonathan chew <[EMAIL PROTECTED]> wrote:
>
> There may be a slightly better way to allocate memory from the lgroup containing the least significant physical memory using lgrp_affinity_set(3LGRP), meminfo(2), and madvise(MADV_ACCESS_LWP).
>
> Unfortunately, none of these ways are guaranteed to always allocate memory in the least significant 4 gigabytes of physical memory. For example, if the node with the least significant physical memory contains more than 4 gigabytes of RAM, physical memory may come from this node, but be above the least significant 4 gigabytes. :-(
>
> Fortunately, we control the HW our systems are deployed on and can enforce a 4GB RAM limit if required.

Ok. That definitely makes things easier.

> I spoke to one of our I/O guys to see whether there is a prescribed way to allocate physical memory in the least significant 4 gig for DMA from userland. Solaris doesn't provide an existing way. The philosophy is that users shouldn't have to care and that the I/O buffer will be copied to/from the least significant physical memory inside the kernel if the device can only reach that far to DMA. I think that you may be able to write a driver that can allocate physical memory in the desired range and allow access to it with mmap(2) or ioctl(2).
>
> We already have such a driver but have not found a way to force the use of 2M pages! Is there a new DDI interface to request large page size?

I'm not sure, but can try to find out. What does the driver use to allocate the memory (e.g. ddi_dma_mem_alloc(9F))?

> I'd like to understand more about your situation to get a better idea of the constraints and whether there is a viable solution or the need for one.
>
> What is your device? Can you afford to let the kernel copy the user's buffer into another buffer which is below 4 gig to DMA to your device? If not, would it make sense to write your own driver?
>
> Not really. We buffer data up and need to have it DMA in real-time to our device which further processes it and passes on the processed data to another machine. The window of time once we have committed to delivering the data is strictly enforced and failure to do so effectively shuts down the other system. Way back in SunOS 5.4, we went as far as writing a pseudo driver to SOFTLOCK memory as we found that it was not enough to mlock() memory since we took page faults on the corresponding TTEs. As I noted previously in the beginning of this thread, we use our own version of physio() that assumes properly wired down memory and thus differs from the stock version in that it does not bother with the locking logic at all.

Ok. I see.

> By writing our own device driver, do you mean one to export 4GB PA range to user-land or our own segment driver?

I mean whether your driver can allocate the DMA buffer in the right place and provide access to it through mmap(2) or ioctl(2). It sounds like you have a driver like that already, but want the buffer to be on large pages in the user application. Is that right?

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
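For reference, the usual way a driver constrains where its DMA memory lands is through the ddi_dma_attr_t it passes to ddi_dma_alloc_handle(9F): setting dma_attr_addr_hi below 4GB asks for the buffer to be allocated in the low 4GB. The sketch below is hedged and abbreviated (no binding, mmap export, or teardown is shown, and the attribute values are only plausible placeholders); it deliberately does not address the open question of forcing 2MB pages.

    /* Sketch (driver side): allocate a DMA buffer constrained to the low
     * 4GB of physical memory.  Abbreviated; page size selection is not
     * addressed here.
     */
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    static ddi_dma_attr_t low4g_dma_attr = {
            DMA_ATTR_V0,            /* version */
            0x0000000000000000ULL,  /* dma_attr_addr_lo */
            0x00000000FFFFFFFFULL,  /* dma_attr_addr_hi: below 4GB */
            0x00000000FFFFFFFFULL,  /* dma_attr_count_max */
            1,                      /* dma_attr_align */
            1,                      /* dma_attr_burstsizes */
            1,                      /* dma_attr_minxfer */
            0x00000000FFFFFFFFULL,  /* dma_attr_maxxfer */
            0x00000000FFFFFFFFULL,  /* dma_attr_seg */
            1,                      /* dma_attr_sgllen */
            1,                      /* dma_attr_granular */
            0                       /* dma_attr_flags */
    };

    static ddi_device_acc_attr_t acc_attr = {
            DDI_DEVICE_ATTR_V0,
            DDI_NEVERSWAP_ACC,
            DDI_STRICTORDER_ACC
    };

    /* Called from attach(9E) or an ioctl path; dip is the devinfo node. */
    static int
    alloc_low4g_buf(dev_info_t *dip, size_t len, caddr_t *kaddrp,
        ddi_dma_handle_t *dmahp, ddi_acc_handle_t *acchp)
    {
            size_t real_len;

            if (ddi_dma_alloc_handle(dip, &low4g_dma_attr, DDI_DMA_SLEEP,
                NULL, dmahp) != DDI_SUCCESS)
                    return (DDI_FAILURE);
            if (ddi_dma_mem_alloc(*dmahp, len, &acc_attr, DDI_DMA_CONSISTENT,
                DDI_DMA_SLEEP, NULL, kaddrp, &real_len, acchp) != DDI_SUCCESS) {
                    ddi_dma_free_handle(dmahp);
                    return (DDI_FAILURE);
            }
            return (DDI_SUCCESS);
    }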
[perf-discuss] MPO design review
I have to give a presentation to update our Platform Software Architecture Review Committee (PSARC) on MPO and thought that this might be a good time to let others in the OpenSolaris (performance) community comment too.

The presentation is based on the MPO overview in: http://www.opensolaris.org/os/community/performance/mpo_overview.pdf

If you have seen that already, this presentation has an issues and improvements slide for lgroups, initial placement, scheduling, VM, APIs, and tools, and then has some new slides on the issues affecting all of MPO and on future work.

This *new* presentation is posted at: http://www.opensolaris.org/os/community/performance/numa/mpo_update.pdf or http://www.opensolaris.org/os/community/performance/numa/mpo_update.sxi

NOTE: It contains some pictures that older PDF readers may not display correctly, so I have included the StarOffice version which does display correctly just in case.

I welcome any questions, comments, or feedback regarding the past, present, or future of MPO and its design and implementation.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] lgrp kstats
Henrik Loef wrote: Hi, I'm trying to repeat some old experiments I did a couple of months ago on a SF15K machine. We used kstats (module lgrp: "pages migrated to", "pages migrated from") to measure the number of pages migrated using madvise(). The domain we used for our previous experiments has been shut down, so now we are using another domain where these kstats seem to be deactivated. Does anybody know for which releases of Solaris these counters are enabled? Note: we were involved in the beta program so the release we used might have been a beta.

I think that the lgroup kstats for pages migrated were integrated into Solaris with the changes for madvise(MADV_ACCESS_*) in Solaris 9 Update 2.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
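One way to check whether a given release has those counters is to read them directly with libkstat(3LIB). The sketch below is hedged: it assumes the lgrp module's kstats carry statistics named as in the post above ("pages migrated to" and "pages migrated from"), one kstat instance per lgroup, and it simply skips anything it cannot find. Compile with -lkstat.

    /* Sketch: dump the "pages migrated to/from" counters from the lgrp
     * kstats, if present.  Statistic names are taken from the post above.
     */
    #include <kstat.h>
    #include <string.h>
    #include <stdio.h>

    int
    main(void)
    {
            kstat_ctl_t *kc = kstat_open();
            kstat_t *ksp;

            if (kc == NULL) {
                    perror("kstat_open");
                    return (1);
            }
            /* Walk the kstat chain looking for the lgrp module's kstats. */
            for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
                    kstat_named_t *to, *from;

                    if (strcmp(ksp->ks_module, "lgrp") != 0)
                            continue;
                    if (kstat_read(kc, ksp, NULL) == -1)
                            continue;
                    to = kstat_data_lookup(ksp, "pages migrated to");
                    from = kstat_data_lookup(ksp, "pages migrated from");
                    if (to == NULL || from == NULL)
                            continue;       /* counters not in this release */
                    printf("lgroup %d: migrated to %llu, from %llu\n",
                        ksp->ks_instance,
                        (unsigned long long)to->value.ui64,
                        (unsigned long long)from->value.ui64);
            }
            (void) kstat_close(kc);
            return (0);
    }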
Re: [perf-discuss] Project sponsorship request: Tesla
Eric Saxe wrote: I'd like to ask the OpenSolaris performance community for sponsorship of Project Tesla. http://www.opensolaris.org/os/project/tesla The Tesla project seeks to provide OpenSolaris with a platform independent power management policy architecture, bringing power awareness to various kernel subsystems (including the dispatcher) that interact with power manageable system resources. I believe the performance community should have an interest in this project, since one of the project's primary goals is implementation of a "default" policy that seeks to deliver maximum system performance while consuming no more power than is necessary to do so.

Yes, I think that we should definitely sponsor this.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Project proposal: "Solaris Enhancements for AMD-based Platforms"
Ostrovsky, Boris wrote:

> I would like to propose creation of a new project titled "Solaris Enhancements for AMD-based Platforms".
>
> The project will address various features that are specific to platforms based on AMD processors, such as
> - IOMMU support
> - NUMA topology, particularly how it affects IO performance
> - Observability (performance counters, instruction-based sampling)
> - Power management
> - RAS features
> - New instruction support, new CPUID features
>
> Since what's described above covers a fairly diverse range of subjects, the proposed project will serve as an umbrella for sub-projects, each of them covering a particular area related to improving Solaris behavior on systems built around AMD processors (as well as chipsets and graphics components).
>
> I think this project would be of interest to a number of OpenSolaris communities but I am asking the performance community for sponsorship as it appears to be the most relevant.

+1 from me too.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] NUMA and interconnect transfers
Rafael Vanoni wrote:

> Hey everyone
>
> Is the kernel aware of the status of the interconnect between different NUMA nodes?

No, not currently. It just assumes that there is some interconnect between the nodes and may know the latency between them when the system is not loaded.

> For instance, when an app is transferring data from one node to another, saturating the interconnect between these two. Are we aware of this?

No.

> I haven't seen any code that collects telemetry from the interconnect hardware. It sounds like something that we should be looking at.

There may be hardware performance counters that provide some observability for this. I'm not sure though, and it definitely would depend on the processor and platform.

This is definitely something that I have been thinking about recently. I do think that it may be useful since we want to be able to use machines efficiently (eg. for power, etc.) and provide good performance by default even as the hardware is becoming more sophisticated/complicated with CMT and NUMA.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] NUMA and interconnect transfers
Rayson Ho wrote:

> On Dec 11, 2007 5:53 PM, Jonathan Chew <[EMAIL PROTECTED]> wrote:
>
>> No, not currently. It just assumes that there is some interconnect between the nodes and may know the latency between them when the system is not loaded.
>
> Jonathan, I just found out that you are the author of the MPO (Memory Placement Optimization) presentation:
>
> http://www.opensolaris.org/os/community/performance/mpo_overview.pdf

Yes. If you're interested, there is also an update to that presentation at: http://www.opensolaris.org/os/community/performance/numa/mpo_update.pdf

>> There may be hardware performance counters that provide some observability for this. I'm not sure though and it definitely would depend on the processor and platform.
>
> I was looking at the same thing too, but from an application developer's perspective. I then talked to my AMD friend, and he told me that CodeAnalyst provides a way to measure the HT bandwidth:
>
> "HyperTransport link x transmit bandwidth" -- HT links 0, 1, and 2 (i.e., whether they're used for cache-coherence inter-processor traffic, or used for non-coherent I/O)
>
> And I read this document a while ago:
> "Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems Application Note"
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40555.pdf

Thanks for the pointers. I had the funny feeling that Opteron would have counters for this, but I'm not so sure about SPARC and what will be available on Intel's Nehalem processors with QuickPath.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] NUMA and interconnect transfers
Rayson Ho wrote:

> On Dec 11, 2007 9:33 PM, Jonathan Chew <[EMAIL PROTECTED]> wrote:
>
>> Yes. If you're interested, there is also an update to that presentation at
>
> I think I've read that one too, and the chapter in the Solaris Internals book... all 3 are interesting :-D
>
>> Thanks for the pointers. I had the funny feeling that Opteron would have counters for this, but I'm not so sure about SPARC and what will be available on Intel's Nehalem processors with QuickPath.
>
> Don't know about what QuickPath will have (hmm... "QuickPath" seems to be new to me, as I was so used to the old "CSI" name). However, as it will support Nehalem as well as Tukwila, it may be interesting to see what is currently available for Itanium-2:

Yes, that seems likely. I believe that Nehalem may very well inherit some CPU hardware performance counters, etc. from existing Intel processors.

> ... we use the PMU to capture two different types of memory access data -- long latency loads and data translation lookaside buffer (DTLB) misses -- section 3: Profile Generation
>
> "Hardware Profile-guided Automatic Page Placement for ccNUMA Systems" http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/ppopp06.pdf
>
> And the Gelato ICE presentation: http://www.ice.gelato.org/apr07/pres_pdf/gelato_ICE07apr_pageplacement_mueller_ncsu.pdf

Thanks for the pointer to the paper and presentation. I only glanced through them, but they seem very similar in spirit to page migration work that was done in the 1990's by some friends at Rochester, Stanford, and SGI. In fact, I thought that IRIX had some sort of automatic page migration which may have used hardware performance counters to help decide what to migrate where.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Contributor nomination for Ashok Raj
Sherry Moore wrote:

> I'd like to nominate Ashok Raj for a contributor grant in our community.
>
> Ashok has made (and continues to make) tremendous contributions to the perf community sponsored OpenSolaris Intel-platform Project. He not only provides code drops for new features and performance improvements, but also participates in design reviews and code reviews. The projects and RFEs he contributed to include (but not limited to)
> - Intel Microcode Update Support
> - monitor/mwait implementation
> - CPUID support for Penryn and Nehalem
> and many more.
>
> He is an active participant on various OpenSolaris discussion lists, and has become a core member of the Intel-platform project team. Recognition for his contributions is also long overdue. :)

+1

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Contributor nomination for Robert Kasten
Bill Holler wrote:

> I'd like to nominate Robert Kasten for a contributor grant in our community.
>
> Robert has made (and continues to make) tremendous contributions to the perf community sponsored OpenSolaris Intel-platform Project. He provides code drops for performance improvements and participates in design reviews and code reviews. Robert also mentors other engineers working on Intel/Sun performance projects. The projects and RFEs he contributed to include (but not limited to)
> - libc
> - kernel copy primitives
> - CPUID support and bug fixes for Core 2 Duo, Penryn and Nehalem
> and many more.
>
> He is an active participant on various OpenSolaris discussion lists, and has become a core member of the Intel-platform project team. Recognition for his contributions is also long overdue. :)

+1

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] core contributor nomination for Sherry Moore
Eric Saxe wrote:

> I'd like to nominate Sherry Moore for a core contributor grant in our community.
>
> Sherry is the tech lead for the community sponsored "Enable/Enhance Solaris support for Intel Platform" project, and is a Senior Staff Engineer in the Solaris Kernel group. Her contributions to Solaris / OpenSolaris performance have been many, but to list a few:
>
> - Reducing Cache Pollution with Non-temporal Access, which improved write performance by 80-120% on x86 platforms.
> - Enabled ON compilation with Sun Studio 10, which improved build time.
> - Fast Reboot (ongoing): from "reboot" command to banner in 5 seconds.
>
> Other notable accomplishments:
>
> - Intel Microcode Update Support
> - "save-args" on amd64, so that we can have the stack arguments available via the debugger
> - Worked on the port of Solaris to amd64
> - Worked extensively on Solaris Dynamic Reconfiguration (DR) support

+1

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] contributor nomination for Aubrey Li
Eric Saxe wrote:

> I'd like to nominate Aubrey Li for a contributor grant in our community.
>
> Aubrey has made (and continues to make) tremendous contributions to the perf community sponsored OpenSolaris Tesla project through his code and design contributions to the OpenSolaris PowerTop port, deep C-state support, and more. He is an active participant on various OpenSolaris discussion lists, and has become a core member of the Tesla project team. Recognition for his contributions is long overdue. :)

+1

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Core Contributor nominations for Jim Mauro, Adrian Cockcroft
Eric Saxe wrote:

> Great suggestions. Richard is already a core contributor in the community. I would like to nominate Jim Mauro and Adrian Cockcroft for core contributor grants.
>
> Both Adrian and Jim have long standing track records of great contributions in the area of Solaris performance, and in fact have written several books about it. :)

+1

Jonathan

> Brandorr wrote:
>
>> I am not a contributor to perf, but what about:
>>
>> Adrian
>> Jim Mauro
>> Richard McDougal
>>
>> for starters.
>>
>> -Brian
>>
>> P.S. - Please forgive spelling.
>>
>> On Tue, Feb 26, 2008 at 6:46 PM, Eric Saxe <[EMAIL PROTECTED]> wrote:
>>
>>> Greetings,
>>>
>>> You may have noticed that other communities have been (and are) defining roles and grants as outlined in section 3.3 of the OpenSolaris constitution, and the time has arrived for us (the performance community) to do the same. This URL:
>>>
>>> http://www.opensolaris.org/os/community/performance/roles
>>>
>>> ...provides some background, and outlines a nomination and voting process for contributor and core contributor grants in our community.
>>>
>>> Unfortunately, there simply isn't enough time for any new Core Contributor grants to take effect (in terms of voting) for the upcoming OpenSolaris election, but I believe it's still worth kicking off this process now so that deserving folks in our community can be recognized for their ongoing contributions to OpenSolaris performance.
>>>
>>> So with that, I'd like to open up nominations.
>>>
>>> Thanks,
>>> -Eric
>>>
>>> ___
>>> perf-discuss mailing list
>>> perf-discuss@opensolaris.org
>
> ___
> perf-discuss mailing list
> perf-discuss@opensolaris.org

___ perf-discuss mailing list perf-discuss@opensolaris.org
[perf-discuss] CMT project
I would like to get sponsorship from the OpenSolaris performance community to host a CMT project which will focus on observability, performance enhancements, and potentially more in OpenSolaris for Chip Multi-Threaded (CMT) processors (including SMT, CMP, etc.). Specifically, the project will try to do the following in OpenSolaris:

- Further develop a processor group abstraction for capturing the CMT processor sharing relationships of performance relevant hardware components (eg. execution pipeline, cache, etc.)
- Create an interface for determining which CPUs share what performance relevant hardware and the characteristics of these performance relevant hardware components
- Add more performance optimizations to Solaris for CMT (eg. scheduling, I/O, etc.)
- Improve load balancing for maximizing performance and potentially minimizing power consumption
- Create APIs to facilitate performance optimizations for CMT
- Make changes needed to make all of the above work well with virtualization
- Improve upon the existing Solaris CMT enhancements
- Add support for new CMT hardware as needed
- Address any OpenSolaris CMT issues as they arise

In the process of doing all of this, I'm hoping that the project will facilitate collaboration in this area as well as a better understanding and appreciation of CMT and OpenSolaris.

Jonathan

___ perf-discuss mailing list perf-discuss@opensolaris.org
[perf-discuss] NUMA project
I would like to get sponsorship from the OpenSolaris performance community to host a NUMA project. The "Memory Placement Optimization" feature in Solaris has been around since Solaris 9 and has had web pages in the OpenSolaris performance community since it started, before OpenSolaris projects existed formally. I would like to formalize its existence as a project in the performance community since there is more work to be done for NUMA. Specifically, the project will try to do the following in OpenSolaris:

- Make MPO aware of I/O device locality
- Add observability of *kernel* thread and memory placement
- Optimize kernel thread and memory placement to improve performance in general and for NUMA I/O
- Add dynamic lgroup load balancing to improve performance and potentially minimize power consumption
- Enhance MPO to work well with virtualization and vice versa
- Add support for new NUMA machines
- Improve the existing framework as needed
- Address any OpenSolaris NUMA issues as they arise

Also, formalizing NUMA as an OpenSolaris project will enable the project to have space on opensolaris.org for sharing code, ideas, etc., which I'm hoping will facilitate collaboration in this area as well as a better understanding and appreciation of NUMA and OpenSolaris.

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] CMT and NUMA proposals
Thanks for the pointers. Here are my revised proposals.

Jonathan

[EMAIL PROTECTED] wrote: Hi Jonathan, I'm in favor of both of these proposals. However, I think they're incomplete, at least according to the project instantiation guidelines. Would you amend these proposals to include the participants for each project, information about each project's mailing list, and the consolidation that the projects are targeting? The project instantiation policy has additional recommendations that may be of interest here too: http://opensolaris.org/os/community/ogb/policies/project-instantiation.txt Thanks, -j

I would like to get sponsorship from the OpenSolaris performance community to host a CMT project which will focus on observability, performance enhancements, and potentially more in OpenSolaris for Chip Multi-Threaded (CMT) processors (including SMT, CMP, etc.). This project will target the ON consolidation although it may affect others. Specifically, the project will try to do the following in OpenSolaris:

- Further develop a processor group abstraction for capturing the CMT processor sharing relationships of performance relevant hardware components (eg. execution pipeline, cache, etc.)
- Create an interface for determining which CPUs share what performance relevant hardware and the characteristics of these performance relevant hardware components
- Add more performance optimizations to Solaris for CMT (eg. scheduling, I/O, etc.)
- Improve load balancing for maximizing performance and potentially minimizing power consumption
- Create APIs to facilitate performance optimizations for CMT
- Make changes needed to make all of the above work well with virtualization
- Improve upon the existing Solaris CMT enhancements
- Add support for new CMT hardware as needed
- Address any OpenSolaris CMT issues as they arise

In the process of doing all of this, I'm hoping that the project will facilitate collaboration in this area as well as a better understanding and appreciation of CMT and OpenSolaris.

The initial project team members are:
- [EMAIL PROTECTED]
- [EMAIL PROTECTED] (project leader)
- [EMAIL PROTECTED]
- [EMAIL PROTECTED]
- [EMAIL PROTECTED]

I would like to get sponsorship from the OpenSolaris performance community to host a NUMA project. This project will target the Operating System/Networking (ON) consolidation although it may affect others. The "Memory Placement Optimization" feature in Solaris has been around since Solaris 9 and has had web pages in the OpenSolaris performance community since it started, before OpenSolaris projects existed formally (see http://opensolaris.org/os/community/performance/numa/ for more info). I would like to formalize its existence as a project in the performance community since there is more work to be done for NUMA. Specifically, the project will try to do the following in OpenSolaris:

- Make MPO aware of I/O device locality
- Add observability of *kernel* thread and memory placement
- Optimize kernel thread and memory placement to improve performance in general and for NUMA I/O
- Add dynamic lgroup load balancing to improve performance and potentially minimize power consumption
- Enhance MPO to work well with virtualization and vice versa
- Add support for new NUMA machines
- Improve the existing framework as needed
- Address any OpenSolaris NUMA issues as they arise

Also, formalizing NUMA as an OpenSolaris project will enable the project to have space on opensolaris.org for sharing code, ideas, etc., which I'm hoping will facilitate collaboration in this area as well as a better understanding and appreciation of NUMA and OpenSolaris.

The initial project team members are:
- [EMAIL PROTECTED]
- [EMAIL PROTECTED] (project leader)
- [EMAIL PROTECTED]
- [EMAIL PROTECTED]
- [EMAIL PROTECTED]
- [EMAIL PROTECTED]

___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] CMT and NUMA proposals
[EMAIL PROTECTED] wrote: > A plus one to both. > > I realize it may not be feasible now, but long term it would be great if > we could move the sun-internal e-mail lists that exist around these > projects into the open. If we get enough projects using perf-discuss > as a default mailing list it's going to be confusing for everyone. > It is the *goal* for us to get our internal Sun project aliases as well as our development out into opensolaris.org. We can definitely explore making separate OpenSolaris email aliases for CMT and NUMA once the projects are approved > Thanks for taking the time to get these projects proposed and into > OpenSolaris. > Sure. No problem. It is long overdue. Jonathan > On Fri, Aug 15, 2008 at 03:22:10PM -0700, Jonathan Chew wrote: > >> Thanks for the pointers. >> >> Here are my revised proposals. >> >> >> Jonathan >> >> >> [EMAIL PROTECTED] wrote: >> >>> Hi Jonathan, >>> I'm in favor of both of these proposals. However, I think they're >>> incomplete, at least according to the project instantiation guidelines. >>> >>> Would you amend these proposals to include the participants for each >>> project, information about each project's mailing list, and the >>> consolidation that the projects are targeting? >>> >>> The project instantiation has additional reccomendations that may be of >>> interest here too: >>> >>> http://opensolaris.org/os/community/ogb/policies/project-instantiation.txt >>> >>> Thanks, >>> >>> -j >>> >>> >>> >>> > > >> I would like to get sponsorship from the OpenSolaris performance community >> to host a CMT project which will focus on observability, performance >> enhancements, and potentially more in OpenSolaris for Chip Multi-Threaded >> (CMT) processors (including SMT, CMP, etc.). This project will target the >> ON consolidation although it may affect others. >> >> Specifically, the project will try to do the following in OpenSolaris: >> >> - Further develop a processor group abstraction for capturing the CMT >> processor sharing relationhips of performance relevant hardware components >> (eg. execution pipeline, cache, etc.) >> >> - Create an interface for determining which CPUs share what performance >> relevant hardware and the characteristics of these performance relevant >> hardware components >> >> - Add more performance optimizations to Solaris for CMT (eg. scheduling, >> I/O, etc.) >> >> - Improve load balancing for maximizing performance and potentially >> minimizing power consumption >> >> - Create APIs to facilitate performance optimizations for CMT >> >> - Make changes needed to make all of the above work well with virtualization >> >> - Improve upon the existing Solaris CMT enhancements >> >> - Add support for new CMT hardware as needed >> >> - Address any OpenSolaris CMT issues as they arise >> >> >> In the process of doing all of this, I'm hoping that the project will >> facilitate collaboration in this area as well as a better understanding and >> appreciation of CMT and OpenSolaris. >> >> The initial project team members are: >> >> - [EMAIL PROTECTED] >> >> - [EMAIL PROTECTED] (project leader) >> >> - [EMAIL PROTECTED] >> >> - [EMAIL PROTECTED] >> >> - [EMAIL PROTECTED] >> > > >> I would like to get sponsorship from the OpenSolaris performance community >> to host a NUMA project. This project will target the Operating >> System/Networking (ON) consolidation although it may affect others. 
>> >> The "Memory Placement Optimization" feature in Solaris has been around since >> Solaris 9 and has had web pages in the OpenSolaris performance community >> since it started before OpenSolaris projects existed formally (see >> http://opensolaris.org/os/community/performance/numa/ for more info). >> >> I would like to formalize its existence as project in the performance >> community since there is more work to be done for NUMA. Specifically, the >> project will try to do the following in OpenSolaris: >> >> - Make MPO aware of I/O device locality >> >> - Add observability of *kernel* thread and memory placement >> >> - Optimize kernel thread and memory placement to improve performance in >> general and for NUMA I/O >> >> - Add dynamic lgroup load
Re: [perf-discuss] CMT project
Elad Lahav wrote: > Hi Jonathan, > > I am currently looking into different affinity policies on the Niagara > architecture, including the application threads and interrupt handlers. We are interested in affinity scheduling too. > An important aspect of this work is the hierarchy of processing units > (chips, cores, threads) with respect to cache sharing, resource > utilisation, etc. I believe this is quite relevant to what you have in mind. I'm not sure whether you know much about what we have available in OpenSolaris already, but we already have a "Processor Group (PG)" abstraction inside the kernel for keeping track of which CPUs share performance relevant hardware components (e.g. execution pipeline, cache, etc.). The Processor Groups are organized into a hierarchy (tree) where each leaf corresponds to a CPU (e.g. a hardware strand) and its ancestors are groups of processors ordered from the ones that share the most with that CPU (e.g. the CPUs that share an execution pipeline with the leaf CPU) to the ones that share the least (e.g. all the CPUs in the machine). > I'll be more than willing to co-operate with you, or anyone else in > the OpenSolaris community, on this subject. That sounds great! Can you please tell me more about your work or point me at any relevant papers or docs? Jonathan > > > Jonathan Chew wrote: >> I would like to get sponsorship from the OpenSolaris performance >> community to host a CMT project which will focus on observability, >> performance enhancements, and potentially more in OpenSolaris for >> Chip Multi-Threaded (CMT) processors (including SMT, CMP, etc.). >> >> Specifically, the project will try to do the following in OpenSolaris: >> >> - Further develop a processor group abstraction for capturing the CMT >> processor sharing relationships of performance relevant hardware >> components (e.g. execution pipeline, cache, etc.) >> >> - Create an interface for determining which CPUs share what >> performance relevant hardware and the characteristics of these >> performance relevant hardware components >> >> - Add more performance optimizations to Solaris for CMT (e.g. >> scheduling, I/O, etc.) >> >> - Improve load balancing for maximizing performance and potentially >> minimizing power consumption >> >> - Create APIs to facilitate performance optimizations for CMT >> >> - Make changes needed to make all of the above work well with >> virtualization >> >> - Improve upon the existing Solaris CMT enhancements >> >> - Add support for new CMT hardware as needed >> >> - Address any OpenSolaris CMT issues as they arise >> >> >> In the process of doing all of this, I'm hoping that the project will >> facilitate collaboration in this area as well as a better >> understanding and appreciation of CMT and OpenSolaris. >> >> >> Jonathan >> >> >> ___ >> perf-discuss mailing list >> perf-discuss@opensolaris.org > > ___ perf-discuss mailing list perf-discuss@opensolaris.org
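[For readers unfamiliar with the PG hierarchy described above, here is a rough illustrative sketch of its shape. The types and names below are hypothetical and do not reflect the kernel's actual processor group implementation; they only model the idea that each leaf is a CPU (hardware strand) and each ancestor groups the CPUs sharing a particular hardware component.]

#include <stdio.h>

/*
 * Hypothetical, simplified model of a processor-group tree (NOT the real
 * kernel data structures): each node names the hardware its member CPUs
 * share, and walking from a leaf toward the root goes from "shares the
 * most" to "shares the least".
 */
typedef struct pg_node {
    const char      *pg_shared_hw;  /* e.g. "pipeline", "L2 cache", "system" */
    int              pg_ncpus;      /* CPUs covered by this group */
    struct pg_node  *pg_parent;     /* next, less-shared ancestor */
} pg_node_t;

static void
print_lineage(const pg_node_t *leaf)
{
    const pg_node_t *pg;

    for (pg = leaf; pg != NULL; pg = pg->pg_parent)
        printf("%-10s %d CPU(s)\n", pg->pg_shared_hw, pg->pg_ncpus);
}

int
main(void)
{
    /* A two-strand core on a hypothetical 64-strand system. */
    pg_node_t system   = { "system",   64, NULL };
    pg_node_t l2cache  = { "L2 cache",  8, &system };
    pg_node_t pipeline = { "pipeline",  2, &l2cache };
    pg_node_t cpu0     = { "CPU 0",     1, &pipeline };

    print_lineage(&cpu0);
    return 0;
}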
Re: [perf-discuss] Q: Rehome thread to new lgroup?
Manfred Mücke wrote: > Hi, > > I want to enforce migration of thread+allocated memory to another lgroup but > failed for some threads. > Ok. You can give MADV_ACCESS_LWP as the advice to madvise(3C) for the memory that you want to migrate. This advice tells the Solaris kernel that the next thread to touch the memory will use it a lot, and the kernel will migrate the memory near the thread that next touches that memory (i.e. in or near the thread's home lgroup). If you don't want to change your application to do this, you can use pmadvise(1) instead, specifying the same advice for the relevant virtual memory. > I understand from the "Memory and Thread Placement Optimization Developer's > Guide", that there is no hard memory affinity in Solaris, but only the > possibility to define some level of affinity to the home lgroup. Therefore, > prior to any kind of memory migration it seems necessary to move the home > lgroup to the lgroup where the memory should migrate to. The lgrp_home(3LGRP) > man page defines the "home lgroup" of a thread as the "lgroup with the > strongest affinity that the thread can run on". > > A sequence for memory migration was already outlined in > http://www.opensolaris.org/jive/thread.jspa?messageID=14792: > > - Change the thread in question's home lgroup to the CPU where you want the > memory allocated, > - use 'pmadvise -o heap=lwp_access' on the process > - the memory should now get allocated in the newly homed lgrp (check with > 'pmap -L' and lgrpinfo (if the allocation is large enough to notice)). > - rehome the lgrp to another CPU or just bind it. > > I tried, but for some threads, I got surprising results: > > >> ./plgrp -a all 27215 >> > PID/LWPID HOME AFFINITY > 27215/12 5/strong,0-4,6-24/none > > If the home lgroup is defined as the lgroup with the strongest affinity, > isn't the output above somewhat contradictory? > Yes, this means that the home lgroup of the specified thread is 2 even though it has a strong affinity for lgroup 5. >> ./plgrp -F -H 5 27215 >> > PID/LWPID HOME > 27215/12 => 2 > > This thread seems to be resistant to plgrp's attempts to assign a new home > lgroup (option -H) to it. > > This is a test program. For some other threads, I was able to freely rehome > them to any lgroup in the range 1..8 (the leaf lgroups). Does anyone have > further suggestions on what could prevent reassigning the home lgroup? > If the thread is bound to a CPU that isn't in lgroup 5 or bound to a processor set that does not overlap lgroup 5, then the thread cannot be rehomed to lgroup 5 even though it may have a strong affinity for lgroup 5. You should be able to use pbind(1M) and psrset(1M) to see whether your thread is bound. > My system is a Sun Fire X4600 with eight dual-core Opterons. Lgroups 1..8 are > identical (i.e. same amount of memory and CPU resources per lgroup), except > for the CPU IDs: > >> lgrpinfo.pl -a >> > [..] > lgroup 1 (leaf): > Children: none, Parent: 9 > CPUs: 0 1 > Memory: installed 3584 Mb, allocated 985 Mb, free 2599 Mb > Lgroup resources: 1 (CPU); 1 (memory) > Load: 0.00723 > Latency: 51 > [..] > If you're still having problems after trying what I suggested above, you should include *all* of the output from "lgrpinfo -Ta" so we can see where lgroup 5 is in the topology and what it contains. Jonathan ___ perf-discuss mailing list perf-discuss@opensolaris.org
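[To make the madvise(3C) next-touch advice above concrete, here is a minimal, untested sketch. It assumes a Solaris/OpenSolaris system with liblgrp; the buffer size is made up, and in a real application the second touch would come from the thread that should end up owning the memory.]

/* Compile with something like: cc next_touch.c -llgrp */
#include <sys/mman.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    size_t len = 64 * 1024 * 1024;      /* made-up size */
    char *buf;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANON, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* First touch places the pages somewhere (near this thread's home). */
    memset(buf, 0, len);

    /*
     * MADV_ACCESS_LWP: the next thread to touch this range will use it a
     * lot, so migrate the pages to (or near) that thread's home lgroup.
     */
    if (madvise(buf, len, MADV_ACCESS_LWP) != 0)
        perror("madvise");

    /*
     * The thread that should own the memory touches it next.  Here it is
     * the same thread; in a real program it would be a worker thread,
     * possibly after being rehomed or bound.
     */
    memset(buf, 1, len);

    printf("this thread's home lgroup: %d\n",
        (int)lgrp_home(P_LWPID, P_MYID));

    (void) munmap(buf, len);
    return 0;
}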
Re: [perf-discuss] capacity planning tools - status
Stefan Parvu wrote: > >> For papers and books on this, start with the >> Teamquest site and view their webinar >> http://teamquest.com/resources/webinars/display/30/index.htm >> and if you don't mind math, follow up with Neil >> Gunther's site, http://www.perfdynamics.com/ >> > > Right. I have been in the GCaP class taught by Neil Gunther, > and my questions come partly from feedback in the class > and partly from my side: > > - corestat integrated into Solaris. Customers have also asked about > this and why it is not integrated into (Open)Solaris. > corestat isn't integrated into Solaris because it hasn't been productized and has some issues (e.g. it must be run as root, it prevents anyone else from using the CPU performance counters while it is running, it only works for Niagara processors, it may not be sufficient for doing capacity planning and performance analysis, etc.). I am investigating making something better. Jonathan ___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Expiring Core Contributor Grants
Eric Saxe wrote: > johan...@sun.com wrote: > >> I'm going to begin this process by voting in favor of renewing akolb, >> barts, esaxe, johansen, jjc, mpogue, and dp. If andrei or rmc are still >> interested in participating, I would support renewing their grants. I >> simply haven't seen any traffic from them on our lists lately. It's >> possible this is simply an oversight on my part, and if so I apologize. >> >> > +1 as well for akolb, barts, esaxe, johansen, jjc, mpogue, and dp. I > don't know if andrei or rmc are still interested in participating. If > they are, I would support renewal as well. > +1. Me too. Jonathan ___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Contributor grant nomination for Chad Mynhier
+1 Jonathan Eric Saxe wrote: > I'd like to put forth a contributer grant nomination for Chad Mynhier, > for his recent contributions in the area of performance observability, > including contributing fixes for: > > PSARC 2007/598 ptime(1) Improvements > 6234106 ptime should report microseconds or nanoseconds instead of just > milliseconds > 4532599 ptime should report resource usage statistics from /proc for running > processes > 6749441 intrstat(1M) shows zeroed values after suspend/resume > 6786517 /usr/demo/dtrace/qtime.d tracks enqueue/dequeue times incorrectly > 6738982 Representative thread after DTrace stop() action is incorrect > 6737974 DTrace probe module globbing doesn't work > 6730287 tst/common/printf/tst.str.d needs to be updated for ZFS boot > > > He quite clearly has been making significant contributions, and IMHO > recognition for his contributions is well deserved. > (Thank you Chad) > > -Eric > ___ > perf-discuss mailing list > perf-discuss@opensolaris.org > ___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] Expiring Core Contributor Grants
Eric Saxe wrote: Eric Saxe wrote: Also, in the past I've acted as de facto facilitator for this community...but we've never had a vote around that (at least not one that I remember). Are you interested Krister? If so, I nominate you. :) Krister has accepted my nomination for community facilitator. All those in favor?... +1 Jonathan ___ perf-discuss mailing list perf-discuss@opensolaris.org
Re: [perf-discuss] V890 with US-IV+ : which core should interrupts be delivered to?
johan...@sun.com wrote: On Fri, Jun 19, 2009 at 05:52:02PM +0200, Nils Goroll wrote: I am trying to reduce context switches on a V890 with 16 cores by delegating one core to interrupt handling (by setting all others to nointr using psradm). The best way to do this is to create processor sets, with the set that handles OS tasks containing processors that handle interrupts and the other set containing the nointr processors. If you're really concerned about context switching, you might also benefit from processor binding, which only lets a process run on a particular CPU. Lately, the scheduler guys have made a bunch of improvements that allow processes to be more sticky, though. (Less likely to migrate from one CPU to another, but that is different from a context switch.) What version of the OS are you running? Does the hardware design of this machine imply that (a) particular core(s) is best suited for this task? Almost any multiprocessor machine that has NUMA characteristics will benefit from having the interrupts placed on the CPU that's closest to the hardware that connects to the bus that the interrupts are coming from. From a discussion long ago, I remember the v890 has some NUMA characteristics, but I don't honestly know if we treat it like a NUMA machine. Jonathan Chew, who's lurking on the list somewhere, might have more details. Lurking?! Who me? I'm innocent. ;-) I think that Paul already replied and explained that the V890 doesn't have NUMA I/O, so it doesn't matter which CPU is used for the interrupt with respect to the device for NUMA. However, it may matter where you assign the interrupt for NUMA memory locality and CMT cache sharing (since each UltraSPARC IV+ processor chip has two single-stranded cores that share a cache). If there are (user or kernel) threads that are involved in doing I/O to/from the device, they may benefit from sharing (or not sharing) cache and/or local memory with the interrupt thread and the CPU where the interrupt is assigned. For example, the Crossbow poll thread will benefit from being bound to the same CPU as the interrupt or to the other CPU sharing cache with the interrupt thread/CPU, and the Crossbow soft ring threads will at least benefit from sharing local memory with the interrupt thread/CPU if you're doing network I/O on OpenSolaris. You should also be careful about which threads are placed on the CPU sharing cache with the interrupt CPU and thread because the threads may interfere with each other's cache utilization. Your user application threads may benefit from running close to or far away from the interrupt thread/CPU depending on what the threads do. Thus, how you place your interrupt and your threads and whether you use processor sets, processor binding, etc. will depend on your workload and what you are trying to optimize. Hope that helps. Jonathan PS Now I'll go back to lurking/sleeping. ;-) ___ perf-discuss mailing list perf-discuss@opensolaris.org
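[As a concrete, untested illustration of the thread placement discussed above, the sketch below binds the calling thread to a chosen CPU with processor_bind(2), e.g. the CPU taking the NIC's interrupts or one sharing a cache with it. The CPU id is a made-up example; on a real system it would come from psrinfo/intrstat output.]

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int
main(void)
{
    processorid_t target = 16;  /* hypothetical interrupt (or cache-sharing) CPU */
    processorid_t previous;

    /* Bind the calling LWP to the chosen CPU. */
    if (processor_bind(P_LWPID, P_MYID, target, &previous) != 0) {
        perror("processor_bind");
        return 1;
    }
    printf("bound to CPU %d (previous binding: %d)\n",
        (int)target, (int)previous);

    /* ... run the I/O-intensive work here ... */

    /* Clear the binding when finished. */
    (void) processor_bind(P_LWPID, P_MYID, PBIND_NONE, NULL);
    return 0;
}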
Re: [perf-discuss] madvise() and "heap" memory
johan...@sun.com wrote: On Thu, Jul 09, 2009 at 12:18:17PM -0500, Bob Friesenhahn wrote: Do madvise() options like MADV_ACCESS_LWP and MADV_ACCESS_MANY work on memory allocated via malloc()? If MADV_ACCESS_LWP is specified and malloc() hands out heap memory which has been used before (e.g. by some other LWP), is the memory data in the range moved (physically copied and remapped to the same address via the VM subsystem) closest to the CPU core which next touches it? Currently, segvn is the only segment driver with support for madvise operations; This part isn't exactly right. madvise(3C) is also supported by the segspt virtual memory segment driver for Intimate Shared Memory (ISM) along with segvn. Jonathan ___ perf-discuss mailing list perf-discuss@opensolaris.org
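[Building on Jonathan's correction, here is a rough, untested sketch of applying madvise(3C) advice to an ISM segment, which is backed by segspt rather than segvn. The segment size is made up; MADV_ACCESS_MANY tells the kernel that many threads on many CPUs will access the range, so spreading it across lgroups is the preferred placement.]

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <stdio.h>

int
main(void)
{
    size_t len = 256 * 1024 * 1024;     /* made-up segment size */
    int shmid;
    char *addr;

    if ((shmid = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600)) == -1) {
        perror("shmget");
        return 1;
    }

    /* SHM_SHARE_MMU requests ISM, handled by the segspt segment driver. */
    addr = shmat(shmid, NULL, SHM_SHARE_MMU);
    if (addr == (void *)-1) {
        perror("shmat");
        return 1;
    }

    /* Many threads/CPUs will use this range: keep it spread across lgroups. */
    if (madvise(addr, len, MADV_ACCESS_MANY) != 0)
        perror("madvise");

    (void) shmdt(addr);
    (void) shmctl(shmid, IPC_RMID, NULL);
    return 0;
}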
Re: [perf-discuss] NUMAtop for OpenSolaris
There has been a lot of discussion on this since it was proposed last month. I want to know what is currently being proposed given the lengthy discussion. Can someone please summarize what the current proposal is now? Jonathan Li, Aubrey wrote: johansen wrote: On Tue, Jan 12, 2010 at 02:20:02PM +0800, zhihui Chen wrote: Applications can be categorized into CPU-sensitive, memory-sensitive, and I/O-sensitive. My concern here is that unless the customer knows how to determine whether his application is CPU, memory, or I/O sensitive, it's going to be hard to use the tools well. "sysload" in NUMAtop can tell the customer if the app is CPU sensitive. "Last Level Cache Miss per Instruction" will be added into NUMAtop to determine if the app is memory sensitive. When the CPU triggers an LLC miss, the data can come from local memory, a remote cache, or memory in a remote node. Generally, the latency for local memory will be close to the latency for a remote cache, while the latency for remote memory should be much higher. This isn't universally true. On some SPARC platforms, it actually takes longer to read a line out of a remote CPU's cache than it does to access the memory on a remote system board. On a large system, many CPUs may have this address in their cache, and they all need to know that it has become owned by the reading CPU. If you're going to make this tool work on SPARC, it won't always be safe to make this assumption. -j Thanks for pointing this issue out. We are not SPARC experts and I think the SPARC NUMAtop design is not in our phase I design, :) We hope a SPARC expert like you or another expert can take SPARC into account and extend this tool onto SPARC platforms. On systems where some remote memory accesses take longer than others, this could be especially useful. Instead of just reporting the number of remote accesses, it would be useful to report the amount of time the application spent accessing that memory. Then it's possible for the user to figure out what kind of performance win they might achieve by making the memory accesses local. As for the metric of NUMAtop, the memory access latency is a good idea. But the absolute amount is not a good indicator for NUMAtop. This amount will be different on different platforms; a specific amount may be good on one platform while it's bad on another one. It's hard to tell the customer what data is good. So we will introduce a ratio into NUMAtop: "LLC Latency ratio" = "the actual memory access latency" / "calibrated local memory access latency". We assume that different node hops have different memory access latencies, with longer-distance hops having higher latency. This ratio will be near 1 if most of the memory accesses of the application are to local memory. So, as a conclusion, here we propose the metrics of NUMAtop: 1) sysload - CPU sensitive 2) LLC Miss per Instruction - memory sensitive 3) LLC Latency ratio - memory locality 4) the percentage of LMA/RMA accesses out of total memory accesses - 4.1) LMA/(total memory access)% - 4.2) RMA/(total memory access)% - ... 4.2) could be separated into different percentages for different NUMA node hops. These parameters are not platform specific and are probably common enough to extend to SPARC platforms. Looking forward to your thoughts. BTW: Do we still need one more +1 vote for the NUMAtop project? Thanks, -Aubrey ___ perf-discuss mailing list perf-discuss@opensolaris.org
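[As a back-of-the-envelope illustration of the proposed metrics (not part of the actual NUMAtop code), the sketch below computes LLC misses per instruction, the LLC latency ratio, and the LMA/RMA percentages from made-up counter values; the real tool would obtain these from the CPU performance counters via libcpc/DTrace.]

#include <stdio.h>

int
main(void)
{
    /* Hypothetical samples for one process over one interval. */
    double instructions = 4.0e9;
    double llc_misses   = 2.0e7;
    double lma          = 1.2e7;   /* misses satisfied from local memory */
    double rma          = 0.8e7;   /* misses satisfied from remote memory */
    double avg_miss_lat = 240.0;   /* ns, average latency of an LLC miss */
    double local_lat    = 90.0;    /* ns, calibrated local memory latency */

    printf("LLC misses per instruction: %.6f\n", llc_misses / instructions);
    printf("LLC latency ratio:          %.2f (close to 1 means mostly local)\n",
        avg_miss_lat / local_lat);
    printf("LMA%%: %.1f  RMA%%: %.1f\n",
        100.0 * lma / (lma + rma), 100.0 * rma / (lma + rma));
    return 0;
}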
Re: [perf-discuss] NUMAtop for OpenSolaris
Li, Aubrey wrote: Hi Jonathan, Nice to see you have interest. We are discussing the metrics of NUMAtop, and so far the proposal is that the following parameters will be reported by NUMAtop as the metrics: 1) sysload - CPU sensitive 2) LLC Miss per Instruction - memory sensitive 3) LLC Latency ratio - memory locality 4) the percentage of LMA/RMA accesses out of total memory accesses - 4.1) LMA/(total memory access)% - 4.2) RMA/(total memory access)% 4.2) could be separated into different percentages for different NUMA node hops. These parameters are not platform specific and are probably common enough to extend to SPARC platforms. Thanks for summarizing the metrics. However, I wanted to see a summary of the overall NUMAtop proposal given the feedback that you have gotten, so I can understand what the project is proposing to do now that you have gotten feedback. Then I can decide whether I have anything to add and whether I want to approve it as is or not. From the email thread so far, it looks as though Krish gave a very brief description of the project, Jin Yao explained some phases for the project, and you have listed some proposed metrics for the tool. Has any of this changed given the feedback that you have gotten? Can you please summarize your latest project proposal including the description, phases, metrics, and anything else that is useful for understanding what the project is proposing to do? Jonathan Jonathan Chew wrote: There has been a lot of discussion on this since it was proposed last month. I want to know what is currently being proposed given the lengthy discussion. Can someone please summarize what the current proposal is now? Jonathan Li, Aubrey wrote: johansen wrote: On Tue, Jan 12, 2010 at 02:20:02PM +0800, zhihui Chen wrote: Applications can be categorized into CPU-sensitive, memory-sensitive, and I/O-sensitive. My concern here is that unless the customer knows how to determine whether his application is CPU, memory, or I/O sensitive, it's going to be hard to use the tools well. "sysload" in NUMAtop can tell the customer if the app is CPU sensitive. "Last Level Cache Miss per Instruction" will be added into NUMAtop to determine if the app is memory sensitive. When the CPU triggers an LLC miss, the data can come from local memory, a remote cache, or memory in a remote node. Generally, the latency for local memory will be close to the latency for a remote cache, while the latency for remote memory should be much higher. This isn't universally true. On some SPARC platforms, it actually takes longer to read a line out of a remote CPU's cache than it does to access the memory on a remote system board. On a large system, many CPUs may have this address in their cache, and they all need to know that it has become owned by the reading CPU. If you're going to make this tool work on SPARC, it won't always be safe to make this assumption. -j Thanks for pointing this issue out. We are not SPARC experts and I think the SPARC NUMAtop design is not in our phase I design, :) We hope a SPARC expert like you or another expert can take SPARC into account and extend this tool onto SPARC platforms. On systems where some remote memory accesses take longer than others, this could be especially useful. Instead of just reporting the number of remote accesses, it would be useful to report the amount of time the application spent accessing that memory. Then it's possible for the user to figure out what kind of performance win they might achieve by making the memory accesses local.
As for the metric of NUMAtop, the memory access latency is a good idea. But the absolute amount is not a good indicator for NUMAtop. This amount will be different on different platforms; a specific amount may be good on one platform while it's bad on another one. It's hard to tell the customer what data is good. So we will introduce a ratio into NUMAtop: "LLC Latency ratio" = "the actual memory access latency" / "calibrated local memory access latency". We assume that different node hops have different memory access latencies, with longer-distance hops having higher latency. This ratio will be near 1 if most of the memory accesses of the application are to local memory. So, as a conclusion, here we propose the metrics of NUMAtop: 1) sysload - CPU sensitive 2) LLC Miss per Instruction - memory sensitive 3) LLC Latency ratio - memory locality 4) the percentage of LMA/RMA accesses out of total memory accesses - 4.1) LMA/(to
Re: [perf-discuss] NUMAtop for OpenSolaris
Li, Aubrey wrote: Hi Jonathan, Do you have any comments about this proposal? Thanks, -Aubrey Li, Aubrey wrote: Jonathan Chew wrote: Thanks for summarizing the metrics. However, I wanted to see a summary of the overall NUMAtop proposal given the feedback that you have gotten, so I can understand what the project is proposing to do now that you have gotten feedback. Then I can decide whether I have anything to add and whether I want to approve it as is or not. From the email thread so far, it looks as though Krish gave a very brief description of the project, Jin Yao explained some phases for the project, and you have listed some proposed metrics for the tool. Has any of this changed given the feedback that you have gotten? Can you please summarize your latest project proposal including the description, phases, metrics, and anything else that is useful for understanding what the project is proposing to do? Jonathan NUMAtop focuses on NUMA-related characteristics; it is a tool to help developers identify memory locality issues on NUMA systems. The tool is top-like: it shows the top N processes in the system and their memory locality, with the processes that have the worst memory locality at the top of the list, and it can attach to a process to show the threads' memory locality in the same top style. The information NUMAtop reports is collected from memory-related hardware counters and the libcpc DTrace provider. Some of these counters are already supported in kcpc and libcpc, while some of them are not. Intel Nehalem-based and next-generation platforms provide a memory load latency event, which is an important input for NUMAtop and requires a Solaris implementation of the PEBS framework. The following proposed metrics will be one part of our phase I job. Applications can be classified as CPU-sensitive, memory-sensitive, or I/O-sensitive. An I/O-sensitive application can be identified by low CPU utilization. A memory-sensitive application should be a CPU-sensitive application with high CPU utilization. Can you please explain what you mean by CPU, memory, and I/O sensitive? What do these have to do with memory locality? So we have the following metrics: 1) sysload - CPU sensitive What do you mean by "sysload"? 2) LLC Miss per Instruction - memory sensitive So, is a memory-sensitive thread one that has a low or high LLC miss per instruction? After we figure out that the application is memory-sensitive, we'll check the memory locality metrics to see what is causing the performance regression. How will you do that? Do you mean that you will try to use the four metrics that you have listed here to determine the cause? 3) LLC Latency Ratio (Average Latency for LLC Miss/Local Memory Access Latency) Will the latency for each LLC miss be measured then? Is the local memory latency the *ideal* local memory latency when the system is unloaded or the *current* local memory latency which may be higher than the ideal because of load? 4) Source distribution for LLC miss: - 4.1) LMA/(Total LLC Miss Retired)% - 4.2) RMA/(Total LLC Miss Retired)% Will these ratios be given for each NUMA node, the whole system, or both? Here, 4.2) could be separated into different percentages for different NUMA node hops. Do you mean that the total RMA will be broken down into the percentage of remote memory accesses to each NUMA node from a given NUMA node? NUMAtop should have a useful report to show how effectively the application is using local memory.
I think that someone already pointed out that you don't seem to mention anything about where the thread runs as part of your proposal, even though that is pretty important in figuring out how effectively a thread is using local memory. The thread won't use local memory very effectively if it never runs on CPUs where its local memory lives. Also, the memory allocation policy may matter too. For example, a thread may access remote memory a lot if it is accessing shared memory, because the default memory allocation policy for shared memory is to spread it out by allocating it randomly across lgroups. We need the PEBS framework to implement the metrics of NUMAtop, and we need an MPO sponsor and a libcpc DTrace provider sponsor to figure out where the application is not effective and why. Ok. A suggestion for a better memory placement strategy is also a valuable goal of NUMAtop. How are you proposing to do that? Jonathan ___ perf-discuss mailing list perf-discuss@opensolaris.org
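[For anyone who wants to poke at the lgroup side of this themselves, here is an untested sketch that walks the lgroup hierarchy with liblgrp(3LIB) and prints the CPUs and installed memory in each lgroup, roughly the information the lgrpinfo observability tool reports. The array sizes are arbitrary; compile with -llgrp.]

#include <sys/lgrp_user.h>
#include <stdio.h>

#define MAX_CPUS   512
#define MAX_LGRPS   64

static void
walk(lgrp_cookie_t cookie, lgrp_id_t lgrp, int depth)
{
    processorid_t cpus[MAX_CPUS];
    lgrp_id_t children[MAX_LGRPS];
    int ncpus, nchildren, i;
    long long mb;

    /* CPUs and memory contained directly in this lgroup. */
    ncpus = lgrp_cpus(cookie, lgrp, cpus, MAX_CPUS, LGRP_CONTENT_DIRECT);
    mb = (long long)(lgrp_mem_size(cookie, lgrp, LGRP_MEM_SZ_INSTALLED,
        LGRP_CONTENT_DIRECT) >> 20);

    printf("%*slgroup %d: %d CPU(s), %lld MB installed\n",
        depth * 2, "", (int)lgrp, ncpus > 0 ? ncpus : 0, mb);

    /* Recurse into the child lgroups. */
    nchildren = lgrp_children(cookie, lgrp, children, MAX_LGRPS);
    for (i = 0; i < nchildren; i++)
        walk(cookie, children[i], depth + 1);
}

int
main(void)
{
    lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

    if (cookie == LGRP_COOKIE_NONE) {
        perror("lgrp_init");
        return 1;
    }
    walk(cookie, lgrp_root(cookie), 0);
    (void) lgrp_fini(cookie);
    return 0;
}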