On Wed, Jan 6, 2010 at 5:53 AM, <johan...@sun.com> wrote:
> On Tue, Jan 05, 2010 at 04:27:03PM +0800, Li, Aubrey wrote:
>> >I'm concerned that unless we're able to demonstrate some causal
>> >relationship between RMA and reduced performance, it will be hard for
>> >customers to use the tools to diagnose problems. Imagine a situation
>> >where the application is running slowly and RMA is not the cause, but
>> >the tool shows high RMA. In such a case NUMAtop could add to the
>> >difficulty of diagnosing the root cause of the customer's problem.
>>
>> If an application has reduced performance and high RMA, high RMA at least
>> should be one part of the cause, unless we can tell the customer the app
>> has to allocate memory from a remote node.
>
> I don't think it's necessarily safe to conclude that. If an application
> is memory-bound and has RMA, I agree. However, if the application is
> CPU or I/O bound, the performance problem might not be due to RMA --
> especially in the I/O case.
>
> To use lockstat as an example: some customers run this tool, notice
> numbers that look high to them, and then escalate. In many of these
> cases, there's not actually a scalability problem, but the tool can make
> it easy to conclude that one might exist. I'm looking to avoid that, if
> we possibly can.

Applications can be categorized as CPU-sensitive, memory-sensitive, or
I/O-sensitive. A CPU-intensive application can easily be identified by
its CPU utilization, but we cannot assume that a CPU-intensive
application is also memory-intensive, because its data may be served
from the L1/L2/L3 caches. An application with a small data set but a big
loop may perform very few memory accesses. On the other hand, we can
assume that a memory-intensive application is also CPU-intensive,
because all memory accesses (excluding DMA triggered by devices) are
caused directly or indirectly by instructions.

Based on this, NUMAtop can measure LLC Miss/Instruction (Last Level
Cache misses per instruction). If an application is CPU-sensitive and
has a high LLC Miss/Instruction (for example 0.2), we can identify it as
memory-sensitive, and high RMA should then have an impact on its
performance.

An I/O-sensitive application can be identified by low CPU utilization
(because the application waits for I/O requests) or by a high CPI
(cycles per instruction); this can be done with other tools such as
prstat/iostat. Of course, an I/O-intensive application may trigger many
DMA accesses from devices to memory, and this kind of memory access can
also be remote, depending on the distance between the memory and the
IOH. We can call this kind of NUMA "I/O NUMA". Currently this feature is
not included in NUMAtop, but we may add it in phase II.

>> >We should also exercise care in choosing the type of metric that we
>> >report, as some turn out to be meaningless. Percent of CPU spent
>> >waiting for I/O is a good example of a meaningless metric.
>> >
>>
>> The metric is important to show the report.
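As an aside, the classification heuristic I describe above can be sketched roughly as follows. This is only an illustration of the reasoning, not NUMAtop's implementation: the 50% utilization cutoff and the 0.2 LLC Miss/Instruction threshold are assumed example values, and in reality the counters would come from the platform via libcpc.

```python
# Illustrative sketch of the workload-classification heuristic.
# Thresholds are assumed example values, not NUMAtop defaults.

def classify(cpu_util, llc_misses, instructions):
    """Classify a workload from CPU utilization and hardware counters."""
    if cpu_util < 0.5:
        # Low CPU utilization: the application is mostly waiting,
        # typically on I/O (confirm with prstat/iostat).
        return "io-sensitive"
    miss_per_insn = llc_misses / instructions
    if miss_per_insn >= 0.2:
        # CPU-busy and missing the last-level cache often:
        # the workload is bound by memory accesses.
        return "memory-sensitive"
    # CPU-busy but served mostly from the L1/L2/L3 caches
    # (e.g. a small data set in a big loop).
    return "cpu-sensitive"

print(classify(0.9, 2_000_000, 8_000_000))  # 0.25 misses/instruction
print(classify(0.9, 100_000, 8_000_000))    # cache-friendly loop
print(classify(0.1, 0, 1))                  # mostly idle / waiting
```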
>> Now we are using the RMA#
>> as the ordering rule. In order to show how effective the application is
>> using the memory, we probably could use RMA#, LMA# and sysload together.
>
> Again to use lockstat, but this time as a positive example. It
> initially used the number of spins when busy-waiting for a lock. This
> makes it hard for the user to determine how much time is actually being
> lost to spinning for a lock. The tool was recently changed to report
> the amount of time spent spinning, which is easier to understand and a
> more meaningful measurement.
>
> On systems where some remote memory accesses take longer than others,
> this could be especially useful. Instead of just reporting the number
> of remote accesses, it would be useful to report the amount of time the
> application spent accessing that memory. Then it's possible for the
> user to figure out what kind of performance win they might achieve by
> making the memory accesses local.

When the CPU triggers an LLC miss, the data can come from local memory,
a remote cache, or memory on a remote node. Generally, the latency of a
local memory access will be close to the latency of a remote cache
access, while the latency of a remote memory access should be much
higher. NUMAtop can learn the local memory access latency at its
startup, then measure the LLC miss latency of an application and compute
the ratio (measured LLC miss latency) / (local memory access latency). I
will call this ratio the LLC latency ratio. If the LLC latency ratio is
near 1, it means that most LLC misses only cause accesses to local
memory or a remote cache. If the ratio is high, such as 3, it means that
many LLC misses lead to remote memory accesses, so the application
should have some tuning or optimization opportunity.

On a complicated system, the distances between nodes may differ: there
may be 0 hops (local memory), 1 hop, 2 hops, and so on. When an LLC miss
happens, the data can come from a node 0 hops, 1 hop, 2 hops, ... away.
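The LLC latency ratio computation above is simple enough to sketch. The latency figures below are made-up illustrations; in NUMAtop the measured latencies would come from the platform's hardware counters via libcpc, and the local baseline from the startup calibration described above.

```python
# Sketch of the LLC latency ratio described above.
# All latency numbers here are made-up examples, not measured values.

def llc_latency_ratio(measured_llc_miss_latency_ns, local_latency_ns):
    """Ratio of observed LLC miss latency to the local-access baseline.

    NUMAtop is described as calibrating the local latency at startup;
    in this sketch it is simply passed in as a parameter.
    """
    return measured_llc_miss_latency_ns / local_latency_ns

LOCAL_LATENCY_NS = 100  # hypothetical calibration result

# Near 1: misses are served by local memory or a remote cache.
print(llc_latency_ratio(110, LOCAL_LATENCY_NS))
# Around 3: many misses go to remote memory -- a tuning opportunity.
print(llc_latency_ratio(300, LOCAL_LATENCY_NS))
```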
NUMAtop can report the percentage of LLC-miss data fetched from each hop
distance, such as the following:

  Tid  Home  Sysload  LLC Miss/Instruction  LLC Latency ratio  0-Hop  1-Hop  2-Hop
  1    0     80%      0.2                   2.3                30%    50%    20%

When the user sees a high sysload and a high LLC Miss/Instruction, he
knows that the application is memory-bound. If he then also sees a high
LLC latency ratio, he knows that there are many remote memory accesses,
so he can try to reduce them by tuning or by optimizing the application.

Of course, all the measurements of LLC miss latency and memory access
hops need support from the platform and from libcpc, so NUMAtop depends
on these technologies.

Thanks
Zhihui

> -j
> _______________________________________________
> perf-discuss mailing list
> perf-discuss@opensolaris.org

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org