On Wed, Jan 6, 2010 at 5:53 AM, <johan...@sun.com> wrote:
> On Tue, Jan 05, 2010 at 04:27:03PM +0800, Li, Aubrey wrote:
>> >I'm concerned that unless we're able to demonstrate some causal
>> >relationship between RMA and reduced performance, it will be hard for
>> >customers to use the tools to diagnose problems. Imagine a situation
>> >where the application is running slowly and RMA is not the cause, but
>> >the tool shows high RMA. In such a case NUMAtop could add to the
>> >difficulty of diagnosing the root cause of the customer's problem.
>>
>> If an application has reduced performance and high RMA, high RMA at least
>> should be one part of the cause, unless we can tell the customer the app
>> has to allocate memory from a remote node.
>
> I don't think it's necessarily safe to conclude that. If an application
> is memory-bound and has RMA, I agree. However, if the application is
> CPU or I/O bound, the performance problem might not be due to RMA --
> especially in the I/O case.
>
> To use lockstat as an example: some customers run this tool, notice
> numbers that look high to them, and then escalate. In many of these
> cases, there's not actually a scalability problem, but the tool can make
> it easy to conclude that one might exist. I'm looking to avoid that, if
> we possibly can.

Applications can be categorized as CPU-sensitive, memory-sensitive, or
I/O-sensitive. A CPU-intensive application can easily be identified by
its CPU utilization, but we cannot assume that a CPU-intensive
application is also memory-intensive, because its data may be served
from the L1/L2/L3 caches. An application with a small data set but a big
loop may perform very few memory accesses. On the other hand, we can
assume that a memory-intensive application is also CPU-intensive,
because all memory accesses (excluding DMA triggered by devices) are
caused directly or indirectly by instructions.

Based on this, NUMAtop can measure LLC Miss/Instruction (Last Level
Cache misses per instruction). If an application is CPU-sensitive and
has a high LLC Miss/Instruction (for example 0.2), we can identify it as
memory-sensitive, and high RMA should then have an impact on its
performance.

An I/O-sensitive application can be identified by low CPU utilization
(because the application waits for I/O requests) or by a high CPI
(cycles per instruction); this can be done with other tools such as
prstat/iostat. Of course, an I/O-intensive application may trigger many
DMA accesses from devices to memory, and this kind of memory access can
also be remote, depending on the distance between the memory and the
IOH. We can call this kind of NUMA "I/O NUMA". Currently this feature is
not included in NUMAtop, but we may add it in phase II.

>> >We should also exercise care in choosing the type of metric that we
>> >report, as some turn out to be meaningless. Percent of CPU spent
>> >waiting for I/O is a good example of a meaningless metric.
>> >
>>
>> The metric is important to show the report.
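As an aside, the classification heuristic I describe above can be sketched roughly as follows. This is only an illustration of the reasoning, not NUMAtop's implementation: the 50% utilization cutoff and the 0.2 LLC Miss/Instruction threshold are assumed example values, and in reality the counters would come from the platform via libcpc.

```python
# Illustrative sketch of the workload-classification heuristic.
# Thresholds are assumed example values, not NUMAtop defaults.

def classify(cpu_util, llc_misses, instructions):
    """Classify a workload from CPU utilization and hardware counters."""
    if cpu_util < 0.5:
        # Low CPU utilization: the application is mostly waiting,
        # typically on I/O (confirm with prstat/iostat).
        return "io-sensitive"
    miss_per_insn = llc_misses / instructions
    if miss_per_insn >= 0.2:
        # CPU-busy and missing the last-level cache often:
        # the workload is bound by memory accesses.
        return "memory-sensitive"
    # CPU-busy but served mostly from the L1/L2/L3 caches
    # (e.g. a small data set in a big loop).
    return "cpu-sensitive"

print(classify(0.9, 2_000_000, 8_000_000))  # 0.25 misses/instruction
print(classify(0.9, 100_000, 8_000_000))    # cache-friendly loop
print(classify(0.1, 0, 1))                  # mostly idle / waiting
```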
>> Now we are using the RMA#
>> as the ordering rule. In order to show how effective the application is
>> using the memory, we probably could use RMA#, LMA# and sysload together.
>
> Again to use lockstat, but this time as a positive example. It
> initially used the number of spins when busy-waiting for a lock. This
> makes it hard for the user to determine how much time is actually being
> lost to spinning for a lock. The tool was recently changed to report
> the amount of time spent spinning, which is easier to understand and a
> more meaningful measurement.
>
> On systems where some remote memory accesses take longer than others,
> this could be especially useful. Instead of just reporting the number
> of remote accesses, it would be useful to report the amount of time the
> application spent accessing that memory. Then it's possible for the
> user to figure out what kind of performance win they might achieve by
> making the memory accesses local.

When the CPU triggers an LLC miss, the data can come from local memory,
a remote cache, or memory on a remote node. Generally, the latency of a
local memory access will be close to the latency of a remote cache
access, while the latency of a remote memory access should be much
higher. NUMAtop can learn the local memory access latency at its
startup, then measure the LLC miss latency of an application and compute
the ratio (measured LLC miss latency) / (local memory access latency). I
will call this ratio the LLC latency ratio. If the LLC latency ratio is
near 1, it means that most LLC misses only cause accesses to local
memory or a remote cache. If the ratio is high, such as 3, it means that
many LLC misses lead to remote memory accesses, so the application
should have some tuning or optimization opportunity.

On a complicated system, the distances between nodes may differ: there
may be 0 hops (local memory), 1 hop, 2 hops, and so on. When an LLC miss
happens, the data can come from a node 0 hops, 1 hop, 2 hops, ... away.
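The LLC latency ratio computation above is simple enough to sketch. The latency figures below are made-up illustrations; in NUMAtop the measured latencies would come from the platform's hardware counters via libcpc, and the local baseline from the startup calibration described above.

```python
# Sketch of the LLC latency ratio described above.
# All latency numbers here are made-up examples, not measured values.

def llc_latency_ratio(measured_llc_miss_latency_ns, local_latency_ns):
    """Ratio of observed LLC miss latency to the local-access baseline.

    NUMAtop is described as calibrating the local latency at startup;
    in this sketch it is simply passed in as a parameter.
    """
    return measured_llc_miss_latency_ns / local_latency_ns

LOCAL_LATENCY_NS = 100  # hypothetical calibration result

# Near 1: misses are served by local memory or a remote cache.
print(llc_latency_ratio(110, LOCAL_LATENCY_NS))
# Around 3: many misses go to remote memory -- a tuning opportunity.
print(llc_latency_ratio(300, LOCAL_LATENCY_NS))
```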
NUMAtop can report the percentage of LLC-miss data fetched from each hop
distance, such as the following:

  Tid  Home  Sysload  LLC Miss/Instruction  LLC Latency ratio  0-Hop  1-Hop  2-Hop
  1    0     80%      0.2                   2.3                30%    50%    20%

When the user sees a high sysload and a high LLC Miss/Instruction, he
knows that the application is memory-bound. If he then also sees a high
LLC latency ratio, he knows that there are many remote memory accesses,
so he can try to reduce them by tuning or by optimizing the application.

Of course, all the measurements of LLC miss latency and memory access
hops need support from the platform and from libcpc, so NUMAtop depends
on these technologies.

Thanks
Zhihui

> -j
> _______________________________________________
> perf-discuss mailing list
> perf-discuss@opensolaris.org

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org