David McDaniel wrote:
Hello. I am investigating performance characteristics of some
processes which, taken as a group, don't perform as well as hoped. Basic
characteristics: These processes are 32-bit apps, multithreaded to a
greater or lesser degree, and share a fairly large (~2GB) dataset via
mmap'd files. The processes are compiled on Studio 8 with no
optimization whatsoever.
Unless you want full symbolic debugging (line breakpoints and all),
you should compile with optimization.
You could easily see a 2x performance improvement,
and if you're lucky, 10x or even 100x is not unheard of,
when you compare non-optimized code vs optimized code.
The representative target machine is a 1280
with from 4-12 cpus. Using collect() from studio10, I use hardware
counters to collect the instruction count and IC_miss numbers for a
representative interval. I observe that the ratio between the two is
~9:1, i.e. over an interval in which ~54 million instructions are
completed, approximately 6 million IC_misses are reported.
That's an extremely high I$ miss rate - it's actually almost impossible
to make the I$ miss that often. So something's fishy here.
Similarly,
capturing clock cycles vs instructions, I see a net instruction rate
of about 233 million instructions/sec on 900 MHz cpus. Since these
guys have two integer pipelines that's pretty poor.
If I$ miss rate is that high, it's no wonder IPC is that low.
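To put a number on that, here is a rough sketch of the IPC arithmetic using only the figures quoted above (233 million instructions/sec at 900 MHz). The comparison against "peak" is an assumption made for illustration: it takes one instruction per integer pipeline per cycle as the ceiling, which is not a statement of the chip's actual issue width.

```python
# Figures quoted in the thread.
instructions_per_sec = 233e6   # measured net instruction rate
clock_hz = 900e6               # 900 MHz CPU
integer_pipelines = 2          # as mentioned in the thread

# Instructions per cycle, and its inverse (cycles per instruction).
ipc = instructions_per_sec / clock_hz
cpi = 1.0 / ipc

# Assumption for illustration only: the ceiling is one instruction
# per integer pipeline per cycle.
fraction_of_assumed_peak = ipc / integer_pipelines

print(f"IPC = {ipc:.3f}")
print(f"CPI = {cpi:.2f}")
print(f"Fraction of assumed integer peak = {fraction_of_assumed_peak:.1%}")
```

So each instruction costs close to four cycles on average, which is what you would expect if the front end is stalled on instruction fetch most of the time.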
So the first
question is, is IC_miss reporting only the on-chip instruction cache
stats, and if so is there a way to determine how many of those misses
also missed (or hit...I can do the math) in the external cache?
Yes. The EC_ic_miss event counts E$ misses from I$ requests.
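Once both counters are collected over the same interval, "the math" is a subtraction. A minimal sketch follows; the function name is made up for illustration, the EC_ic_miss value in the usage lines is a hypothetical placeholder (no such number appears in this thread), and it assumes every EC_ic_miss event corresponds to one IC_miss event.

```python
def split_icache_misses(ic_miss, ec_ic_miss):
    """Split I$ misses into those served by the E$ and those that missed it too.

    ic_miss    -- IC_miss count for the interval
    ec_ic_miss -- EC_ic_miss count for the same interval
    Assumes each EC_ic_miss event is also counted as an IC_miss event.
    """
    if ic_miss <= 0:
        raise ValueError("ic_miss must be positive")
    if not 0 <= ec_ic_miss <= ic_miss:
        raise ValueError("ec_ic_miss must be between 0 and ic_miss")
    ec_hits = ic_miss - ec_ic_miss          # I$ misses satisfied by the E$
    ec_miss_fraction = ec_ic_miss / ic_miss  # share that went to memory
    return ec_hits, ec_ic_miss, ec_miss_fraction


# IC_miss is the figure from the thread; EC_ic_miss is a HYPOTHETICAL value.
hits, misses, fraction = split_icache_misses(6_000_000, 500_000)
print(f"E$ hits: {hits}, E$ misses: {misses}, E$ miss fraction: {fraction:.1%}")
```

If the two events cannot be collected in the same run, the split is only as good as the repeatability of the workload between runs.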
Also,
since the instruction stream is pretty predictable once initiated, is
there a way for us to do explicit prefetches of "future functions"
that might help us reduce this phenomenon?
Currently, prefetch for instructions is not implemented on US3/3+/4.
The limited references
I've seen to using prefetch all discuss prefetching data but not
instructions. It's possible I'm totally barking up the wrong tree, but
that's the nice thing about starting from the beginning... there are
opportunities everywhere :-)
Something's extremely wrong - the US3 I$ has a 32-byte line, which means
8 instructions per line.
You're almost getting one I$ miss per 9 instructions,
meaning almost every I$ line access is an I$ miss.
That's almost impossible to achieve in a normal program.
What on earth is your code doing?
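The reasoning above, worked through as a sketch with the counts from the original message. It assumes purely sequential execution (no branches, no lines revisited), which is a simplification: it gives the number of I$ lines straight-line code would touch, not a model of the real program.

```python
LINE_BYTES = 32         # US3 I$ line size, from the text above
INSTRUCTION_BYTES = 4   # SPARC instructions are a fixed 4 bytes
instructions_per_line = LINE_BYTES // INSTRUCTION_BYTES

# Counts reported in the original message.
instructions = 54_000_000
ic_misses = 6_000_000

instructions_per_miss = instructions / ic_misses

# Assumption: straight-line execution, so each line is fetched exactly once.
lines_touched = instructions / instructions_per_line
miss_fraction_of_lines = ic_misses / lines_touched

print(f"instructions per I$ line: {instructions_per_line}")
print(f"instructions per I$ miss: {instructions_per_miss:.1f}")
print(f"lines touched if sequential: {lines_touched:,.0f}")
print(f"misses per line touched: {miss_fraction_of_lines:.1%}")
```

Under that assumption roughly 8 out of every 9 line fetches would have to miss, and any loop or repeated call that reuses a line should only push the miss count down from there.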
Seongbae
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org