David McDaniel wrote:
> Being insufficiently familiar with kernel goings-on, I've yet
> to find the answer to a question someone more familiar may know
> off the top of their head: are TLB contents saved and restored
> across context switches?

On SPARC, TLB entries are tagged with the context that created
them.  Thus, old TLB entries not displacement-flushed by the new
process's TLB misses are still available after a context switch
back to the original process.

On x86, a write to CR3 (the page-table base register) _appears_
to flush the TLB.  That isn't literally what happens on modern
CPUs, but one can code as if it were, since the CPU snoops the
memory locations backing the current TLB entries, current
context or not.

> Or are they simply invalidated and lazily restored upon thread
> resumption?  Or something altogether different?

> If you're interested enough to read this far, the reason for the
> question is that a certain application randomly accesses a fairly
> large dataset consisting of a number of memory-mapped files.  Its
> performance suffers from (among other things) high DTLB miss
> rates.  So, in addition to leveraging large pages in some cases,
> I had a couple of other ideas which are sort of client-server-ish
> but imply context switching.  If the TLBs are not saved and
> restored, that can only make the problem worse, and I won't waste
> my time going down that road.

Unless breaking the app into client and server processes lets
you either run on multiple cores (i.e., expand the available
TLB resources) or significantly improve temporal locality, this
isn't likely to help.

Your best bets for improving performance, probably in order:

0) try a T2000; this workload sounds like it would be a perfect
fit as long as there's no floating point.

1) improve application algorithms/data structures to improve
TLB locality.  Examples include contiguous allocation of
hash-chain blocks to avoid cache and TLB misses during
searches, and use of cache- and TLB-friendly heap allocators
such as libumem (see the first sketch below).

2) partition access to the data space across separate threads
bound to different CPUs (either cores or sockets); see the
second sketch below.

3) use large pages (see the third sketch below).
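
For 1), here's a minimal sketch of the contiguous hash-chain
idea (the names htab_t, node_t, etc. are mine, not from any
library): nodes come out of one preallocated slab instead of
individual malloc() calls, so walking a chain touches only a
few pages, and hence only a few TLB entries:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct node {
            struct node     *next;
            uint64_t        key;
            void            *value;
    } node_t;

    typedef struct {
            node_t  *slab;          /* one contiguous block of nodes */
            size_t  used;           /* next free slot in the slab */
            size_t  capacity;
            node_t  **buckets;
            size_t  nbuckets;
    } htab_t;

    int
    htab_init(htab_t *h, size_t nbuckets, size_t max_nodes)
    {
            h->slab = calloc(max_nodes, sizeof (node_t));
            h->buckets = calloc(nbuckets, sizeof (node_t *));
            if (h->slab == NULL || h->buckets == NULL)
                    return (-1);
            h->used = 0;
            h->capacity = max_nodes;
            h->nbuckets = nbuckets;
            return (0);
    }

    int
    htab_insert(htab_t *h, uint64_t key, void *value)
    {
            node_t *n;

            if (h->used == h->capacity)
                    return (-1);
            n = &h->slab[h->used++];        /* no per-node malloc() */
            n->key = key;
            n->value = value;
            n->next = h->buckets[key % h->nbuckets];
            h->buckets[key % h->nbuckets] = n;
            return (0);
    }

libumem's object caches (umem_cache_create(3MALLOC)) get you
much of this effect without rolling your own allocator.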
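
For 2), a sketch of binding worker threads to CPUs with
processor_bind(2), so each CPU's TLB only ever holds
translations for one slice of the data; worker_arg_t and the
slicing scheme are hypothetical:

    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <pthread.h>
    #include <stdio.h>

    typedef struct {
            processorid_t   cpu;    /* CPU this worker binds to */
            char            *base;  /* start of this worker's slice */
            size_t          len;    /* length of the slice */
    } worker_arg_t;

    static void *
    worker(void *arg)
    {
            worker_arg_t *w = arg;

            /* Bind the calling LWP to its designated CPU. */
            if (processor_bind(P_LWPID, P_MYID, w->cpu, NULL) != 0) {
                    perror("processor_bind");
                    return (NULL);
            }

            /* ... touch only [w->base, w->base + w->len) here ... */
            return (NULL);
    }

    int
    main(void)
    {
            pthread_t tid;
            worker_arg_t arg = { 0, NULL, 0 };  /* slice filled in for real */

            (void) pthread_create(&tid, NULL, worker, &arg);
            (void) pthread_join(tid, NULL);
            return (0);
    }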
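
And for 3), large pages can be requested on an existing mapping
with memcntl(2)/MC_HAT_ADVISE; the 4 MB size below is just an
assumption, so check what the hardware offers with
getpagesizes(3C), and note the range should be aligned to the
chosen page size:

    #include <sys/types.h>
    #include <sys/mman.h>

    int
    use_large_pages(caddr_t addr, size_t len)
    {
            struct memcntl_mha mha;

            mha.mha_cmd = MHA_MAPSIZE_VA;       /* resize this VA range */
            mha.mha_flags = 0;
            mha.mha_pagesize = 4 * 1024 * 1024; /* assumed page size */

            /* addr and len should be aligned to mha_pagesize. */
            return (memcntl(addr, len, MC_HAT_ADVISE,
                (caddr_t)&mha, 0, 0));
    }

For an unmodified binary, ppgsz(1) or the MPSS preload library
(mpss.so.1) can request large pages from outside the program.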

More details about the data structures, machine architecture
and CPU count would allow more targeted suggestions....

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
