David McDaniel wrote:
Being insufficiently familiar with kernel goings-on, I've yet to find the answer to something that someone more familiar may know off the top of their head: are the TLB contents saved and restored across context switches?
On SPARC, TLB entries are tagged with the context that created them, so a process's old entries that were not displaced by the new process's TLB misses are still usable when the original process is switched back in. On x86, a write to the context register CR3 _appears_ to flush the TLB. That isn't what actually happens on modern CPUs, but one can code as if it were, since the CPU snoops the memory locations containing the in-memory page-table entries, current context or not.
Or are they simply invalidated and lazily restored upon thread resumption? Or is it something altogether different?
If you're interested enough to have read this far, the reason for the question is that a certain application randomly accesses a fairly large dataset consisting of a number of memory-mapped files. Its performance suffers from (among other things) high DTLB miss rates. So, in addition to leveraging large pages in some cases, I had a couple of other ideas which are sort of client-server-ish but imply context switching. If the TLBs are not saved and restored, that can only make the problem worse, and I won't waste my time going down that road.
Unless breaking up the app into client and server processes lets you either run on multiple cores (e.g. expanding the available TLB resources) or significantly improves temporal locality, this isn't likely to help. Your best bets to improve performance, probably in order:

0) Try a T2000; this workload sounds like it would be perfect as long as there's no floating point.
1) Improve application algorithms/data structures to improve TLB locality. Examples include contiguous allocation of hash chain blocks to avoid cache and TLB misses during searches, use of cache- and TLB-friendly heap allocators such as libumem, etc.
2) Partition access to the data space across separate threads bound to different CPUs (either cores or sockets).
3) Use large pages.

More details about the data structures, machine architecture and CPU count would allow more targeted suggestions...

- Bart

--
Bart Smaalders
Solaris Kernel Performance
[EMAIL PROTECTED]
http://blogs.sun.com/barts
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org