On Thu, Apr 10, 2008 at 01:02:34PM -0700, Alexander Kolbasov wrote:
>
> It may be useful to observe prstat -mL which will report micro-state
> accounting data for each thread.
>

prstat -mL takes a *lot* of time with 16k threads. Nevertheless, I have
some further data: rerunning dtrace gave me the top 10 consumers of CPU
time on each of the two CPUs. The winner is "idle" on CPU0 (671 of 2029
samples; the next highest has 114 samples), and again "idle" on CPU1
(776 of 3891 samples; the next highest has 106 samples). Ordinary prstat
shows that my process is often in the sleep state.
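To put the sample counts above in proportion, here is a small Python
sketch that converts them to percentages. The counts are the ones quoted
above; the dictionary layout and the names (samples, share) are only
illustrative and are not part of the dtrace output.

```python
# Sample counts quoted above: idle, next-highest consumer, and total per CPU.
samples = {
    "CPU0": {"idle": 671, "next": 114, "total": 2029},
    "CPU1": {"idle": 776, "next": 106, "total": 3891},
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


idle_share = {cpu: share(s["idle"], s["total"]) for cpu, s in samples.items()}
next_share = {cpu: share(s["next"], s["total"]) for cpu, s in samples.items()}

for cpu in samples:
    print(f"{cpu}: idle {idle_share[cpu]:.1f}%, "
          f"next highest {next_share[cpu]:.1f}%")
```

On both CPUs "idle" accounts for several times as many samples as the
next-highest consumer.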
Furthermore, I do not think that the problem lies in TLB thrashing. Here
are three different runs:

  2^28 B total block size (256 MB), 2^14 B chunk size (= also
  2^{28-14} = 2^14 threads), 2^7 repetitions (= 2^35 B (32 GB)
  encrypted in total): 33.6 seconds

  2^24 B total block size (16 MB), 2^10 B chunk size (= again 2^14
  threads), 2^7 repetitions (= 2^31 B (2 GB) encrypted in total):
  25.4 seconds

A 16 MB block is only twice the TLB capacity (2048 entries x 4 kB =
8 MB). Lowering the block size to 4 MB (half the TLB capacity) gives
the following:

  2^22 B total block size (4 MB), 2^8 B chunk size (= 2^14 threads),
  2^7 repetitions (= 2^29 B (512 MB) encrypted in total): 24.5 seconds

==
Are there any backoff heuristics in the mutex/CV/whatever code that put
a thread to sleep under high contention? Adaptive mutexes? I'm off to
browse the OpenSolaris code on the net.
==

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
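
The arithmetic behind the three runs can be checked with a short Python
sketch. It assumes one thread per chunk and a TLB reach of 2048 entries
x 4 kB pages, as stated in the message; the names (runs, summarize,
TLB_REACH) are illustrative only, and the throughput figure is simply
total bytes divided by the reported wall-clock time.

```python
# Each run: (log2 block size, log2 chunk size, log2 repetitions, seconds).
runs = [
    (28, 14, 7, 33.6),
    (24, 10, 7, 25.4),
    (22, 8, 7, 24.5),
]

# TLB reach as stated in the message: 2048 entries x 4 kB pages = 8 MB.
TLB_REACH = 2048 * 4 * 1024


def summarize(block_log2, chunk_log2, reps_log2, seconds):
    """Derive thread count, total bytes, TLB ratio and throughput for a run."""
    block = 1 << block_log2          # bytes in one block
    threads = block >> chunk_log2    # assumes one thread per chunk
    total = block << reps_log2       # bytes encrypted over all repetitions
    return {
        "threads": threads,
        "total_bytes": total,
        "block_vs_tlb": block / TLB_REACH,
        "mb_per_s": total / seconds / 2**20,
    }


results = [summarize(*run) for run in runs]

for r in results:
    print(f"{r['threads']} threads, {r['total_bytes'] / 2**20:.0f} MB total, "
          f"block = {r['block_vs_tlb']:g} x TLB reach, "
          f"{r['mb_per_s']:.1f} MB/s")
```

All three runs use the same number of threads, and the wall-clock time
changes by well under a factor of two while the amount of data changes
by a factor of 64. That is at least consistent with the run time
depending more on the thread count than on the block size.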