Hi Thomas,
The thread_reaper() rate of 2500 threads/second is limiting your
throughput and sounds low. Why is it expensive? I have a theory.
Here is the code, with subroutine calls:
thread_reaper()
    for (;;)
        cv_wait();
        thread_reap_list()
            for each thread
                thread_free()
                    segkp_release()
                        segkp_find()
        delay(1 second);
The delay is an issue (though necessary for reasons I will not go into),
but not *the* issue. Here is segkp_find():
    i = SEGKP_HASH(vaddr);
    find the thread in hash chain list kpsd_hash[i];
and associated macros:
#define SEGKP_HASH(vaddr) \
        ((int)(((uintptr_t)vaddr >> PAGESHIFT) & SEGKP_HASHMASK))
#define SEGKP_HASHMASK  (SEGKP_HASHSZ - 1)
#define SEGKP_HASHSZ    256     /* power of two */
Your reap deficit of 1669150 - 1506682 = 162468 says you have
at least 162468 non-destroyed threads at one point, so I use that
figure in a model below. (FYI, your 8GB segkpsize would support
256K threads -- a default stack size of 24KB plus an 8KB redzone per
thread).
There are SEGKP_HASHSZ=256 hash buckets in kpsd_hash[]. However, the
hash distribution is poor. Each thread stack is 32KB including its
redzone, so stack addresses are 4 pages apart, but the macro above
shifts by only one page. The 2 low bits of the page index are
therefore always 0, and only 1/4 of the buckets -- ie 64 buckets --
are populated. On average those buckets have a list length of
162468 threads / 64 buckets = 2539 elements.
The list search visits half of the elements on average, and in the
worst case each visit incurs a remote memory latency of approx 300 ns
on the T5440, so the search cost per thread freed is 2539/2*300 ns =
380850 ns. The max rate per second is thus 1e9 ns / 380850 ns =
2625 threads/sec, and the delay(1 second) makes it worse. That is
pretty close to what you observe, for a very rough model. However,
my assumption that half of the list is visited on average may be
wrong: thread stacks are prepended to the hash chain on creation, and
threads are prepended to the deathrow on destruction, so the deathrow
traversal order will match the hash chain traversal order if threads
are destroyed in approximately the same order they are created.
If the theory is correct, the fix in Solaris is pretty easy: increase
SEGKP_HASHSZ and define a more robust hash function that distributes
both default and non-default stack sizes well. I will file a CR if
we can confirm the theory.
As a workaround, try setting a non-default stack size: in
/etc/system, set default_stksize=32768. The kernel will add an 8KB
redzone, so the total will be 40KB, and the 2 low hash bits will no
longer always be 0, so the hash distribution will be better.
Let me know if that helps.
- Steve
Hi all,
It seems like thread_reaper() can't keep up with massive thread
creation. I'm currently analyzing a problem with our
SAN virtualization: IO stalls for about 40 secs whenever the segkp
cache fills up. I tried increasing segkpsize from the default 2GB to
8GB, but that did not help -- I just have to do more IO, or do IO a
little longer, and the hang occurs as well. Yes, there's a big
problem with the virtualization agent running on our machines. My
test machine is a T5440 with 4x 1.4GHz and 128GB RAM. For every
single IO (= interrupt), the agent creates a thread to map the blocks
from virtual to physical. These threads are very short-lived, as they
exit immediately after handling one IO. I know this should be
accomplished with worker threads, or with threads handling a bulk of
IOs/interrupts...
So if I run a filebench with 8KB unbuffered writes on one of these
virtualized volumes, the test creates 1669114 ops in 6 minutes (~4600
IO/s, with a max of 15,000 IO/s). segkp is 8GB. Using dtrace, I can
see the number of threads reaped by the thread_reaper and created by
the agent, per second and in total. The two counts look roughly
equal, but in fact they are not: as the filebench stops I don't see
any more new threads from the agent, but thread_reaper is still
running for about 30 secs, reaping ~2500 threads per second. If I
stop the tracing directly after filebench has finished, I see this
result:
--- TOTALS ---
threads reaped 1506682
smv threads created 1669150
As I said, filebench did 1669114 ops in total, and we can see them
here again (1669150 threads). But there's a difference between reaped
and created threads of 162468 !! Since we are on a sparc machine with
8KB pagesize and 3 pages per thread stack, that is about 3.7GB of
freeable space. I expected the reaper to keep up with thread
creation, but I was wrong.
If the test does more IO (= more threads) and the runtime is
increased, segkp fills up completely and gets locked while threads
are freed. This is causing the IO hang of about 40 secs.
From my point of view, the thread_reaper algorithm is not optimal.
It must be possible to create this number of threads in a very short
time without running into filled caches. Is there anything else I
can do to "tune" up the thread_reaper?
Any comment regarding this problem is welcome. I am willing to test
your suggestions or to provide some more detailed information -
just let me know.
Thank you all for your help!
Best regards, Thomas
-- This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org