Gregs are awesome, apparently. Thanks for the confirmation. I know that threads are lightweight; it's just that this is the first time I've run into something that uses them... so liberally. ^_^
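In case anyone else wants to see just how liberally: the Threads: field in /proc/<pid>/status shows the live thread count for a process. A quick Python sketch (the nova-compute target and the pgrep call are only illustrative, swap in whatever process you care about):

import subprocess

def thread_count(pid):
    # /proc/<pid>/status carries a "Threads:" line with the current thread count.
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 0

# pgrep -f matches against the full command line; raises if nothing matches.
for pid in subprocess.check_output(["pgrep", "-f", "nova-compute"]).split():
    pid = int(pid)
    print("pid %d: %d threads" % (pid, thread_count(pid)))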
On Mon, Aug 26, 2013 at 10:07 AM, Gregory Farnum <g...@inktank.com> wrote:
> On Mon, Aug 26, 2013 at 9:24 AM, Greg Poirier <greg.poir...@opower.com> wrote:
> > So, in doing some testing last week, I believe I managed to exhaust the
> > number of threads available to nova-compute. After some investigation, I
> > found the pthread_create failure and increased nproc for our Nova user to,
> > what I considered, a ridiculous 120,000 threads, after reading that
> > librados will require a thread per OSD, plus a few for overhead, per VM on
> > our compute nodes.
> >
> > This made me wonder: how many threads could Ceph possibly need on one of
> > our compute nodes?
> >
> > 32 cores * an overcommit ratio of 16, assuming each VM is booted from a
> > Ceph volume, * 300 (the approximate number of disks in our soon-to-go-live
> > Ceph cluster) = 153,600 threads.
> >
> > So this is where I started to put the truck in reverse. Am I right? What
> > about when we triple the size of our Ceph cluster? I could easily see a
> > future where we have 1,000 disks, if not many more, in our cluster. How do
> > people scale this? Do you RAID to increase the density of your Ceph
> > cluster? I can only imagine that this would also drastically increase the
> > amount of resources required on my data nodes.
> >
> > So... suggestions? Reading?
>
> Your math looks right to me. So far, though, it hasn't caused anybody any
> trouble; Linux threads are much cheaper than people imagine when they're
> inactive. At some point we will certainly need to reduce the thread count
> of our messenger (using epoll on a bunch of sockets instead of 2 threads ->
> 1 socket), but that hasn't happened yet.
>
> In terms of things you can do if this does become a problem, the most
> prominent is probably to (sigh) partition your cluster into pods on a
> per-rack basis or something. This is actually not as bad as it sounds,
> since your network design probably would prefer not to send all writes
> through your core router; so if you create a pool for each rack and do
> something like "this rack, next rack, next row" for your replication, you
> get better network traffic patterns.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
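For reference, here is the back-of-envelope math from above as a quick sanity-check script, compared against the per-user nproc limit that tripped pthread_create for us. The inputs (32 cores, 16x overcommit, ~300 OSDs, one thread per OSD per VM) are just the assumptions quoted above, not measurements:

import resource

cores = 32         # cores per compute node (from the thread above)
overcommit = 16    # CPU overcommit ratio (assumption from the thread)
osds = 300         # approximate OSD count in the cluster (from the thread)

vms = cores * overcommit     # ~512 guests per node, worst case
threads = vms * osds         # roughly one thread per OSD connection, per VM
print("estimated librados threads per node: %d" % threads)   # 153600

# Compare against the per-user thread/process limit (what ulimit -u / nproc shows).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC: soft=%s hard=%s" % (soft, hard))
if soft != resource.RLIM_INFINITY and threads > soft:
    print("estimate exceeds the soft limit; expect pthread_create failures")

If the estimate lands anywhere near the soft limit, either raise nproc for the Nova user (as we did) or look at Greg's suggestion of partitioning the cluster into per-rack pods so each client talks to far fewer OSDs.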