On Wed, 4 Jan 2012, Chuck Swiger wrote:
Hi--
On Jan 4, 2012, at 2:23 PM, Dan The Man wrote:
It is not arbitrary. Systems ought to provide sensible limits, which can be adjusted if
needed and appropriate. The fact that a system might have 50,000 file descriptors
globally available does not mean that it would be OK for any random process to consume
half of them, even if there is still adequate room left for other tasks. It's common for
"ulimit -n" to be set to 256 or 1024.
Sensible limits means a sensible stock default, not the OS imposing a hard cap
on what an admin or developer can set on his own hardware.
In point of fact, protocols like TCP/IP impose limits on what is possible. It is in fact the job
of the OS to say "no" when a developer asks for a TTL of a million via setsockopt(),
because RFC-791 limits the maximum value of the "time to live" field to 255.
With IBM's new developments underway of 16-core Atom processors and hundreds
of gigabytes of memory, surely a backlog of 100k is manageable. Or what about
future 500-core systems with a terabyte of memory? A 100k listen queue
could be processed instantly.
Um. I gather you don't have much background in operating system design or
massively parallelized systems?
Due to locking constraints imposed by whatever synchronization mechanism and
communications topology is employed between cores, you simply cannot just add more
processors to a system and expect it to go faster in a linear fashion. Having 500 cores
contending over a single queue is almost certain to result in horrible performance. Even
though the problem of a bunch of independent requests is "embarrassingly
parallelizable", you do that by partitioning the queue into multiple pieces that
are fed to different groups or pools of processors to minimize contention over a single
data structure.
I guess you're calling me out to talk about what I'm doing based on that
statement:
The first framework I was working on a few weeks back just had the parent
bind the socket, then spawn a certain number of children to do the accept on
it, so the parent could focus on dealing with SIGCHLD and so on. I had
issues with this design for some reason: all the sockets were set to
non-blocking, and I was using kqueue to monitor the socket, but at random
I would see a 1-2 second delay from a child doing an accept. I was horrified
and changed the design quickly.
New design: the parent does all the accepts and passes the blocking work to
children via socketpairs it created when forking. Now, as for scaling on
multiple cores, each child can have its own core to do its blocking I/O on,
and each gets its own processor time. That isn't parallelism, but I never
said it was.
The better part of this design is that you have one process utilizing a
processor efficiently instead of paging the system with useless processes.
You could also have other machines connect in to the parent, and it could
treat them the same way it treats children over a socket, so in my opinion
it's more scalable and centralizes everything in one spot.
Obviously there are some cons to this design: you are passing data via
socketpairs instead of the child writing directly to the client.
To stress test this new design I simply wrote an asynchronous client
counterpart to create 100k connections to the parent's listen queue, then
have it go off writing to each socket; of course, as soon as I reached 60k
or so, the client would get numerous failures due to OS limits. My intention
was to see how long it would take the children to process a request and send
a response back to the client. Starting from a listen queue with 100k fd's
ready to go, I thought it would have been a really nice test, not only of
the application's speed but also of CPU usage, I/O usage, etc., with the
parent processing a client trying to talk to it 100k times at once, to
really see how kqueue does.
Without being able to increase simple limits like these, how are we ever
going to find where we can burn down the system and make it outperform
epoll() one day?
What is so bad about seeing how many fd's I could toss at kqueue before it
croaked? At 60k it was still handling it like a champ, with about 50
children getting handed work in my tests.
Yes. If the system doesn't handle connectivity problems via something like
exponential backoff, then the weak point is poor software design and not
FreeBSD being unwilling to set the socket listen queue to a value in the
hundreds of thousands.
I think what Arnaud and I are trying to say here is: let FreeBSD use a
sensible default value, but let the admin dictate the actual policy if he
chooses to change it for stress testing, future-proofing, or anything else.
FreeBSD does provide a sensible default value for the listen queue size. It's
tunable to a factor of about 1000 times larger, and is a value which is
sufficiently large to hold a backlog of several minutes worth of connections,
assuming you can process the requests at a very high rate to keep draining the
queue.
There probably isn't a reasonable use-case for queuing unprocessed requests for
longer than MAXTTL, which is about 4 minutes. So, it's conceivable in theory
for a high-volume server to want to set the listen queue to, say 1000 req/s *
255 (ie, MAXTTL), but I manage high volume servers for a living, and practical
experience including measurements of latency and service performance suggests
that tuning the listen queue up to on the order of a thousand or so is the
inflection point after which it is better/necessary for the software to
recognize and start doing overload mitigation than it is for the OS to blindly
queue more requests.
Put more simply, there comes a point where saying "no", ie, dropping the
connection with a reset, works better.
I agree; the listen queue will of course go back to something reasonable
when I'm done with testing.
Dan.
--
Dan The Man
CTO/ Senior System Administrator
Websites, Domains and Everything else
http://www.SunSaturn.com
Email: d...@sunsaturn.com
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"