Maybe what I said makes less sense in the case of NIO vs blocking with threads - I've mainly been working with Intel Threading Building Blocks lately, where the cost of cache cooling is very real. For that reason (and the others mentioned - context switches and lock preemption), Intel Threading Building Blocks will try to run a single thread per core and load-balance work amongst them.
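Roughly, that idea maps to the JVM as a fixed pool sized to the core count. This is only a minimal illustrative sketch (the names are mine), not how TBB itself is implemented - TBB's scheduler also does work stealing, which a plain fixed pool won't give you:

(import '(java.util.concurrent Executors ExecutorService Callable))

;; One software thread per hardware thread; the pool then
;; load-balances submitted tasks across the cores.
(def n-cores (.availableProcessors (Runtime/getRuntime)))

(def ^ExecutorService cpu-pool (Executors/newFixedThreadPool n-cores))

(defn submit-work
  "Run the no-arg function f on the core-sized pool; returns a Future."
  [^Callable f]
  (.submit cpu-pool f))

;; eg (submit-work #(reduce + (range 1000000)))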
I wouldn't say it's just one thing (eg context switching), but a combination of thread memory overheads, thread creation and destruction, context switching, cache cooling, false sharing and lock preemption. I also don't know how valid this is for Java/Clojure or web servers, though I don't see why it wouldn't be just as valid as for any other multicore code.

Finally, I don't think you can achieve true scalability by leaving a multicore processor's cores unused (ie, by using a purely asynchronous, single-threaded server) - you would want to use all the available cores. I'm just saying that having more threads than cores (or rather, more software threads than hardware threads) may hurt performance or scalability due to time-slicing overheads. Obviously it's more complicated than simply creating N worker threads for an N-core system though, since if any blocking IO is performed the cores are under-utilized (I sketch one common workaround at the end of this mail).

"just because you're not switching threads doesn't mean that different requests will not need to e.g. touch different cache lines"

Yes, of course! I didn't mean to imply that an asynchronous server would save you from this. However, in an asynchronous server (or, more importantly, in one where the number of software threads does not exceed the number of hardware threads) it becomes much more likely that a request is processed to completion before its data gets evicted from the cache (as long as care is taken to prevent false sharing with other, independent data sharing the same cache lines).

As for locks, holding locks for as short a time as possible is a well-known pattern, so I agree, the likelihood of switching away at exactly the wrong time (and having another thread then try to acquire that same lock) is very low - but it can and does still happen on occasion, and when it does, it can really hit performance. (Of course, the performance hit for a web application might not even be noticeable - otherwise there wouldn't be so many web apps written in PHP, Ruby and Python!)

Anyway, web servers aren't my area of expertise, so please ignore me if this isn't relevant to the discussion. Still, I am very interested to hear your and everyone else's real-world experiences.

As an aside, callbacks aren't really all that cache-friendly unless great care is taken. OO isn't the greatest model for cache-friendly multicore code either. Maybe that's one reason I like Clojure's sequence abstraction as much as I do.
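For what it's worth, here's the kind of thing I had in mind for the blocking-IO problem - a purely hypothetical sketch (handle-request, read-input and compute are stand-in names, not from any real server): keep the CPU pool at one thread per core, and push blocking IO onto its own elastic pool so the cores stay busy while threads are parked waiting on IO:

(import '(java.util.concurrent Executors ExecutorService Callable))

;; Core-sized pool for the CPU-bound part of each request...
(def ^ExecutorService cpu-pool
  (Executors/newFixedThreadPool
    (.availableProcessors (Runtime/getRuntime))))

;; ...and a separate cached pool that grows while threads block on IO
;; and shrinks again when they go idle.
(def ^ExecutorService io-pool (Executors/newCachedThreadPool))

(defn handle-request
  "read-input does the blocking IO; compute is the CPU-heavy part."
  [read-input compute]
  (.submit io-pool
           ^Callable (fn []
                       (let [data (read-input)]
                         (.submit cpu-pool
                                  ^Callable (fn [] (compute data)))))))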
On 8 July 2010 23:41, Peter Schuller <peter.schul...@infidyne.com> wrote:
> > Under heavy load, this can be quite costly, especially if each request
> > requires non-trivial processing (ie, enough to make time-slicing kick
> > in).
>
> This doesn't really jibe with reality as far as I can tell; if
> anything it is the exact opposite of reality. If you're doing
> significant work in between doing I/O calls (which tend to be context
> switching points), even to the point of usually yielding only to
> pre-emptive switching resulting from exceeding your time slice, the
> relative overhead of threading should be much less (usually) than if
> you're just doing a huge number of very, very small requests.
>
> Whatever the extra cost is in a thread context switch compared to an
> application context switch (and make no mistake, it's effectively
> still a context switch; just because you're not switching threads
> doesn't mean that different requests will not need to e.g. touch
> different cache lines, etc), that becomes more relevant as the amount
> of work done after each switch decreases.
>
> The cost of time slicing while holding a lock is real, but if you have
> a code path with a high rate of lock acquisition in some kind of
> performance-critical situation, presumably you're holding locks for
> very short periods of time and the likelihood of switching away at
> exactly the wrong moment is not very high.
>
> Also: remember that syscalls are most definitely not cheap, and an
> asynchronous model doesn't save you from doing syscalls for the I/O.
>
> > So, between memory overheads, cost of creating and destroying threads
> > and context switching, using a synchronous model can be extremely
> > heavyweight compared to an asynchronous model. It's no surprise that
> > people are seeing much better throughput with asynchronous servers.
>
> In my experience threading works quite well for many production tasks,
> though not all (until we get better "vertical" (all the way from the
> language to the bare metal) support for cheaper threads). The
> maintenance and development costs associated with writing complex
> software in callback form, with all state explicitly managed,
> disabling any use of sensible control flow, exceptions, etc, are very
> easy to under-estimate in my opinion. It also makes whether or not a
> call *might* do I/O part of the public interface of every single call
> you ever make, which is one particular aspect I really dislike, aside
> from the callback orientation.
>
> You also need to consider latency. While some flawed benchmarks where
> people throw some fixed concurrency at a problem will show that
> latency is poor with a threaded model in comparison to an async
> model, under an actual reasonable load, where the rate of incoming
> requests is not infinitely high, the fact that you're doing
> pre-emption and scheduling across multiple CPUs will mean that
> individual expensive requests don't cause multiple other, smaller
> requests to wait for them to complete their bit of work. So again,
> for CPU-heavy tasks, this is another way in which a threaded model
> can be better, unless you very carefully control the amount of work
> done in each reactor loop (presuming the reactor pattern) in the
> asynchronous case.
>
> As far as I can tell, the advantages of an asynchronous model mostly
> come in cases where you either (1) have very high concurrency or (2)
> are doing very, very little work for each unit of I/O done, such that
> the cost of context switching is at its most significant.
>
> My wet dream is to be able to utilize something like Clojure (or
> anything other than callback/state-machine-based models) on top of an
> implementation where the underlying concurrency abstraction is in fact
> really efficient (in terms of stack sizes and in terms of switching
> overhead). In other words, the day when having a few hundred thousand
> concurrent connections does *not* imply that you must write your
> entire application to be event based is when I am extremely happy ;)
>
> --
> / Peter Schuller

--
Daniel Kersten.