On 28.11.2014 at 13:45, Paolo Bonzini wrote:
>
> On 28/11/2014 13:39, Peter Lieven wrote:
>> On 28.11.2014 at 13:26, Paolo Bonzini wrote:
>>> On 28/11/2014 12:46, Peter Lieven wrote:
>>>>> I get:
>>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s,
>>>>> 247 ns per coroutine
>>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if
>>>> we have more than one thread in need of a lot of coroutines?
>>> Overall the algorithm is expected to adapt. The N threads contribute
>>> to the global release pool, so the pool will fill up N times faster
>>> than if you had only one thread. There can be some variance, which is
>>> why the maximum size of the pool is twice the threshold (and probably
>>> could be tuned better).
>>>
>>> Benchmarks are needed on real I/O too, of course, especially with
>>> high queue depth.
>> Yes, cool. The atomic operations are a bit tricky at first glance ;-)
>>
>> Question:
>> Why is the pool_size increment atomic and the set to zero not?
> Because the set to zero is not a read-modify-write operation, so it is
> always atomic. It's just not sequentially consistent (see
> docs/atomics.txt for some info on what that means).
>
>> Idea:
>> If the release_pool is full, why not put the coroutine in the
>> thread-local alloc_pool instead of throwing it away? :-)
> Because you can waste at most 64 coroutines per thread. But the numbers
> are nothing to sneeze at, so it's worth doing as a separate patch.
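For reference, the scheme under discussion boils down to something like the
following self-contained C11 sketch. The names (release_pool, alloc_pool,
POOL_BATCH_SIZE) follow the patch, but this is a simplification: QEMU's
QSLIST atomics are replaced by a plain lock-free (Treiber) stack, and
coroutine_new()/coroutine_free() are stand-ins for the real allocation path.

#include <stdatomic.h>
#include <stdlib.h>

#define POOL_BATCH_SIZE 64

typedef struct Coroutine Coroutine;
struct Coroutine {
    Coroutine *next;
    /* ... stack, entry point, etc. ... */
};

/* Global release pool, shared by all threads. */
static _Atomic(Coroutine *) release_pool;
static atomic_uint release_pool_size;

/* Per-thread allocation pool; plain accesses, no atomics needed. */
static _Thread_local Coroutine *alloc_pool;
static _Thread_local unsigned alloc_pool_size;

static Coroutine *coroutine_new(void)
{
    return calloc(1, sizeof(Coroutine));
}

static void coroutine_free(Coroutine *co)
{
    free(co);
}

void coroutine_release(Coroutine *co)
{
    if (atomic_load(&release_pool_size) < POOL_BATCH_SIZE * 2) {
        /* Lock-free push onto the global pool. */
        co->next = atomic_load(&release_pool);
        while (!atomic_compare_exchange_weak(&release_pool, &co->next, co)) {
            /* On failure, co->next was updated to the current head; retry. */
        }
        /* The increment is a read-modify-write, so it must be atomic. */
        atomic_fetch_add(&release_pool_size, 1);
        return;
    }
    /* Pool full: throw the coroutine away (or, per the idea above,
     * stash it in the thread-local alloc_pool instead). */
    coroutine_free(co);
}

Coroutine *coroutine_alloc(void)
{
    if (!alloc_pool && atomic_load(&release_pool_size) > POOL_BATCH_SIZE) {
        /* "Steal" the whole global pool in one exchange. */
        alloc_pool = atomic_exchange(&release_pool, NULL);
        alloc_pool_size = atomic_load(&release_pool_size);
        /* The set-to-zero is a single store: atomic on its own, just not
         * sequentially consistent.  The count may drift slightly from the
         * actual list contents; it is only a heuristic. */
        atomic_store_explicit(&release_pool_size, 0, memory_order_relaxed);
    }
    if (alloc_pool) {
        Coroutine *co = alloc_pool;
        alloc_pool = co->next;
        alloc_pool_size--;
        return co;
    }
    return coroutine_new();
}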
What do you mean by that? If I use dataplane, I fill the global pool and
never use it; allocation then comes from thread-local storage only. So I get
the same numbers as with my thread-local-storage-only version.

Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 ceiling according to
what is really attached. If only dataplane or ioeventfd users are attached,
it could be POOL_BATCH_SIZE * 0, and we would not even waste those
coroutines oxidizing in the global pool. A sketch of this idea follows
below.

Peter
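A minimal sketch of that tuning idea, assuming a hypothetical attach/detach
hook for pool users: coroutine_pool_user_attach(), its detach counterpart,
and global_pool_max are illustrative names, not existing QEMU API.
coroutine_release() above would then test against global_pool_max instead of
the POOL_BATCH_SIZE * 2 constant.

#include <stdatomic.h>
#include <stdbool.h>

#define POOL_BATCH_SIZE 64

/* Stays 0 while only dataplane/ioeventfd users exist, so nothing is left
 * oxidizing in the global pool; each main-loop user raises the ceiling by
 * one batch's worth. */
static atomic_uint global_pool_max;

void coroutine_pool_user_attach(bool uses_main_loop)
{
    if (uses_main_loop) {
        atomic_fetch_add(&global_pool_max, POOL_BATCH_SIZE * 2);
    }
}

void coroutine_pool_user_detach(bool uses_main_loop)
{
    if (uses_main_loop) {
        atomic_fetch_sub(&global_pool_max, POOL_BATCH_SIZE * 2);
    }
}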