Kristján Valur Jónsson <krist...@ccpgames.com> added the comment:

The counter is "stall cycles".
During the 10 second run on my 2.4 GHz CPU, we had instruction cache miss stalls 
for 2 billion cycles (2,000 samples of 1,000,000 cycles per sample).  That 
accounts for around 10% of the available CPU.
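
As a rough back-of-the-envelope check (using the sampling interval of 1,000,000 
cycles per sample quoted above):

    2,000 samples x 1,000,000 cycles/sample = 2.0e9 stall cycles
    2.4 GHz x 10 s                          = 2.4e10 cycles available
    2.0e9 / 2.4e10                          ~ 8%, i.e. around 10%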

I'm observing something like a 20% slowdown, though, so there are probably other 
contributing causes.

Profiling another counter, "instruction fetches", I see this for a "fast" run:
Functions Causing Most Work
Name                 Samples    %
Unknown Frame(s)      10,733    99.49

and for a slow run:
Functions Causing Most Work
Name                 Samples    %
Unknown Frame(s)       8,056    99.48

This shows a drop of roughly 20-25% in fetched instructions over the interval 
(five seconds this time).  Ideally, we would see 12,000 samples in the fast case 
(2.4 GHz, 5 s), but we see roughly 10,700 due to the cache misses that occur even 
in that case.  The additional cache misses in the "slow" case cause effective 
instruction fetches to drop by a further 20-25% on top of that.
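
Back-of-the-envelope again, assuming one sample per 1,000,000 instruction fetches 
and an ideal rate of one fetch per cycle:

    2.4e9 fetches/s x 5 s = 1.2e10 fetches -> ~12,000 samples ideal
    fast run: 10,733 samples  (baseline cache misses)
    slow run:  8,056 samples; 8,056 / 10,733 ~ 0.75, a further ~20-25% drop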

I think that this is proof positive that the slowdown is due to instruction 
cache misses, at least on this dual-core Intel machine that I am using.

As for "the OS should handle this", I agree.  But it doesn't.  We are doing 
something unusual:  Convoying two (or more) threads allowing only one to run at 
a time.  The OS scheduler isn't built for that.  It can only assume that there 
will be some parallel execution and so it thinks that it is best to put the two 
sequential threads on different cpus.  But it is wrong, so the cost associated 
with cache misses outweighs the benefit of running on another core (zero, in 
our case).

So, the OS won't handle it, no matter how hard we wish that it would.  We are the 
ones who know how these gridlocked threads behave, and we know it far better than 
any OS scheduler can guess.  So, rather than beat our heads against the rock, 
I'm going to try to come up with a useful heuristic for when to switch cores 
and when not to.  It would be useful as a diagnostic tool, if nothing else.

Ok, so we have established two things, I think:
1) The poor response of IO threads in the presence of CPU threads on 
thread_pthread.h implementations (on multicore) is because of the greedy GIL wait 
semantics in the current GIL.  It's easily fixable by using the ROUNDROBIN_GIL 
implementation I've shown (a minimal sketch of the idea follows after this list).
2) The poor performance of competing CPU threads on multicore machines is due 
to the instruction cache behaviour of non-overlapping thread execution on 
different cores.
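
For illustration only -- this is just a sketch of the fair-handoff idea, not the 
actual ROUNDROBIN_GIL patch attached to this issue, and the names (fair_lock etc.) 
are made up for the example -- a FIFO "ticket" lock on top of pthreads.  Each 
waiter takes a ticket and the lock is granted strictly in ticket order, so a 
releasing thread cannot immediately re-acquire it ahead of threads that are 
already waiting:

#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    unsigned long   next_ticket;   /* next ticket to hand out */
    unsigned long   now_serving;   /* ticket currently allowed to hold the lock */
} fair_lock;

static void
fair_lock_init(fair_lock *l)
{
    pthread_mutex_init(&l->mutex, NULL);
    pthread_cond_init(&l->cond, NULL);
    l->next_ticket = 0;
    l->now_serving = 0;
}

static void
fair_lock_acquire(fair_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    unsigned long my_ticket = l->next_ticket++;
    /* Wait until it is our turn; waiters are served strictly FIFO. */
    while (my_ticket != l->now_serving)
        pthread_cond_wait(&l->cond, &l->mutex);
    pthread_mutex_unlock(&l->mutex);
}

static void
fair_lock_release(fair_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    l->now_serving++;
    /* Wake everyone; only the thread holding the next ticket proceeds. */
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}

The point is simply that handing the lock to the longest waiter is what stops a 
CPU-bound thread from starving the IO threads; the real change would live in the 
GIL code in ceval.c / thread_pthread.h.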

We can fix 1) easily, even with a much less invasive patch than the ones I have 
put in here.  I'm a bit surprised at the apparent lack of interest in such an 
obvious bug / fix.

As for 2), well, see above.  There is nothing we can do, really, except identify 
those cases where we are releasing the GIL just to yield (one case, actually, in 
ceval.c) and try to instruct the OS not to switch cores in that case.  I'll see 
what I can come up with.
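
As a starting point for that experiment, something like the following could pin 
the convoyed threads onto a single core so that they share an instruction cache.  
This is purely illustrative: pthread_setaffinity_np() is a GNU/Linux extension, 
other platforms need their own affinity call, and the helper name is made up here:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the given CPU; returns 0 on success. */
static int
pin_current_thread_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Whether (and when) pinning actually pays off is exactly the heuristic question 
above.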

Cheers.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8299>
_______________________________________