http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47031

--- Comment #4 from Nicola Pero <nicola at gcc dot gnu.org> 2011-01-07 17:56:01 
UTC ---
Jonathan

thanks for your comments - they are very useful and we certainly want to look
at performance. ;-)

I'm not terribly convinced by using spinlocks in this context, but I'm happy to
be convinced otherwise. :-)

---

To get a sense of the order of magnitude, I wrote a short program that calls a
synthesized atomic id setter 100 million times in a loop.  I used GCC 4.6 to
compile, and gnustep-make/gnustep-base from trunk:

 * with the standard libobjc (objc_mutex_lock/objc_mutex_unlock), it takes
approximately 16.8 seconds to do the 100m setter calls

 * if I replace objc_mutex_lock with pthread_mutex_lock and objc_mutex_unlock
with pthread_mutex_unlock, it takes approximately 15.3 seconds (this basically
tests how fast we can go if we rework the objc_mutex... wrapper API)

 * if I comment out all the objc_mutex_lock/objc_mutex_unlock entirely, it
takes approximately 8.7 seconds.

This means that, even assuming spinlocks are infinitely faster than mutexes, in
the best possible conditions they would speed up the accessors (or, at least,
the setter, but the getter should behave the same) by at most about 2x.
That's a great speedup by the way, but it's important to keep in mind that the
maximum speedup we can theoretically ever get for accessors is about 2x, not
10x or 100x. ;-)

---

Anyway, the key problem is that spinlocks are more wasteful (and less
predictable?) than mutexes when there is contention and the ratio of active
threads to active cores is unfavourable.  Inside a kernel, when working on a
specific part, it is easy to assess these factors.  But libobjc is providing a
general-purpose, portable user-space facility for atomic, synchronized access
to an unspecified property of an ObjC object.  It is hard to see how you can
guarantee anything about contention or the ratio of active threads to active
cores in that context. :-(

I guess I'd feel a bit more confident if we were to experiment and benchmark
the worst-case scenario for spinlocks ... i.e., how badly can they really
perform compared to mutexes, and how easy or hard is it for them to go bad in
the case of accessors.

Anyway, I understand you think spinlocks will always be faster than mutexes in
this case, because "My guess is that usually the spinlock is not held", but the
problem is that that is a guess, and it depends on how the accessors are used.
;-)

In particular --

 * it matters how often you call the accessors; don't forget that in the getter 
a method call (-retain) is inside the critical zone.  Even ignoring that the 
method call may trigger the execution of +initialize for the class (which would 
then execute an unbounded amount of user code inside the critical zone, but 
which is fairly theoretical, as the object has most likely been allocated using 
+alloc before getting a -retain, meaning +initialize should already have run), a 
method call still constitutes about 25% of the time taken by the getter 
(ignoring locking overheads), so if your program spends 100% of its time 
calling a getter from two threads, either thread has up to a 25% chance of 
hitting the lock while it is held by the other;

 * it matters how many threads you have.  If each thread spends 10% of its time 
calling a bunch of getters guarded by the same lock, then each spends 2.5% of 
its time inside the critical zone.  If you have 2 threads, there is a 2.5% 
chance that your thread will try to enter the critical zone while the other 
thread is inside it.  But if you have 100 threads, the chances are much, much 
higher! (Having hundreds of threads is pretty common in server software where 
you are processing large numbers of independent requests simultaneously; they 
should be lightly coupled, so there is little synchronization to do; but in the 
case of synthesized property accessors, unfortunately, the locks are shared by 
completely unrelated objects, so even if the 100 threads are completely 
unrelated, as soon as they use atomic synthesized accessors they will be using 
and competing for the same locks, exactly as if they were constantly 
synchronizing access to some shared resource!  Are spinlocks still appropriate
in this context?)

 * it matters how many active cores you have.  If a thread tries to enter the
critical zone while another thread is holding the lock, then with mutexes it
doesn't matter whether the other thread is running or suspended - there will be
a context switch in either case; but with spinlocks it does matter.  If the
lock-holding thread is running on a different, active core, the first thread
should waste a bit of CPU and then get in fine.  But if the lock-holding thread
is scheduled on the same core, it can make no progress while the first thread
spins, so the first thread will waste CPU time before being suspended - in that
case a spinlock is both slower and more CPU-wasteful than a mutex.

 * I also wonder, with all the CPUs that GCC supports, whether the GCC atomic
builtins are efficient on all platforms; if the builtins require function calls
on some platform, they may actually make spinlocks less attractive than mutexes
there (we can obviously work around this by using spinlocks on the two or three
major platforms where we know they work well, and mutexes on the others).

Maybe the right way forward would be to experiment with some worst-case 
benchmarks for spinlocks vs mutexes.  Spinlocks can speed up accessors by up to 
2x in the right conditions.  How much can they slow them down in the wrong 
conditions?  E.g., if you start up 100 threads on a 1-core machine and have
them all call a getter concurrently 1m times each, how much slower would the
program run with spinlocks?

I would have no problem changing my mind and supporting the use of spinlocks if 
there is some evidence that they can't go *too* wrong ;-)

At the moment, I feel the performance trade-offs are unclear and we'd be using
spinlocks only because Apple is using them in their runtime.

Thanks
