Have a look at the ARM document "Barrier Litmus Tests and Cookbook", and especially section 7.2 "Acquiring and Releasing a Lock".
After reading this document, I came to the conclusion that the coherence() call in the unlock() function in port/taslock.c belongs before zeroing l->key instead of after it, so that writes made while holding the lock are visible to other processors before the lock appears free. I made the change only for the bcm kernel because I haven't researched the memory semantics of all the other CPU architectures. I think it would probably be safe to make the change in ../port, but someone else can make that decision. I think it would also be safe to remove the coherence() call after zeroing l->key, but I kept it in for paranoia's sake.
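To illustrate the ordering, here is a minimal user-space sketch in portable C11. It is not the actual taslock.c code. The Lock struct is cut down to just the key, tas() is a stand-in built on an atomic exchange, and coherence() is assumed to be a full memory barrier (modelled here with a sequentially consistent fence). All of these are assumptions for illustration only.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's Lock; only the key matters here. */
typedef struct Lock Lock;
struct Lock {
	atomic_int key;
};

/* Stand-in for coherence(): assumed to be a full memory barrier. */
static void
coherence(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Stand-in test-and-set: stores 1 and returns the previous value. */
static int
tas(atomic_int *key)
{
	return atomic_exchange_explicit(key, 1, memory_order_relaxed);
}

/* Try once to take the lock; returns 1 on success, 0 if already held. */
static int
canlock(Lock *l)
{
	if(tas(&l->key) != 0)
		return 0;
	coherence();	/* barrier after acquiring, before touching protected data */
	return 1;
}

/* Spin until the lock is taken. */
static void
lock(Lock *l)
{
	while(tas(&l->key) != 0)
		;
	coherence();	/* barrier after acquiring, before touching protected data */
}

static void
unlock(Lock *l)
{
	coherence();	/* proposed placement: barrier BEFORE clearing the key */
	atomic_store_explicit(&l->key, 0, memory_order_relaxed);
	coherence();	/* the original placement, kept only out of paranoia */
}
```

The first barrier in unlock() is the one that matters: without it, another processor could see key == 0 and enter the critical section before it sees the data written by the previous holder.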