it looks like you are comparing these two functions

	void
	loopxinc(void)
	{
		uint i, x;

		for(i = 0; i < N; i++){
			_xinc(&x);
			_xdec(&x);
		}
	}

	void
	looplock(void)
	{
		uint i;
		static Lock l;

		for(i = 0; i < N; i++){
			lock(&l);
			unlock(&l);
		}
	}

but the former does two operations and the latter only one.
your claim was that _xinc is slower than incref (== lock(), x++, unlock()).
but you are timing xinc+xdec against incref.
assuming xinc and xdec are approximately the same cost
(so i can just halve the numbers for loopxinc),
that would make the fair comparison produce:

	intel core i7 2.4ghz
	loop       0 nsec/call
	loopxinc  10 nsec/call  // was 20
	looplock  11 nsec/call

	intel 5000 1.6ghz
	loop       0 nsec/call
	loopxinc  22 nsec/call  // was 44
	looplock  25 nsec/call

	intel atom 330 1.6ghz (exception!)
	loop       2 nsec/call
	loopxinc   7 nsec/call  // was 14
	looplock  22 nsec/call

	amd k10 2.0ghz
	loop       2 nsec/call
	loopxinc  15 nsec/call  // was 30
	looplock  20 nsec/call

	intel p4 xeon 3.0ghz
	loop       1 nsec/call
	loopxinc  38 nsec/call  // was 76
	looplock  42 nsec/call

which looks like a much different story.

russ