On Aug 5, 2009, at 05:06, Andy Wingo wrote:
> Guile's VM does not yet have opcodes for +1 and -1. It probably
> should, though. Still, this is a fairly artificial test.
Yeah, and it turned out to be useless for what I had originally wanted
to test, but I was still curious to see how it would work out. Based
on your analysis of the "while" issue, it proved interesting. :-)
> So, you're also timing Guile startup here, which takes about 20 ms.
Yes; except for the smallest cases, startup seems to be in the noise.
Poking at it a couple of times with Apple's process sampling tool, it
does seem to be spending a bit of time in garbage collection -- but
scm_leave_guile (called from scm_async_click via
scm_pthread_mutex_lock) and getrusage (called from
scm_c_get_internal_run_time via mytime) also show up on the stack a
significant fraction of the time, and each of those appears to be a
comparable, if not bigger, time sink.
> This could be simply that Mac OS has different performance
> characteristics than Linux. If locking uncontended mutexen or getting
> the resource usage of a process is expensive, that doesn't sound very
> good...
From my experience with profiling code on another project, mutexes
definitely appear more expensive on Mac OS X than on Linux/glibc when
there's contention. I just ran a quick test on a couple of x86
machines, and it does appear that the Mac OS X version of
pthread_mutex_lock+pthread_mutex_unlock takes half again as many
cycles as the GNU/Linux version, even without contention. (A test with
10000000 iterations of getrusage(RUSAGE_SELF) shows a more drastic
difference between OSes -- in terms of cycle counts, more than 3.5x as
expensive on Mac OS X as on GNU/Linux.)
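
For the record, the quick test was along these lines; this is a
minimal sketch rather than the exact program I ran, and the rdtsc()
helper is just the usual x86 cycle-counter read:

#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <sys/resource.h>

/* Read the x86 time-stamp counter.  */
static inline uint64_t
rdtsc (void)
{
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}

int
main (void)
{
  enum { N = 10000000 };
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  struct rusage ru;
  uint64_t start;
  int i;

  /* Uncontended lock/unlock: single thread, nobody else ever holds
     the mutex.  */
  start = rdtsc ();
  for (i = 0; i < N; i++)
    {
      pthread_mutex_lock (&m);
      pthread_mutex_unlock (&m);
    }
  printf ("lock+unlock: %.1f cycles/iter\n",
          (double) (rdtsc () - start) / N);

  start = rdtsc ();
  for (i = 0; i < N; i++)
    getrusage (RUSAGE_SELF, &ru);
  printf ("getrusage:   %.1f cycles/iter\n",
          (double) (rdtsc () - start) / N);

  return 0;
}

The absolute numbers depend on the CPU, of course; it's the Mac OS X
vs GNU/Linux ratio that's interesting.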
But it's not pthread_mutex_lock itself that shows up most often; it's
sigaltstack and sigprocmask, called by scm_leave_guile. Some thoughts
come to mind, but I don't have much time to work with them right now;
maybe later:
(1) In scm_pthread_mutex_lock, we leave and re-enter guile mode so
that we don't block the thread while in guile mode. But we could use
pthread_mutex_trylock first, and avoid the costs scm_leave_guile seems
to incur on the Mac. If we can't acquire the lock, trylock returns
immediately, and then we can do the expensive, blocking version. A
quick hacked-up version of this changed my run time for A(3,8) from
17.5s to 14.5s, saving about 17%; sigaltstack and sigprocmask are
still in the picture, because they're called from
scm_catch_with_pre_unwind_handler. I'll work up a nicer patch later.
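
To make the idea concrete, here's a rough sketch -- not the actual
patch, and the real scm_pthread_mutex_lock doesn't look quite like
this; I'm using the public scm_without_guile to stand in for the
leave/re-enter dance, and the function names are made up:

#include <pthread.h>
#include <libguile.h>

static void *
do_blocking_lock (void *mutex)
{
  pthread_mutex_lock (mutex);
  return NULL;
}

static void
lock_with_trylock_fast_path (pthread_mutex_t *mutex)
{
  if (pthread_mutex_trylock (mutex) == 0)
    /* Uncontended (the common case): we already hold the lock and
       never paid for scm_leave_guile -- no sigaltstack, no
       sigprocmask.  */
    return;

  /* Contended: fall back to the current behavior and block outside
     guile mode, so GC and friends aren't held up while we wait.  */
  scm_without_guile (do_blocking_lock, mutex);
}

The point is just that the uncontended path never touches
scm_leave_guile at all.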
(2) We're locking a global mutex to check if there's an async item
pending for the current thread. First, why not a per-thread mutex so
we don't have to potentially stop *all* threads just to see if the
current one has async work to do? But second, it looks to me like by
far the common case will be checking the async queue when it's empty
-- not changing it, and not checking it when it's non-empty. If
that's the case, can we use some other mechanism, perhaps atomic queue
manipulation operations (like Apple's OSAtomic{Enqueue,Dequeue}) or
atomic compare-and-swap, to speed up the common case? It introduces
some portability problems, though, and if pthread_mutex_trylock helps
enough, maybe it doesn't matter.
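
Here's the sort of thing I have in mind, as a rough sketch; the names
are made up, and it uses GCC's __sync builtins rather than the
Apple-specific OSAtomic calls, which is exactly where the portability
question comes in:

/* A per-thread, singly linked list of pending asyncs; NULL means
   "nothing pending", which should be the overwhelmingly common
   case.  */
struct pending_async
{
  struct pending_async *next;
  void (*func) (void *);
  void *data;
};

/* Enqueue CELL onto *HEAD without taking any mutex.  */
static void
async_push (struct pending_async *volatile *head,
            struct pending_async *cell)
{
  struct pending_async *old;
  do
    {
      old = *head;
      cell->next = old;
    }
  while (!__sync_bool_compare_and_swap (head, old, cell));
}

/* Detach and return the whole pending list, or NULL if it's empty.
   The empty check is just a load -- no lock, no scm_leave_guile.  */
static struct pending_async *
async_take_all (struct pending_async *volatile *head)
{
  struct pending_async *old;

  if (*head == NULL)
    return NULL;

  do
    old = *head;
  while (!__sync_bool_compare_and_swap (head, old, NULL));
  return old;
}

With a per-thread head like this, checking for pending asyncs in the
common (empty) case is a single load, with no global mutex and no
leaving guile mode.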
(3) My four-year-old comments on scm_enter/leave_guile, recorded in
threads.c around line 300, still stand.... Those functions really
ought to go away. At least they're confined to one file, now. Some
of it looks a little messy, but I can probably get rid of some of the
uses....
(4) Yeah, yeah, I should run gprof already. :-)
Ken