Here's a small running "internal dialogue" to demonstrate how the use of a profiler can help detect unexpected performance problems.
--- I was trying to trace a compilation performance issue in Whalesong. One of the things I've added to help me chase this is an internal debugging flag "--enable-profiling", that automatically runs Racket's statistical profiler as it's doing its work. So I run Whalesong on one of my programs and watch what comes out of the profiler. From earlier discussion, I run Racket without the JIT so that the profiler presents more useful information. ###### $ racket --no-jit ../../whalesong.rkt build --enable-profiling nucleic2.rkt ###### ... lots and lots of profiler output starts printing to screen. (It would be _very_ useful if the text renderer of the profiler were sorted in terms of self time, by the way! Can this be changed?) The profiler generates a sequence of "node" outputs, where each node represents a function call and its context. I start scanning down the "self" time column, which is how much time is being spent doing work in a particular function. One of the functions in the profile jumps out at me: ####################################################################### Profiling results ----------------- Total cpu time observed: 9144ms (out of 9268ms) Number of samples taken: 3595 (once every 3ms) ... [106] 1822(20.0%) 1822(20.0%) for-loop /home/dyoo/local/racket-5.2.1/lib/racket/collects/compiler/zo-parse.rkt:490:2 ####################################################################### Oh! About twenty percent of the time is being spent here, in parsing zo files! Whoa. Why? Let me take a look at that snippet of code. Here's is the surrounding context: ;; in zo-parse.rkt ;;;;;;;;;;;;;;;;; (define cpt-table ;; The "schcpt.h" mapping `([0 escape] [1 symbol] ;; ... ;; dyoo: omitting some output ;; ... [247 255 small-application])) (define (cpt-table-lookup i) (for/or ([ent (in-list cpt-table)]) (match ent [(list k sym) (and (= k i) (cons k sym))] [(list k k* sym) (and (<= k i) (< i k*) (cons k sym))]))) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; cpt-table-lookup defines the for-loop that the profiler reports as being a hot-spot. Oh, wait! The function looks odd. What cpt-table-lookup is doing is a lot of work, and I expected it to be a simple, constant table lookup. (In a sense, it is "constant", but the constant is larger than it should be.) Since cpt-table is itself a constant value, that leaves cpt-table-lookup ripe for an optimization: I can precompute a flat lookup table. I'll fix cpt-table-lookup by creating a vector, up-front, that flattens out the cpt-table. Here's what that looks like: ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (define cpt-table-as-vector (make-vector 256)) (for ([ent (in-list cpt-table)]) (match ent [(list k sym) (vector-set! cpt-table-as-vector k (cons k sym))] [(list k k* sym) (for ([i (in-range k k*)]) (vector-set! cpt-table-as-vector i (cons k sym)))])) (define (cpt-table-lookup i) (vector-ref cpt-table-as-vector i)) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Ok, let's see what the profiler tells me now: ############################################ Profiling results ----------------- Total cpu time observed: 7052ms (out of 7148ms) Number of samples taken: 2812 (once every 3ms) ############################################ Sweet. That helped cut it down from 9 seconds to 7. The node associated to the cpt-table-lookup function doesn't show up in the profile anymore, either. About a 20% performance improvement. Cool! The effect of this optimization isn't so pronounced when Racket is running under its JIT. Under the JIT, my program took 4.3 seconds before the optimization, and 3.9 seconds after the optimization, giving me a 9% improvement. Still, not too bad. I'll send a patch to compiler/zo-parse later. But I wanted to share the importance of the profiler: I would not have easily pinpointed zo-parse without it. ____________________ Racket Users list: http://lists.racket-lang.org/users