> compared to a current CVS version of:
> 5000 generations in 90.813806 seconds. 55.057708 generations/sec
> A total of 32768 bytes were allocated
> A total of 130932 DOD runs were made
> A total of 10930 collection runs were made
> Copying a total of 57801936 bytes
>
> so a 97% decrease in the number of collection runs only turns into an 11%
> improvement in total performance.

This prompted me to set up Parrot for profiling in VTune here on my
machine. What follows is a rather long email, for which I apologize. Much
of it is statistics, which I'm including so that people can draw their own
conclusions. I know I'm not the only one who's profiled Parrot, but I've
never really seen any Parrot profiling results posted to the list. If
anyone has a benchmark they'd like me to run for them, I'm willing to do
that; I think Brent wanted this back before I had profiling set up.

In decreasing order of call count, the most frequently called functions are:
Total Count     Function
40500000        singlebyte_skip_forward
20250000        singlebyte_decode
18000000        Parrot_add_i_i_i
11260238        add_header_to_free
11260073        Parrot_allocate
11260071        new_string_header
11245053        Parrot_string_make
10130030        Parrot_string_length
10130003        Parrot_set_i_ic
10125000        Parrot_substr_s_s_i_ic
10125000        Parrot_string_substr
10125000        Parrot_string_compare
9000000         Parrot_mod_i_i_i
9000000         Parrot_ne_s_sc_ic
2189577         Parrot_branch_ic

(Yes, we perform the extremely simple singlebyte_skip_forward 40 million
times on a 5000-generation life.pasm.)

In order of total time spent in the function ALONE, we have:
Time    Function
3004    runops_fast_core
2182    Parrot_string_substr
2039    singlebyte_skip_forward
2018    Parrot_string_compare
1212    Parrot_string_make
1125    Parrot_add_i_i_i
1088    Parrot_mod_i_i_i
925     new_string_header
904     Parrot_substr_s_s_i_ic
836     Parrot_ne_s_sc_ic
828     add_header_to_free
737     singlebyte_decode
688     Parrot_allocate
557     Parrot_set_i_ic
534     mark_PMCs_unused
453     free_unused_buffers
440     Parrot_string_concat
393     Parrot_string_length
354     mark_buffers_unused
202     Parrot_concat_s_sc


Finally, for total time spent in a function and its called functions, we
have:
Total Time      Function
9469    Parrot_substr_s_s_i_ic
8564    Parrot_string_substr
5842    Parrot_string_make
4145    Parrot_ne_s_sc_ic
3679    new_string_header
3668    Parrot_string_compare
2753    alloc_more_string_headers
2716    Parrot_do_dod_run
2039    singlebyte_skip_forward
1628    Parrot_concat_s_sc
1425    Parrot_string_concat
1281    free_unused_buffers
1125    Parrot_add_i_i_i
1088    Parrot_mod_i_i_i
956     Parrot_allocate
828     add_header_to_free
737     singlebyte_decode
557     Parrot_set_i_ic


While life.pasm normally takes about 7 seconds to execute, performing a
'call graph profiling session' took around 287 seconds. The numbers above
are in milliseconds, and are obviously inflated compared to an unprofiled
execution. The profiler lists the total time as 21873 milliseconds.

Given the amount of string manipulation that life.pasm performs, string
handling is obviously the bottleneck here. It's interesting to note that
GC's DOD and collection runs show up, but don't really rate that high,
which corresponds to Peter's findings.

The most expensive functions, in terms of time spent within the function
itself, are the ones that are called most often, regardless of how simple
they are. I wasn't sure how much of this was due to the profiling overhead
of tracking function calls, and how much was indicative of real
performance problems.

To test this, since this example only uses singlebyte encoding, I
inlined the contents of the encoding functions (singlebyte_skip_forward
and singlebyte_decode) and tried it again.
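For illustration, here's the shape of the two versions in a minimal
sketch, with made-up names rather than Parrot's actual API: decoding
through an encoding function pointer versus the hardcoded singlebyte
access that the compiler can reduce to a single byte load.

```c
#include <assert.h>

typedef long INTVAL;

/* Hypothetical sketch, not Parrot's real API: the generic path
   decodes each character through an encoding function pointer. */
typedef INTVAL (*decode_fn)(const void *p);

static INTVAL singlebyte_decode_fn(const void *p) {
    return (INTVAL)*(const unsigned char *)p;
}

/* Generic path: one indirect call per character. */
static INTVAL decode_indirect(decode_fn decode, const void *p) {
    return decode(p);
}

/* Hardcoded singlebyte-ness: no call at all, just a load. */
#define SINGLEBYTE_DECODE(p) ((INTVAL)*(const unsigned char *)(p))
```

The two return identical results; the difference is purely the call
overhead paid per character in the inner loops.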

Using encodings for string_substr and string_compare:
5000 generations in 7.071001 seconds. 707.113457 generations/sec

Using hardcoded singlebyte-ness for them:
5000 generations in 6.348999 seconds. 787.525716 generations/sec

That's roughly a 10% cost for supporting encodings, which compares well
with the profiler's claim that 12% of the time was spent in those two
encoding functions. So the profile seems fairly accurate about where the
time is going.


A few things immediately come to mind:
a) with the current encoding system, we're guaranteed to be slower than we
would be without it. If we expect Parrot to be as fast as Perl 5 while
paying that cost, we're deluding ourselves.
b) the results will probably be worse in more string-heavy applications,
such as regexes, etc.
c) there's got to be a way to have multiple encodings without sacrificing
that much speed

One approach that came to mind was macros (gah! no! erg!). An equivalent
system would be preprocessing of .c files to produce .c files for the
compiler, via some perl script.

Imagine if we put substring, compare, etc., in the encodings vtable. This
would result in code duplication, since much of the logic would be
repeated per encoding, but it would allow the compiler to heavily optimize
things for fixed-width encodings such as singlebyte and/or utf32. We could
put special tags in our string.c that allow it to be preprocessed into
plugin stubs for the various decode, skip_forward, etc. operations that
differ for each encoding.
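A rough sketch of what such a vtable might look like (the struct layout
and names here are my invention, not Parrot's actual internals): each
encoding carries fully-specialized string operations, so the singlebyte
compare collapses into plain array indexing.

```c
#include <assert.h>

typedef long INTVAL;

/* Hypothetical per-encoding vtable holding whole string
   operations, not just decode/skip primitives. */
typedef struct encoding_vtable {
    const char *name;
    INTVAL (*compare)(const char *a, const char *b, INTVAL chars);
} ENCODING_VTABLE;

/* Singlebyte compare specialized at compile time: the decode and
   skip_forward steps reduce to plain indexing, no indirection. */
static INTVAL singlebyte_compare(const char *a, const char *b, INTVAL chars) {
    INTVAL i;
    for (i = 0; i < chars; i++) {
        INTVAL cmp = (INTVAL)(unsigned char)a[i]
                   - (INTVAL)(unsigned char)b[i];
        if (cmp != 0)
            return cmp;
    }
    return 0;
}

static const ENCODING_VTABLE singlebyte_encoding = {
    "singlebyte", singlebyte_compare
};
```

The caller pays one vtable lookup per string operation, instead of one
indirection per character.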

A macro implementation would be something like:

#define ENCODING singlebyte
#define DECODE(a) (*(const byte_t *)(a))
#define SKIP_FORWARD(a,b) ((const byte_t *)(a) + (b))
#include "string.c"
#undef ENCODING
#undef DECODE
#undef SKIP_FORWARD
#define ENCODING utf32 .....

where string.c is something like (note that ## only works inside a macro
body, and needs a level of indirection so ENCODING gets expanded first):

#define PASTE_(a,b) a##b
#define PASTE(a,b) PASTE_(a,b)

void PASTE(ENCODING, _string_compare)(..) {
  while( ... ) {
    INTVAL c1 = DECODE(s1start);
    INTVAL c2 = DECODE(s2start);
    cmp = c1 - c2;
    s1start = SKIP_FORWARD(s1start,1);
    s2start = SKIP_FORWARD(s2start,1);
  }
}

That would compile down into optimized code. Then, when we perform a
comparison, we'd simply delegate to the string's vtable. This would
require transcoding for dual-string functions like compare (ugh), but
would be rather simple for stuff like substr. Alternately, we could build
a 2D array indexed by (encoding1, encoding2) and dispatch into that table
appropriately. That makes the number of compiled string functions grow as
N^2, where N is the number of encodings, however. (More multidispatch in
disguise! ;)
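The 2D dispatch idea might be sketched like this (encoding ids and names
invented for illustration): one specialized function per encoding pair,
selected by a single table lookup.

```c
#include <assert.h>

typedef long INTVAL;

/* Hypothetical encoding ids; real Parrot would have more. */
enum { ENC_SINGLEBYTE, ENC_UTF32, ENC_MAX };

typedef INTVAL (*compare_fn)(const void *a, const void *b, INTVAL chars);

/* The singlebyte/singlebyte specialization. */
static INTVAL compare_sb_sb(const void *a, const void *b, INTVAL chars) {
    const unsigned char *pa = a, *pb = b;
    INTVAL i;
    for (i = 0; i < chars; i++)
        if (pa[i] != pb[i])
            return (INTVAL)pa[i] - (INTVAL)pb[i];
    return 0;
}

/* N*N table: one compiled compare per (encoding1, encoding2) pair. */
static compare_fn compare_table[ENC_MAX][ENC_MAX] = {
    { compare_sb_sb, 0 /* sb/utf32 stub, etc. */ },
    { 0, 0 },
};

static INTVAL string_compare_dispatch(int e1, int e2,
                                      const void *a, const void *b,
                                      INTVAL chars) {
    return compare_table[e1][e2](a, b, chars);
}
```

One lookup replaces per-character indirection, at the cost of N^2
compiled variants.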


Re-running the profiling with the encoding stuff inlined gives the
following. It would probably be rather close, speed-wise, to a version of
Parrot where all the string functions were moved into the encoding
vtables: there'd be one method-table lookup to get the string function
instead of a plain static function call, but since the encoding operations
would be inlined into each function rather than reached through multiple
indirections per character, there should still be a definite speedup.

Total Count     Function
18000000        Parrot_add_i_i_i
11260238        add_header_to_free
11260073        Parrot_allocate
11260071        new_string_header
11245053        Parrot_string_make
10130030        Parrot_string_length
10130003        Parrot_set_i_ic
10125000        Parrot_string_substr
10125000        Parrot_substr_s_s_i_ic
10125000        Parrot_string_compare
9000000         Parrot_ne_s_sc_ic
9000000         Parrot_mod_i_i_i
2189577         Parrot_branch_ic
2094880         buffer_lives
1490680         Parrot_inc_i
1125015         Parrot_string_concat
1125000         Parrot_lt_i_i_ic
1125000         Parrot_concat_s_sc
1125000         Parrot_eq_s_sc_ic
1079915         Parrot_eq_i_ic_ic
261862          mark_used

Time    Function
1758    runops_fast_core
1204    Parrot_string_substr
735     Parrot_mod_i_i_i
706     Parrot_string_make
660     Parrot_add_i_i_i
560     new_string_header
558     Parrot_substr_s_s_i_ic
508     Parrot_string_compare
469     Parrot_ne_s_sc_ic
468     Parrot_allocate
444     add_header_to_free
379     mark_PMCs_unused
379     Parrot_string_concat
316     mark_buffers_unused
300     Parrot_set_i_ic
215     free_unused_buffers

Total Time      Function
4985    Parrot_substr_s_s_i_ic
4426    Parrot_string_substr
3612    Parrot_string_make
2252    new_string_header
1691    alloc_more_string_headers
1673    Parrot_do_dod_run
1046    Parrot_concat_s_sc
949     Parrot_string_concat
932     Parrot_ne_s_sc_ic
735     Parrot_mod_i_i_i
660     Parrot_add_i_i_i
659     free_unused_buffers
657     Parrot_allocate
508     Parrot_string_compare
444     add_header_to_free
379     mark_PMCs_unused
316     mark_buffers_unused
300     Parrot_set_i_ic
188     Parrot_go_collect


Re-interpreting these results, the next place to fix up would be substr
itself. It's likely slow due to the various if statements in there for
protection and for features (e.g., negative offsets). I suspect Peter's
COW stuff would help eliminate some of the copying overhead within that
function, although I can't be sure.
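To make the COW idea concrete, a substr that shares the parent's buffer
might look roughly like this (the header layout and names are made up for
this sketch, not Peter's actual patch or Parrot's real STRING struct):

```c
#include <assert.h>
#include <string.h>

typedef long INTVAL;

/* Hypothetical string header; real Parrot STRINGs differ. */
typedef struct {
    const char *buf;  /* shared character data */
    INTVAL offset;    /* start position within buf */
    INTVAL len;       /* length in characters */
} Str;

/* Substring without copying: keep the guard code for negative
   offsets and clamping, but point into the parent's buffer. */
static Str cow_substr(const Str *s, INTVAL off, INTVAL len) {
    Str r;
    if (off < 0)
        off += s->len;            /* negative offset counts from end */
    if (off < 0)
        off = 0;
    if (off > s->len)
        off = s->len;
    if (len < 0 || off + len > s->len)
        len = s->len - off;       /* clamp to the parent string */
    r.buf = s->buf;               /* no memcpy of the data */
    r.offset = s->offset + off;
    r.len = len;
    return r;
}
```

The guard branches remain, but the per-character copy (and the fresh
buffer allocation behind it) disappears.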

Beyond that, very simple functions like add_i_i_i, etc., are slow, which
is something we can't do much about directly. The two solutions are
computed goto (I don't know how to run VTune with GCC, if that's even
possible) and JIT, both of which eliminate the call overhead for all these
smallish functions. I could probably apply the JIT patch which supports
Win32 and reprofile it. I'm not sure if/how well that would work, though.
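For the curious, the computed-goto approach looks something like this toy
dispatch loop (using GCC's labels-as-values extension; the ops here are
invented, and Parrot's real core is far more involved): each op jumps
directly to the next op's code, with no call/return per op.

```c
#include <assert.h>

enum { OP_INC, OP_ADD, OP_HALT };

/* Toy bytecode interpreter with direct-threaded dispatch.
   Requires GCC/Clang (labels-as-values is not standard C). */
static long run(const int *pc) {
    static void *labels[] = { &&do_inc, &&do_add, &&do_halt };
    long acc = 0;

    goto *labels[*pc];          /* dispatch the first op */

do_inc:                         /* OP_INC: acc += 1 */
    acc += 1;
    pc += 1;
    goto *labels[*pc];

do_add:                         /* OP_ADD n: acc += n */
    acc += pc[1];
    pc += 2;
    goto *labels[*pc];

do_halt:                        /* OP_HALT: done */
    return acc;
}
```

Compared to a function-per-op core, the per-op cost drops to one indirect
jump, which is exactly the overhead that dominates for tiny ops like add.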

Thoughts?
Mike Lambert

