> compared to a current CVS version of:
> 5000 generations in 90.813806 seconds. 55.057708 generations/sec
> A total of 32768 bytes were allocated
> A total of 130932 DOD runs were made
> A total of 10930 collection runs were made
> Copying a total of 57801936 bytes
>
> so a 97% decrease in the number of collection runs only turns into an 11%
> improvement in total performance.
This prompted me to set up Parrot for profiling in VTune here on my machine. What follows is a rather longish email, for which I apologize. Much of the length is statistics, which I'm including to allow people to draw their own conclusions. I know I'm not the only one that's profiled Parrot, but I've never really seen any results of Parrot profiling on the list. If anyone has a benchmark they'd like me to run for them, I'm willing to do that. I think Brent might have wanted this, back before I had set this up.

In decreasing order of frequency, the more popular functions:

Total Count  Function
   40500000  singlebyte_skip_forward
   20250000  singlebyte_decode
   18000000  Parrot_add_i_i_i
   11260238  add_header_to_free
   11260073  Parrot_allocate
   11260071  new_string_header
   11245053  Parrot_string_make
   10130030  Parrot_string_length
   10130003  Parrot_set_i_ic
   10125000  Parrot_substr_s_s_i_ic
   10125000  Parrot_string_substr
   10125000  Parrot_string_compare
    9000000  Parrot_mod_i_i_i
    9000000  Parrot_ne_s_sc_ic
    2189577  Parrot_branch_ic

(yes, we perform the extremely simple singlebyte_skip_forward 40 million times on a 5000-generation life.pasm)

In order of total time spent in the function ALONE, we have:

Time  Function
3004  runops_fast_core
2182  Parrot_string_substr
2039  singlebyte_skip_forward
2018  Parrot_string_compare
1212  Parrot_string_make
1125  Parrot_add_i_i_i
1088  Parrot_mod_i_i_i
 925  new_string_header
 904  Parrot_substr_s_s_i_ic
 836  Parrot_ne_s_sc_ic
 828  add_header_to_free
 737  singlebyte_decode
 688  Parrot_allocate
 557  Parrot_set_i_ic
 534  mark_PMCs_unused
 453  free_unused_buffers
 440  Parrot_string_concat
 393  Parrot_string_length
 354  mark_buffers_unused
 202  Parrot_concat_s_sc

Finally, for total time spent in a function and its called functions, we have:

Total Time  Function
      9469  Parrot_substr_s_s_i_ic
      8564  Parrot_string_substr
      5842  Parrot_string_make
      4145  Parrot_ne_s_sc_ic
      3679  new_string_header
      3668  Parrot_string_compare
      2753  alloc_more_string_headers
      2716  Parrot_do_dod_run
      2039  singlebyte_skip_forward
      1628  Parrot_concat_s_sc
      1425  Parrot_string_concat
      1281  free_unused_buffers
      1125  Parrot_add_i_i_i
      1088  Parrot_mod_i_i_i
       956  Parrot_allocate
       828  add_header_to_free
       737  singlebyte_decode
       557  Parrot_set_i_ic

While life normally took 7 seconds to execute, performing a 'call graph profiling session' took around 287 seconds. The numbers you see above are in milliseconds, and are obviously inflated compared to an unprofiled execution. The profiler lists the total time as 21873 milliseconds.

Given the amount of string manipulation that life.pasm performs, string handling is obviously the bottleneck here. It's interesting to note that GC's DOD and collection show up, but don't really rate that high, which corresponds to Peter's findings. The most expensive functions, in terms of time spent within the function itself, are the ones that are called the most often, regardless of how simple they are. I wasn't sure how much of this is due to profiling overhead involved in tracking function calls, and how much is indicative of real performance problems. To test this, since this example only uses singlebyte encoding, I inlined the contents of the encoding functions (singlebyte_skip_forward and singlebyte_decode) and tried it again.

Using encodings for string_substr and string_compare:
5000 generations in 7.071001 seconds. 707.113457 generations/sec

Using hardcoded singlebyte-ness for them:
5000 generations in 6.348999 seconds. 787.525716 generations/sec

That's roughly a 10% cost due to the need to support encodings, compared to the profiler's listing of 12% of the time being spent in those two encoding functions. So the profiler seems pretty accurate, as far as where time is going. A few things immediately come to mind:

a) with the current encoding system, we're guaranteed to be slower than without it. If we want Parrot to be as fast as Perl 5, we're deluding ourselves.

b) the results will probably be worse in more string-heavy applications, such as regexes, etc.
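The cost being measured here can be illustrated with a toy comparison: scanning a buffer through a per-character function pointer (standing in for the encoding vtable) versus a hardcoded byte access. The names below are hypothetical stand-ins, not Parrot's actual API; it's just a sketch of why the indirection costs anything at all.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte_t;

/* Hypothetical stand-in for an encoding vtable's decode entry. */
static unsigned long decode_indirect(const byte_t *p) { return *p; }

typedef unsigned long (*decode_fn)(const byte_t *);

/* Sum the characters via a function pointer: one indirect call per char,
 * which the compiler cannot inline away. */
static unsigned long sum_via_vtable(const byte_t *s, size_t len, decode_fn decode)
{
    unsigned long total = 0;
    for (size_t i = 0; i < len; i++)
        total += decode(s + i);
    return total;
}

/* Same loop with singlebyte-ness hardcoded: everything stays inline. */
static unsigned long sum_inlined(const byte_t *s, size_t len)
{
    unsigned long total = 0;
    for (size_t i = 0; i < len; i++)
        total += s[i];
    return total;
}
```

Both produce identical results; the difference is purely the per-character call overhead, which is the 10-12% measured above.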
c) there's got to be a way to have multiple encodings without sacrificing that much speed.

One approach that came to mind was macros (gah! no! erg!). An equivalent system would be preprocessing the .c files into generated .c files for the compiler, via some Perl script. Imagine if we put substring, compare, etc. in the encodings vtable. This would duplicate much of the logic, but would allow the compiler to heavily optimize things for fixed-width encodings such as singlebyte and/or utf32. We could put special tags in our string.c that allow it to be preprocessed into plugin stubs for the various decode, skip_forward, etc. functions that differ for each encoding. A macro implementation would be something like:

  #define ENCODING singlebyte
  #define DECODE(a)         (*(const byte_t *)(a))
  #define SKIP_FORWARD(a,b) ((const byte_t *)(a) + (b))
  #include "string.c"

  #undef ENCODING
  #undef DECODE
  #undef SKIP_FORWARD
  #define ENCODING utf32
  .....

where string.c is something like:

  /* Token pasting needs one level of indirection so ENCODING expands first. */
  #define PASTE2(a,b) a##b
  #define PASTE(a,b)  PASTE2(a,b)

  void PASTE(ENCODING, _string_compare)(..)
  {
      while ( ... ) {
          INTVAL c1 = DECODE(s1start);
          INTVAL c2 = DECODE(s2start);
          cmp = c1 - c2;
          s1start = SKIP_FORWARD(s1start, 1);
          s2start = SKIP_FORWARD(s2start, 1);
      }
  }

which would compile down into optimized code. Then, when we perform a comparison, we'd simply delegate to the string's vtable. This would require transcoding for dual-string functions like compare (ugh), but would be rather simple for stuff like substr. Alternately, we could make a 2d array indexed by (encoding1, encoding2), and dispatch into that table appropriately. This causes the number of string function compilations to grow as N^2 in N, the number of encodings, however. (More multidispatch in disguise! ;)

Re-running the profiling with the encoding stuff inlined gives the following. It would probably be rather close, speed-wise, to a version of Parrot where all the string functions were put into the encodings vtable: there'd be one method table lookup to get the string function, instead of the simple static function call.
But since the contents could be inlined, instead of performing multiple indirections to the encodings vtable, there should still be a definite speedup.

Total Count  Function
   18000000  Parrot_add_i_i_i
   11260238  add_header_to_free
   11260073  Parrot_allocate
   11260071  new_string_header
   11245053  Parrot_string_make
   10130030  Parrot_string_length
   10130003  Parrot_set_i_ic
   10125000  Parrot_string_substr
   10125000  Parrot_substr_s_s_i_ic
   10125000  Parrot_string_compare
    9000000  Parrot_ne_s_sc_ic
    9000000  Parrot_mod_i_i_i
    2189577  Parrot_branch_ic
    2094880  buffer_lives
    1490680  Parrot_inc_i
    1125015  Parrot_string_concat
    1125000  Parrot_lt_i_i_ic
    1125000  Parrot_concat_s_sc
    1125000  Parrot_eq_s_sc_ic
    1079915  Parrot_eq_i_ic_ic
     261862  mark_used

Time  Function
1758  runops_fast_core
1204  Parrot_string_substr
 735  Parrot_mod_i_i_i
 706  Parrot_string_make
 660  Parrot_add_i_i_i
 560  new_string_header
 558  Parrot_substr_s_s_i_ic
 508  Parrot_string_compare
 469  Parrot_ne_s_sc_ic
 468  Parrot_allocate
 444  add_header_to_free
 379  mark_PMCs_unused
 379  Parrot_string_concat
 316  mark_buffers_unused
 300  Parrot_set_i_ic
 215  free_unused_buffers

Total Time  Function
      4985  Parrot_substr_s_s_i_ic
      4426  Parrot_string_substr
      3612  Parrot_string_make
      2252  new_string_header
      1691  alloc_more_string_headers
      1673  Parrot_do_dod_run
      1046  Parrot_concat_s_sc
       949  Parrot_string_concat
       932  Parrot_ne_s_sc_ic
       735  Parrot_mod_i_i_i
       660  Parrot_add_i_i_i
       659  free_unused_buffers
       657  Parrot_allocate
       508  Parrot_string_compare
       444  add_header_to_free
       379  mark_PMCs_unused
       316  mark_buffers_unused
       300  Parrot_set_i_ic
       188  Parrot_go_collect

Re-interpreting these results, the next place to fix up would be substr itself. It's likely slow due to the various if statements in there for protection and for features (i.e., negative offsets, etc.). I'm sure Peter's COW stuff would help eliminate some of the copying overhead within that function, although I can't be sure.
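The 2d-array dispatch idea mentioned above could look something like the sketch below. Everything here (the enum, the function names, the placeholder bodies that just report which specialization ran) is hypothetical, purely to show the shape of the N^2 table; real entries would be the per-encoding-pair specializations generated by the preprocessing trick.

```c
#include <assert.h>

typedef enum { ENC_SINGLEBYTE, ENC_UTF32, NUM_ENCODINGS } encoding_t;

typedef int (*compare_fn)(const void *s1, const void *s2);

/* Placeholder specializations, one per (encoding1, encoding2) pair.
 * Each just returns a distinct value so we can see which one dispatched. */
static int sb_sb_compare(const void *a, const void *b)   { (void)a; (void)b; return 11; }
static int sb_u32_compare(const void *a, const void *b)  { (void)a; (void)b; return 12; }
static int u32_sb_compare(const void *a, const void *b)  { (void)a; (void)b; return 21; }
static int u32_u32_compare(const void *a, const void *b) { (void)a; (void)b; return 22; }

/* N x N dispatch table: the number of compiled specializations grows
 * as N^2 in the number of encodings. */
static compare_fn compare_table[NUM_ENCODINGS][NUM_ENCODINGS] = {
    [ENC_SINGLEBYTE][ENC_SINGLEBYTE] = sb_sb_compare,
    [ENC_SINGLEBYTE][ENC_UTF32]      = sb_u32_compare,
    [ENC_UTF32][ENC_SINGLEBYTE]      = u32_sb_compare,
    [ENC_UTF32][ENC_UTF32]           = u32_u32_compare,
};

/* One table lookup selects a specialization; no transcoding needed. */
static int string_compare(encoding_t e1, const void *s1,
                          encoding_t e2, const void *s2)
{
    return compare_table[e1][e2](s1, s2);
}
```

The tradeoff is exactly the one noted above: the per-call transcoding for mixed-encoding comparisons disappears, at the price of N^2 compiled variants.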
Beyond that, very simple functions like add_i_i_i, etc. are slow, which is something we can't do much about within the current dispatch. The two solutions are computed goto (I don't know how to run VTune with GCC, if that's even possible) and JIT, both of which eliminate the call overhead for all these smallish functions. I could probably apply the JIT patch which supports Win32 and reprofile it. I'm not sure if/how well that'd work, though.

Thoughts?

Mike Lambert