On Tue, Dec 28, 2010 at 7:31 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Fri, Dec 24, 2010 at 3:40 PM, Stefan Fuhrmann <eq...@web.de> wrote:
>> On 20.12.2010 02:43, Johan Corveleyn wrote:
>>>
>>> On Wed, Dec 15, 2010 at 10:58 AM, Stefan Fuhrmann <eq...@web.de> wrote:
>>>>
>>>> On 15.12.2010 02:30, Stefan Fuhrmann wrote:
>>>>>
>>>>> On 14.12.2010 23:35, Johan Corveleyn wrote:
>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Two things you might try:
>>>>> * introduce a local variable for afile[i]
>>>>> * replace the for() loop with two nested ones, keeping
>>>>>   calls to other functions out of the hot spot:
>>>>>
>>>>> for (i = 0; i < file_len;)
>>>>
>>>> That should read:
>>>> for (i = 0; i < file_len; i++)
>>>>>
>>>>>   {
>>>>>     /* hot spot: */
>>>>>     for (; i < file_len; i++)
>>>>>       {
>>>>>         curFile = afile[i];
>>>>>         if (curFile->chunk == -1)
>>>>>           curFile->chunk = 0;
>>>>>         else if (curFile->curp != curFile->endp - 1)
>>>>>           curFile->curp++;
>>>>>         else
>>>>>           break;
>>>>>       }
>>>>>
>>>>>     if (i < file_len)
>>>>>       {
>>>>>         /* the complex, rarely used stuff goes here */
>>>>>       }
>>>>>   }
>>>
>>> Ok, I tried this, but it didn't really help. It's still only about as
>>> fast as before, while the macro version is about 10% faster (I made a
>>> cleaner macro version, only macro'ing the hotspot stuff, with a
>>> function call to the complex, rarely used stuff).
>>>
>>> For the inline version, I tried several variations (always with
>>> APR_INLINE in the function signature, and with a local variable for
>>> file[i]): with or without the nested for loop, with or without the
>>> complex stuff factored out to a separate function, ... it made no
>>> difference.
>>>
>>> Maybe I'm still doing something wrong, keeping the compiler from
>>> inlining it (but I'm not really a compiler expert, maybe I have to add
>>> some compiler options or something, ...). OTOH, I now have a macro
>>> implementation which is quite readable (and which uses the do ...
>>> while(0) construct - thanks Peter), and which does the trick. So maybe
>>> I should simply stick with that option ...
>>
>> The key factor here is that file_len is only 2.
>> Basically, that will make my optimization a
>> pessimization.
>>
>> Assuming that for most calls, curp of *both*
>> files will be somewhere *between* BOF and
>> EOF, unrolling the loop may help:
>>
>> #define INCREMENT_POINTERS
>>
>> /* special code for the common case. 10 .. 12 ticks per execution */
>>
>> static APR_INLINE svn_boolean_t
>> quickly_increment_pointers(struct file_info *afile[])
>> {
>>   struct file_info *afile0 = afile[0];
>>   struct file_info *afile1 = afile[1];
>>   if ((afile0->chunk != -1) && (afile1->chunk != -1))
>>     {
>>       /* suitable_type */ nextp0 = afile0->curp + 1;
>>       /* suitable_type */ nextp1 = afile1->curp + 1;
>>       if ((nextp0 != afile0->endp) && (nextp1 != afile1->endp))
>>         {
>>           afile0->curp = nextp0;
>>           afile1->curp = nextp1;
>>           return TRUE;
>>         }
>>     }
>>   return FALSE;
>> }
>>
>> /* slow code covering all special cases */
>>
>> static svn_error_t *
>> slowly_increment_pointers(struct file_info *file[], apr_size_t file_len,
>>                           apr_pool_t *pool)
>> {
>>   for (i = 0; i < file_len; i++)
>>     ...
>> }
>>
>> /* maybe as macro: */
>> return ((file_len == 2) && quickly_increment_pointers(file))
>>        ? SVN_NO_ERROR
>>        : slowly_increment_pointers(file, file_len, pool);
>
> Nice!
> I will try this out sometime soon, but I'm not so sure it will
> help more than the current macro version (only macro for the critical
> section, calling a function for increment/decrement chunk - see
> diff_file.c in HEAD of diff-optimizations-bytes branch). I guess I'll
> just have to test it.
>
> With the macro implementation in place, my gut feeling is that the
> prefix/suffix scanning is reasonably optimal, and that the remaining
> time spent in the diff code (70 seconds for my example blame of a big
> xml file with 2200 changes) is almost all due to the "original" diff
> algorithm (working on the remainder between prefix and suffix). But to
> be sure I should probably measure that first.
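For reference, the shape of that macro version is roughly the following. This is an illustrative sketch only: the macro name, the increment_chunk() helper and the parameter names are made up here for the example, and it assumes the struct file_info, apr types and SVN_ERR() that diff_file.c already has; the real code is on the branch.

  /* Sketch: hot path inline in the macro; the rare chunk switch is
   * delegated to a separate helper so the common case stays cheap.
   * (Hypothetical helper, analogous for the decrement direction.) */
  static svn_error_t *
  increment_chunk(struct file_info *file, apr_pool_t *pool);

  #define INCREMENT_POINTERS(all_files, files_len, the_pool)               \
    do {                                                                    \
      apr_size_t i_;                                                        \
                                                                            \
      for (i_ = 0; i_ < (files_len); i_++)                                  \
        {                                                                   \
          struct file_info *f_ = (all_files)[i_];                           \
                                                                            \
          if (f_->chunk == -1)                /* not in a chunk yet */      \
            f_->chunk = 0;                                                  \
          else if (f_->curp != f_->endp - 1)  /* hot path: step pointer */  \
            f_->curp++;                                                     \
          else                                /* rare: move to next chunk */\
            SVN_ERR(increment_chunk(f_, (the_pool)));                       \
        }                                                                   \
    } while (0)

Because SVN_ERR() can return, the macro only works inside functions that return svn_error_t *, and the do ... while(0) wrapper is what keeps it safe to use as a single statement (e.g. in an unbraced if/else).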
Hmmm, my gut feeling seems to be a little bit off. I measured this, and prefix/suffix scanning still takes 46-50 seconds of the total of 70 seconds spent in "diff-time" during the blame of my example file (~20 seconds go to "getting tokens and inserting them into the token tree", and ~5 seconds to the actual LCS algorithm). So optimizations here will probably still be very useful.

I'm going to let this rest for a while, until I get the prefix/suffix scanning to work for diff3 and diff4. After I get that all working, I might revisit this for more optimization ...

Cheers,
-- 
Johan