On Tue, Dec 28, 2010 at 7:31 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Fri, Dec 24, 2010 at 3:40 PM, Stefan Fuhrmann <eq...@web.de> wrote:
>> On 20.12.2010 02:43, Johan Corveleyn wrote:
>>>
>>> On Wed, Dec 15, 2010 at 10:58 AM, Stefan Fuhrmann <eq...@web.de> wrote:
>>>>
>>>> On 15.12.2010 02:30, Stefan Fuhrmann wrote:
>>>>>
>>>>> On 14.12.2010 23:35, Johan Corveleyn wrote:
>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Two things you might try:
>>>>> * introduce a local variable for afile[i]
>>>>> * replace the for() loop with two nested ones, keeping
>>>>>   calls to other functions out of the hot spot:
>>>>>
>>>>> for (i = 0; i < file_len;)
>>>>
>>>> That should read:
>>>> for (i = 0; i < file_len; i++)
>>>>>
>>>>>   {
>>>>>     /* hot spot: */
>>>>>     for (; i < file_len; i++)
>>>>>       {
>>>>>         curFile = afile[i];
>>>>>         if (curFile->chunk == -1)
>>>>>           curFile->chunk = 0;
>>>>>         else if (curFile->curp != curFile->endp - 1)
>>>>>           curFile->curp++;
>>>>>         else
>>>>>           break;
>>>>>       }
>>>>>
>>>>>     if (i < file_len)
>>>>>       {
>>>>>         /* the complex, rarely used stuff goes here */
>>>>>       }
>>>>>   }
>>>
>>> Ok, I tried this, but it didn't really help. It's still only about as
>>> fast as before, while the macro version is about 10% faster (I made a
>>> cleaner macro version, only macro'ing the hotspot stuff, with a
>>> function call to the complex, rarely used stuff).
>>>
>>> For the inline version, I tried several variations (always with
>>> APR_INLINE in the function signature, and with a local variable for
>>> file[i]): with or without the nested for loop, with or without the
>>> complex stuff factored out to a separate function, ... it made no
>>> difference.
>>>
>>> Maybe I'm still doing something wrong, keeping the compiler from
>>> inlining it (but I'm not really a compiler expert, maybe I have to add
>>> some compiler options or something, ...). OTOH, I now have a macro
>>> implementation which is quite readable (and which uses the do ...
>>> while(0) construct - thanks Peter), and which does the trick. So maybe
>>> I should simply stick with that option ...
>>
>> The key factor here is that file_len is only 2.
>> Basically, that will make my optimization a
>> pessimization.
>>
>> Assuming that for most calls, curp of *both*
>> files will be somewhere *between* BOF and
>> EOF, unrolling the loop may help:
>>
>> #define INCREMENT_POINTERS
>>
>> /* special code for the common case. 10 .. 12 ticks per execution */
>>
>> static APR_INLINE svn_boolean_t
>> quickly_increment_pointers(struct file_info *afile[])
>> {
>>   struct file_info *afile0 = afile[0];
>>   struct file_info *afile1 = afile[1];
>>   if ((afile0->chunk != -1) && (afile1->chunk != -1))
>>     {
>>       /* suitable_type */ nextp0 = afile0->curp + 1;
>>       /* suitable_type */ nextp1 = afile1->curp + 1;
>>       if ((nextp0 != afile0->endp) && (nextp1 != afile1->endp))
>>         {
>>           afile0->curp = nextp0;
>>           afile1->curp = nextp1;
>>           return TRUE;
>>         }
>>     }
>>   return FALSE;
>> }
>>
>> /* slow code covering all special cases */
>>
>> static svn_error_t *
>> slowly_increment_pointers(struct file_info *file[], apr_size_t file_len,
>>                           apr_pool_t *pool)
>> {
>>   for (i = 0; i < file_len; i++)
>>     ...
>> }
>>
>> /* maybe as macro: */
>> return ((file_len == 2) && quickly_increment_pointers(file))
>>        ? SVN_NO_ERROR
>>        : slowly_increment_pointers(file, file_len, pool);
>
> Nice!
> I will try this out sometime soon, but I'm not so sure it will
> help more than the current macro version (only macro for the critical
> section, calling a function for increment/decrement chunk - see
> diff_file.c in HEAD of diff-optimizations-bytes branch). I guess I'll
> just have to test it.
>
> With the macro implementation in place, my gut feeling is that the
> prefix/suffix scanning is reasonably optimal, and that the remaining
> time spent in the diff code (70 seconds for my example blame of a big
> xml file with 2200 changes) is almost all due to the "original" diff
> algorithm (working on the remainder between prefix and suffix). But to
> be sure I should probably measure that first.
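For reference, the shape of that macro version is roughly the following. This is an illustrative sketch only: the macro name, the increment_chunk() helper and the parameter names are made up here for the example, and it assumes the struct file_info, apr types and SVN_ERR() that diff_file.c already has; the real code is on the branch.

  /* Sketch: hot path inline in the macro; the rare chunk switch is
   * delegated to a separate helper so the common case stays cheap.
   * (Hypothetical helper, analogous for the decrement direction.) */
  static svn_error_t *
  increment_chunk(struct file_info *file, apr_pool_t *pool);

  #define INCREMENT_POINTERS(all_files, files_len, the_pool)               \
    do {                                                                    \
      apr_size_t i_;                                                        \
                                                                            \
      for (i_ = 0; i_ < (files_len); i_++)                                  \
        {                                                                   \
          struct file_info *f_ = (all_files)[i_];                           \
                                                                            \
          if (f_->chunk == -1)                /* not in a chunk yet */      \
            f_->chunk = 0;                                                  \
          else if (f_->curp != f_->endp - 1)  /* hot path: step pointer */  \
            f_->curp++;                                                     \
          else                                /* rare: move to next chunk */\
            SVN_ERR(increment_chunk(f_, (the_pool)));                       \
        }                                                                   \
    } while (0)

Because SVN_ERR() can return, the macro only works inside functions that return svn_error_t *, and the do ... while(0) wrapper is what keeps it safe to use as a single statement (e.g. in an unbraced if/else).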
Hmmm, my gut feeling seems to be a little bit off. I measured this, and prefix/suffix scanning still takes 46-50 seconds of the total of 70 seconds spent in "diff-time" during the blame of my example file (~20 seconds go to "getting tokens and inserting them into the token tree", and ~5 seconds to the actual LCS algorithm). So optimizations here will probably still be very useful.

I'm going to let this rest for a while, until I get the prefix/suffix scanning to work for diff3 and diff4. After I get that all working, I might revisit this for more optimization ...

Cheers,
-- 
Johan