Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Julian Foad Tue, 12 Oct 2010 06:06:24 -0700

On Tue, 2010-10-12, Johan Corveleyn wrote:
> On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <julian.f...@wandisco.com> 
> wrote:
> > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.f...@wandisco.com> 
> >> wrote:
> >> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.f...@wandisco.com> 
> >> >> wrote:
> >> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> >> > sub-structures, and simplifies all the code that uses them.
> >> > [...]
> >> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> >> So please go ahead and commit the refactoring.
> >> >
> >> > OK, committed in r1021282.
> >>
> >> Thanks, looks much more manageable now.
> >
> > I'd like to see a simplified version of your last patch, taking
> > advantage of that, before you go exploring other options.
> 
> Yes, I'll certainly do that in the coming days.
> 
> >> >> Also, maybe the last_chunk number could be included in the file_info
> >> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> >> every time as an extra argument. Seems like the sort of info that
> >> >> could be part of file_info.
> >> >
> >> > It looks to me like you only need to calculate it in exactly one place:
> >> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> >> > for efficiency, as it's a trivial calculation.
> >>
> >> Hm, I don't know. That's recalculating it for every byte in the
> >> prefix. In the files I'm testing there could be 1 Mb or more of
> >
> > No, no, not every byte - only 1 in 131072 of the calls need to calculate
> > it - only when it reaches the end of a block.
> 
> Oh right. I guess it's ok then.
> 
> > May I ask whether you've tested the code that crosses from one chunk to
> > another?  I noticed the following, just now:
> >
> >> +      *at_least_one_end_reached =
> >> +        file_baton->curp[idx0] == file_baton->endp[idx0]
> >> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
> >> +    }
> >> +
> >> +  /* If both files reached their end (i.e. are fully identical),
> >> we're done */
> >> +  if (file_baton->curp[idx0] == file_baton->endp[idx0]
> >> +        && file_baton->curp[idx1] == file_baton->endp[idx1])
> >
> > Those tests seem to be saying "if we're at the end of the current chunk
> > then we're at the end of the file".  That looks wrong.  What do you
> > think?
> 
> Well, it does work :-). The code is correct, but it's kind of
> difficult to see, so that's my fault.
> 
> The reason is that increment_pointer_or_chunk takes care that curp
> never equals endp (if curp==endp-1 it reads the next chunk), unless it
> reaches end of file. See this snippet of increment_pointer_or_chunk:
> +      if (*chunk_number == last_chunk_number)
> +        (*curp)++; /* *curp == *endp with last chunk signals end of file */
> 
> If you think this is too obscure to handle this, I agree, but I
> couldn't think of a better/easier way at that time. If you have any
> suggestions, feel free :-). But maybe I should refactor the patch
> first, so it's easier to see how this would work out best.
> 
> The same "obscurity" also applies to reaching "beginning of file"
> (which can happen in decrement_pointer_or_chunk, either during prefix
> scanning (rolling back the first line) or during suffix scanning (if
> there was no prefix, but still a difference in the first line)). There
> it's done by setting the chunk number to -1, which is checked for in
> find_identical_prefix and find_identical_suffix:
> +      if (*chunk_number == 0)
> +        (*chunk_number)--; /* *chunk_number == -1 signals beginning of file 
> */
> 
> Not ideal, but it works ...


Oh OK, thanks for explaining.  No further comment, m'lord :-)

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Reply via email to