On Tue, 2010-10-12, Johan Corveleyn wrote: > On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <julian.f...@wandisco.com> > wrote: > > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote: > >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.f...@wandisco.com> > >> wrote: > >> > On Sat, 2010-10-09, Johan Corveleyn wrote: > >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.f...@wandisco.com> > >> >> wrote: > >> >> > So I wrote a patch - attached - that refactors this into an array of 4 > >> >> > sub-structures, and simplifies all the code that uses them. > >> > [...] > >> >> Yes, great idea! That would indeed vastly simplify a lot of the code. > >> >> So please go ahead and commit the refactoring. > >> > > >> > OK, committed in r1021282. > >> > >> Thanks, looks much more manageable now. > > > > I'd like to see a simplified version of your last patch, taking > > advantage of that, before you go exploring other options. > > Yes, I'll certainly do that in the coming days. > > >> >> Also, maybe the last_chunk number could be included in the file_info > >> >> struct? Now it's calculated in several places: "last_chunk0 = > >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on > >> >> every time as an extra argument. Seems like the sort of info that > >> >> could be part of file_info. > >> > > >> > It looks to me like you only need to calculate it in exactly one place: > >> > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating > >> > for efficiency, as it's a trivial calculation. > >> > >> Hm, I don't know. That's recalculating it for every byte in the > >> prefix. In the files I'm testing there could be 1 Mb or more of > > > > No, no, not every byte - only 1 in 131072 of the calls need to calculate > > it - only when it reaches the end of a block. > > Oh right. I guess it's ok then. > > > May I ask whether you've tested the code that crosses from one chunk to > > another? I noticed the following, just now: > > > >> + *at_least_one_end_reached = > >> + file_baton->curp[idx0] == file_baton->endp[idx0] > >> + || file_baton->curp[idx1] == file_baton->endp[idx1]; > >> + } > >> + > >> + /* If both files reached their end (i.e. are fully identical), > >> we're done */ > >> + if (file_baton->curp[idx0] == file_baton->endp[idx0] > >> + && file_baton->curp[idx1] == file_baton->endp[idx1]) > > > > Those tests seem to be saying "if we're at the end of the current chunk > > then we're at the end of the file". That looks wrong. What do you > > think? > > Well, it does work :-). The code is correct, but it's kind of > difficult to see, so that's my fault. > > The reason is that increment_pointer_or_chunk takes care that curp > never equals endp (if curp==endp-1 it reads the next chunk), unless it > reaches end of file. See this snippet of increment_pointer_or_chunk: > + if (*chunk_number == last_chunk_number) > + (*curp)++; /* *curp == *endp with last chunk signals end of file */ > > If you think this is too obscure to handle this, I agree, but I > couldn't think of a better/easier way at that time. If you have any > suggestions, feel free :-). But maybe I should refactor the patch > first, so it's easier to see how this would work out best. > > The same "obscurity" also applies to reaching "beginning of file" > (which can happen in decrement_pointer_or_chunk, either during prefix > scanning (rolling back the first line) or during suffix scanning (if > there was no prefix, but still a difference in the first line)). There > it's done by setting the chunk number to -1, which is checked for in > find_identical_prefix and find_identical_suffix: > + if (*chunk_number == 0) > + (*chunk_number)--; /* *chunk_number == -1 signals beginning of file > */ > > Not ideal, but it works ...
Oh OK, thanks for explaining. No further comment, m'lord :-) - Julian