On Wed, Nov 18, 2009 at 12:41:37AM +0200, Daniel Shahaf wrote: > Stefan Sperling wrote on Tue, 17 Nov 2009 at 09:16 +0100: > > On Mon, Nov 16, 2009 at 11:31:12PM +0200, Daniel Shahaf wrote: > > > Stefan Sperling wrote on Mon, 16 Nov 2009 at 14:39 +0100: > > > > Let C represent the current index into the lists of candidate offsets. > > > > Set C = 1 (indexing the first candidate offset). > > > > Iterate over the list of hunks. > > > > If the hunk is already marked applied or rejected, ignore it. > > > > If the hunk's Cth candidate offset is not contained in any range > > > > described > > > > by the list of occupied lines, store its current candidate offset and > > > > the > > > > length of its original text in the list of occupied line ranges and mark > > > > the hunk as applied. > > > > Else, if the hunk is not marked applied, and its current candidate > > > > offset > > > > is within the range of occupied lines, defer handling of this hunk to > > > > the > > > > next iteration. > > > > If the hunk has no Cth candidate offset, mark the hunk as rejected. > > > > Increment C and iterate. > > > > > > > > > > i.e., iterate all 1st candidates, then all 2nd candidates, etc. Why not > > > iterate by hunks first? (i.e., first all candidates of 1st hunk, then > > > all candidates of 2nd hunk, etc.) > > > > We already know what the best candidate offset for each hunk is, > > so why iterate over the candidates without considering how placing > > the current hunk at the current candidate offset affects other hunks? > > > > The point of this step is to prevent hunks from occupying overlapping > > line ranges. Any 2 hunks read from the patch cannot have occupied > > overlapping space in the original file, otherwise they would be a single, > > larger, hunk. > > > > Sorry, not sure I follow... (it's a little late where I am) > > But, thinking for a moment. What are we arguing about? Isn't the > typical case only 1-2 candidate offsets per hunk? (in which case, your > scheme and my suggestion reduce to the same thing)
That's probably the typical case, yes, but let's assume a non-typical case (a huge number of possible matches for each hunk) and see how the algorithm copes. I'm arguing that iterating over hunks first is missing the point, and that it does not amount to the same thing as I was proposing. Recall that we do two sorting steps: The first sort sorts the candidate offset list for each hunk in turn. It makes sure that, for a given hunk, we prefer the "best" candidate offset, which is close to the original offset (Julian pointed out that there may be better criteria than "how much does the candidate offset differ from the original offset?", and the sorting could happen with such criteria in mind.) So we know that whichever offset comes first in the candidate list, it is the "best" one for _this_ hunk. But we don't know yet if it can be used without disturbing another "even better" hunk which could also occupy all or part of the same range of lines. So the second sort sorts the hunk list, putting hunks which have the "best" candidate offsets first (again, "how much does the offset differ from the original line offset?" is one way of doing it, there may be more). So you were asking why, in the following step, we loop over candidate offsets, considering each hunk in turn as we do so, rather than looping over hunks, considering each candidate offset as we do so. The answer is that we want to end up using the best possible offset for every hunk. Looking at hunks in isolation does not achieve this goal. When we pick a candidate offset for a hunk, we want to make sure the resulting line range occupied by the hunk does not overlap with a range we assigned to a different hunk earlier. If the range is already occupied, we know that it is occupied by a "better" hunk (because the hunk list is already sorted accordingly) and that ignoring the current candidate offset for the current hunk will give the best overall result. What you propose does not answer any interesting question. Iterating over hunks first and looking at each candidate offset for the current hunk in turn does not give us any new information because we already sorted the candidate offset list in the best way we can for the current hunk. Is it more clear now? Stefan