Stefan Sperling wrote:
> Julian asked for a formal description of how 'svn patch' applies
> hunks. Below are my current plans. The current code does something
> quite different and will be changed to implement what is described
> below before 1.7 release.
> 
> Any feedback appreciated.

Hi Stefan.

It's an interesting subject to think about: how best to apply a patch.

First, a few specific thoughts on your doc, but don't get too involved
in them until you read my concluding remarks :-)


> This note is a high-level description of how 'svn patch' applies hunks
> from a patch file to a file in the working copy.
> 
> First, the unidiff input is parsed by libsvn_diff, and passed to the
> patch application code as a list of patches, each containing a list of
> hunks to be applied to the patch target.

Call the patch for each file a "file-patch", as the plain word "patch"
is ambiguous.

> The patch application code in libsvn_client is shielded from the specifics
> of the unidiff file format. All it really cares about is:
>  1) The 'patch target', which is the file to be patched.
>  2) The 'original' and 'modified' texts of each hunk in the patch,
>     and for each hunk, the line offset at which the original text
>     should be replaced with the modified text.
> 
> The 'original' text is what the lines described by the hunk looked like
> before they were edited by the person who created the patch.
> The 'modified' text is what the lines described by the hunk looked like
> after editing.
> Both texts include surrounding lines of context obtained from the unidiff.
> 
> Each hunk has an 'original' line offset, described in the hunk header.
> This line offset describes where the lines described by the hunk were
> located in the original, i.e. unpatched, file. This is the preferred
> location for applying the hunk.
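
(For concreteness, the per-hunk data I'm picturing while reading this is
roughly the following; the names are my own invention, not the actual
libsvn_diff structures.)

    # Rough model of what the patching code receives for one hunk.
    # Field names are invented for illustration only.
    class Hunk:
        def __init__(self, orig_off, orig_text, mod_text):
            self.orig_off = orig_off    # line offset in the unpatched file
            self.orig_text = orig_text  # lines: context + lines to remove
            self.mod_text = mod_text    # lines: context + lines to add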

Rather than preferring to apply it in the target file at the line number
it had in the original file, I think you should aim to do better. One
way would be by scaling that number by (length of target) / (length of
original). Otherwise, if the target has been modified by inserting or
removing many lines spread evenly throughout it, the "closeness"
measurement would favour matches near the beginning of the file, where
line numbers have not changed much, over matches near the end of the
file, where line numbers may have shifted by thousands.

A better refinement of that would be to interpolate the line offset of
the next still-to-be-placed hunk between the nearest preceding and
following hunks that have already been matched with high certainty.
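
To make that concrete, here is a sketch of the interpolation I have in
mind. Everything here is invented for illustration; 'anchors' maps
original-file offsets of certainly-matched hunks to the target-file
offsets where they matched:

    def expected_offset(orig_off, orig_file_len, target_file_len, anchors):
        # File start and end always act as anchors; certainly-matched
        # hunks add further fixed points in between.
        points = ([(0, 0)] + sorted(anchors.items())
                  + [(orig_file_len, target_file_len)])
        for (o1, t1), (o2, t2) in zip(points, points[1:]):
            if o1 <= orig_off <= o2:
                if o2 == o1:
                    return t1
                # linear interpolation between the surrounding anchors
                return t1 + (orig_off - o1) * (t2 - t1) // (o2 - o1)
        return orig_off  # no surrounding anchors; use the stated offset

With no certain matches yet this degenerates to plain scaling by the
length ratio, which already removes the end-of-file bias described above.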

> The goals of the hunk-application algorithm are:
>  1) Memory usage is bound by the longest line occurring in the patch file
>     or the patch target. Never even attempt to read an entire file into RAM.

I don't see that goal (1) is compatible with any interpretation of the
algorithm. I think you need to keep at least a whole hunk of the patch,
and at least a corresponding amount of the target file, in memory at a
minimum.

>  2) File seeks should be kept at a minimum.

Sure, but that's obvious, and vague unless you say what you would be
willing to trade off against it.

>  3) Hunks matching at or near their original offset in the target file are
>     always preferred over hunks matching at the same location but with a
>     greater offset relative to their original offset.

Would I be right in guessing that "matching at the same location but
with a greater offset relative to their original offset" means "matching
further away"? And see above for comment about adjusting the original
offset in proportion with known extremes of the target file (start and
end, or nearest certainly-matching hunks).

Important: I think the criteria for quality of match should include how
much text in the context matches, and how much text in the original side
of the hunk body matches. A hunk with a large total amount of matching
text should be preferred over one with only a small amount of matching
text, such as empty context lines or lines containing just "{" or
"endif". I think the amount of matching text should carry considerably
more weight than the line-number offset from the expected location.
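
As a strawman, a candidate location could be scored something like this
(a sketch; the weighting is made up for illustration):

    def match_score(hunk_orig_lines, target_lines, cand_off, expected_off):
        matched_chars = 0
        for i, line in enumerate(hunk_orig_lines):
            t = cand_off + i
            if t < len(target_lines) and target_lines[t] == line:
                # weigh by non-blank characters, so blank lines and lines
                # like "{" or "endif" count for very little
                matched_chars += len(line.strip())
        # distance from the expected offset only breaks near-ties
        return matched_chars - 0.01 * abs(cand_off - expected_off)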

>  4) Every hunk in the patch gets a chance to be applied, independent of
>     matching results for hunks which were matched before it or will be
>     matched after it. Hunks are rejected only if they don't match anywhere
>     in the file, or if they collide with another hunk which is a better
>     match according to 3). This implies that:
>     4a) Hunks are not assumed to be listed in any particular order.
>     4b) The offsets listed in hunk headers are only used as a guide,
>         not as an authoritative reference.
>     4c) It is possible for a hunk to match at multiple candidate locations,
>         but it will only be applied at one of them.
>  5) Hunks are only applied if they do not overlap with any other hunk.
>  6) By default, a patch is either applied fully or not at all.
>     Rejection of hunks is only allowed if the user has given permission.
>     If the user has not given such permission, attempting to apply a patch
>     with rejected hunks will raise an error and leave the working copy
>     unmodified.
> 
> A patch may want to change more than one target file.
> 
> Each patch target is scanned line-by-line, and possible matches for each
> of the patch's hunks are determined by matching the 'original' hunk text
> to the patch target at the current line offset. This results in a list
> of candidate offsets, one such list for each hunk.
> 
> For each hunk, its list of candidate offsets is now sorted such that
> offsets closest to the hunk's original offset (as parsed from the hunk
> header in the patch file) come first.
> If two candidate offsets have the same distance from the original offset,
> the order they are listed in is undefined (depending on the sorting
> algorithm used, the result might be deterministic, but the hunk application
> algorithm does not rely on this).
> 
> Next, the list of hunks is sorted, such that hunks whose first candidate
> offset is closest to the hunk's original offset come first.
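
If I'm reading that right, the two sorting steps amount to something like
this sketch (reusing the orig_off/cand_off notation from your example
below, not the real code):

    def sort_candidates_and_hunks(hunks):
        for h in hunks:
            # closest candidate to the hunk's stated original offset first
            h.cand_off.sort(key=lambda off: abs(off - h.orig_off))
        # hunks whose best candidate is closest to home are placed first
        hunks.sort(key=lambda h: abs(h.cand_off[0] - h.orig_off)
                   if h.cand_off else float('inf'))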
> 
> Next, the algorithm builds a list of line ranges occupied by hunks which
> were successfully applied.
> In each item of this list, the occupying hunk's line offset as well as
> the number of lines in the hunk's original text is stored, specifying
> the range of lines occupied by the hunk.
> 
> Note that the minimum size of a line range depends on the amount
> of context provided in the unidiff. For standard unidiff files with 3
> lines of leading and trailing context, the minimum length of the original
> text is 6 lines (e.g. consider a hunk adding a single line to the file).
> 
> Let C represent the current index into the lists of candidate offsets.
> Set C = 1 (indexing the first candidate offset).
> Iterate over the list of hunks.
> If the hunk is already marked applied or rejected, ignore it.
> If the hunk's Cth candidate offset is not contained in any range described
> by the list of occupied lines, store its current candidate offset and the
> length of its original text in the list of occupied line ranges and mark
> the hunk as applied.
> Else, if the hunk is not marked applied, and its current candidate offset
> is within the range of occupied lines, defer handling of this hunk to the
> next iteration.
> If the hunk has no Cth candidate offset, mark the hunk as rejected.
> Increment C and iterate.
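
Just to check that I've parsed the placement step correctly, here it is
as a sketch (my own code, not the real thing; note that I test the whole
candidate range against the occupied list, per goal 5, rather than only
the starting offset as the prose literally says):

    def place_hunks(hunks):
        occupied = []  # (start, length) ranges claimed by applied hunks

        def overlaps(start, length):
            return any(start < s + l and s < start + length
                       for s, l in occupied)

        pending = list(hunks)  # assumed already sorted as described above
        c = 0                  # index into each hunk's candidate list
        while pending:
            still_pending = []
            for h in pending:
                if c >= len(h.cand_off):
                    h.rejected = True           # ran out of candidates
                elif not overlaps(h.cand_off[c], len(h.orig_text)):
                    occupied.append((h.cand_off[c], len(h.orig_text)))
                    h.applied_at = h.cand_off[c]
                else:
                    still_pending.append(h)     # try its next candidate
            pending = still_pending
            c += 1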
> 
> After this step, all hunks are marked either applied or rejected.
> If any hunks were rejected and rejection is not allowed, raise an error.
> 
> Repeat the above process for the next patch target.
> 
> Once all patch targets are processed as above, read each target file again
> from the top, copying lines not contained in the list of lines occupied by
> hunks verbatim. Replace any lines occupied by a hunk with lines of the hunk's
> modified text.
> (Note that the number of lines in the modified text of a hunk may differ
> from the number of lines in the original text of a hunk, so the patched
> file can have a different number of lines than the original file. All offsets
> are relative to the original file, however.)
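
And my reading of the final rewrite pass (sketch only; 0-based offsets
for simplicity):

    def write_patched(target_lines, applied_hunks):
        # Stream the target, copying unclaimed lines verbatim and
        # substituting each claimed range with the hunk's modified text.
        by_offset = {h.applied_at: h for h in applied_hunks}
        out = []
        i = 0
        while i < len(target_lines):
            h = by_offset.get(i)
            if h is not None:
                out.extend(h.mod_text)       # replace the occupied range
                i += len(h.orig_text)        # skip the original lines
            else:
                out.append(target_lines[i])  # copy verbatim
                i += 1
        return out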
> 
> Write any rejected hunks for this target to a reject file.
> 
> The patch is now applied.
> 
> For example, consider the hunks H1, H2 and H3:
>   H1.orig_off = 40
>   H1.orig_len = 10
> 
>   H2.orig_off = 19
>   H2.orig_len = 6
> 
>   H3.orig_off = 2
>   H3.orig_len = 7
> 
> Imagine we found the following candidate offsets for H1, H2 and H3:
>   H1.cand_off = [ 15, 25, 45 ]
>   H2.cand_off = [ 10, 20, 30, 50 ]
>   H3.cand_off = [ 51 ]
> 
> Sorting the candidate lists according to their distance from the
> original offsets, we get:
> 
>   H1.cand_off = [ 45, 25, 15 ]
>   H2.cand_off = [ 20, 10, 30, 50 ]
>   H3.cand_off = [ 51 ]
> 
> Sorting the list of hunks according to their first candidate's distance
> to their original offset, we get:
> 
>   H2.orig_off = 19
>   H2.cand_off = [ 20, 10, 30, 50 ]
> 
>   H1.orig_off = 40
>   H1.cand_off = [ 45, 25, 15 ]
> 
>   H3.orig_off = 2
>   H3.cand_off = [ 51 ]
> 
> During the next step, H2 will be marked as applied in range [20, 26],
> and H1 will be marked as applied in range [45, 55], and H3 will be
> rejected since it only matches in a range already occupied by H1.

Concluding remarks:

I have no idea how effective the algorithm above will be. I will
continue to think about how to do patching well, over the remaining
years of my life. In the meantime, to do this well will take a large
amount of effort.

Is there a way you could take advantage of the work that has been done
before, such as OpenBSD's patch(1) source code? Then we wouldn't have to
concern ourselves with working out how to do it and analyzing the
algorithm.

- Julian

