Julian asked for a formal description of how 'svn patch' applies
hunks. Below are my current plans. The current code does something
quite different and will be changed to implement what is described
below before 1.7 release.

Any feedback appreciated.

Stefan


This note is a high-level description of how 'svn patch' applies hunks
from a patch file to a file in the working copy.

First, the unidiff input is parsed by libsvn_diff, and passed to the
patch application code as a list of patches, each containing a list of
hunks to be applied to the patch target.

The patch application code in libsvn_client is shielded from the specifics
of the unidiff file format. All it really cares about is:
 1) The 'patch target', which is the file to be patched.
 2) The 'original' and 'modified' texts of each hunk in the patch,
    and for each hunk, the line offset at which the original text
    should be replaced with the modified text.

The 'original' text is what the lines described by the hunk looked like
before they were edited by the person who created the patch.
The 'modified' text is what the lines described by the hunk looked like
after editing.
Both texts include surrounding lines of context obtained from the unidiff.

Each hunk has an 'original' line offset, described in the hunk header.
This line offset describes where the lines described by the hunk were
located in the original, i.e. unpatched, file. This is the preferred
location for applying the hunk.

The goals of the hunk-application algorithm are:
 1) Memory usage is bound by the longest line occurring in the patch file
    or the patch target. Never even attempt to read an entire file into RAM.
 2) File seeks should be kept at a minimum.
 3) Hunks matching at or near their original offset in the target file are
    always preferred over hunks matching at the same location but with a
    greater offset relative to their original offset.
 4) Every hunk in the patch gets a chance to be applied, independent of
    matching results for hunks which were matched before it or will be
    matched after it. Hunks are rejected only if they don't match anywhere
    in the file, or if they collide with another hunk which is a better
    match according to 3). This implies that:
    4a) Hunks are not assumed to be listed in any particular order.
    4b) The offsets listed in hunk headers are only used as a guide,
        not as an authoritative reference.
    4c) It is possible for a hunk to match at multiple candidate locations,
        but it will only be applied at one of them.
  5) Hunks are only applied if they do not overlap with any other hunk.
  6) By default, a patch is either applied fully or not at all.
     Rejection of hunks is only allowed if the user has given permission.
     If the user not given such permission, attempting to apply a patch
     with rejected hunks will raise an error and leave the working copy
     unmodified.

A patch may want to change more than one target file.

Each patch target is scanned line-by-line, and possible matches for each
of the patch's hunks are determined by matching the 'original' hunk text
to the patch target at the current line offset. This results in a list
of candidate offsets, one such list for each hunk.

For each hunk, its lists of candidate offsets is now sorted such that
offsets closest to the hunk's original offset (as parsed from the hunk
header in the patch file) come first.
If two candidate offsets have the same distance from the original offset,
the order they are listed in is undefined (depending on the sorting
algorithm used, the result might be deterministic, but the hunk application
algorithm does not rely on this).

Next, the list of hunks is sorted, such that hunks whose first candidate
offset is closest to the hunk's original offset come first.

Next, the algorithm builds a list of line ranges occupied by hunks which
were successfully applied.
In each item of this list, the occupying hunk's line offset as well as
the number of lines in the hunk's original text is stored, specifying
the range of lines occupied by the hunk.

Note that the minimum size of a line range depends on the amount
of context provided in the unidiff. For standard unidiff files with 3
lines of leading and trailing context, the minimum length of the original
text is 6 lines (e.g. consider a hunk adding a single line to the file).

Let C represent the current index into the lists of candidate offsets.
Set C = 1 (indexing the first candidate offset).
Iterate over the list of hunks.
If the hunk is already marked applied or rejected, ignore it.
If the hunk's Cth candidate offset is not contained in any range described
by the list of occupied lines, store its current candidate offset and the
length of its original text in the list of occupied line ranges and mark
the hunk as applied.
Else, if the hunk is not marked applied, and its current candidate offset
is within the range of occupied lines, defer handling of this hunk to the
next iteration.
If the hunk has no Cth candidate offset, mark the hunk as rejected.
Increment C and iterate.

After this step, all hunks are marked either applied or rejected.
If any hunks were rejected and rejection is not allowed, raise an error.

Repeat the above process for the next patch target.

Once all patch targets are processed as above, read each target file again
from the top, copying lines not contained in the list of lines occupied by
hunks verbatim. Replace any lines occupied by a hunk with lines of the hunk's
modified text.
(Note that the number of lines in the modified text of a hunk may differ
from the number of lines in the original text of a hunk, so the patched
file can have a different number of lines than the original file. All offsets
are relative to the original file, however.)

Write any rejected hunks for this target to a reject file.

The patch is now applied.

For example, consider the hunks H1 and H2:
  H1.orig_off = 40
  H1.orig_len = 10

  H2.orig_off = 19
  H2.orig_len = 6

  H3.orig_off = 2
  H3.orig_len = 7

Imagine we found the following candidate offsets for H1 and H2:
  H1.cand_off = [ 15, 25, 45 ]
  H2.cand_off = [ 10, 20, 30, 50 ]
  H3.cand_off = [ 51 ]

Sorting the candidate lists according to their distance from the
original offsets, we get:

  H1.cand_off = [ 45, 25, 15 ]
  H2.cand_off = [ 20, 10, 30, 50 ]
  H3.cand_off = [ 51 ]

Sorting the list of hunks according to their first candidate's distance
to their original offfset, we get:

  H2.orig_off = 19
  H2.cand_off = [ 20, 10, 30, 50 ]

  H1.orig_off = 40
  H1.cand_off = [ 45, 25, 15 ]

  H3.orig_off = 2
  H3.cand_off = [ 51 ]

During the next step, H2 will be marked as applied in range [20, 26],
and H1 will be marked as applied in range [45, 55], and H3 will be
rejected since it only matches in a range already occupied by H2.

Reply via email to