Julian asked for a formal description of how 'svn patch' applies hunks. Below are my current plans. The current code does something quite different and will be changed to implement what is described below before the 1.7 release.
Any feedback appreciated.

Stefan

This note is a high-level description of how 'svn patch' applies hunks from a patch file to a file in the working copy.

First, the unidiff input is parsed by libsvn_diff and passed to the patch application code as a list of patches, each containing a list of hunks to be applied to the patch target. The patch application code in libsvn_client is shielded from the specifics of the unidiff file format. All it really cares about is:

1) The 'patch target', which is the file to be patched.

2) The 'original' and 'modified' texts of each hunk in the patch, and, for each hunk, the line offset at which the original text should be replaced with the modified text.

The 'original' text is what the lines described by the hunk looked like before they were edited by the person who created the patch. The 'modified' text is what those lines looked like after editing. Both texts include surrounding lines of context obtained from the unidiff.

Each hunk has an 'original' line offset, given in the hunk header. This offset describes where the lines described by the hunk were located in the original, i.e. unpatched, file. It is the preferred location for applying the hunk.

The goals of the hunk-application algorithm are:

1) Memory usage is bounded by the longest line occurring in the patch file or the patch target. Never even attempt to read an entire file into RAM.

2) File seeks are kept to a minimum.

3) Hunks matching at or near their original offset in the target file are always preferred over hunks matching at the same location but at a greater distance from their original offset.

4) Every hunk in the patch gets a chance to be applied, independently of the matching results for hunks matched before or after it. Hunks are rejected only if they do not match anywhere in the file, or if they collide with another hunk which is a better match according to 3).
This implies that:

4a) Hunks are not assumed to be listed in any particular order.

4b) The offsets listed in hunk headers are used only as a guide, not as an authoritative reference.

4c) A hunk may match at multiple candidate locations, but it will only be applied at one of them.

5) Hunks are only applied if they do not overlap with any other hunk.

6) By default, a patch is either applied fully or not at all. Rejection of hunks is only allowed if the user has given permission. If the user has not given such permission, attempting to apply a patch with rejected hunks raises an error and leaves the working copy unmodified. (A patch may want to change more than one target file.)

Each patch target is scanned line by line, and possible matches for each of the patch's hunks are determined by matching the 'original' hunk text against the patch target at the current line offset. This results in a list of candidate offsets, one such list for each hunk.

For each hunk, its list of candidate offsets is then sorted such that offsets closest to the hunk's original offset (as parsed from the hunk header in the patch file) come first. If two candidate offsets have the same distance from the original offset, the order in which they are listed is undefined (depending on the sorting algorithm used, the result might be deterministic, but the hunk-application algorithm does not rely on this).

Next, the list of hunks is sorted such that hunks whose first candidate offset is closest to the hunk's original offset come first.

Next, the algorithm builds a list of line ranges occupied by hunks which were successfully applied. Each item of this list stores the occupying hunk's line offset and the number of lines in the hunk's original text, specifying the range of lines occupied by the hunk. Note that the minimum size of a line range depends on the amount of context provided in the unidiff.
For standard unidiff files with 3 lines of leading and trailing context, the minimum length of the original text is 6 lines (e.g. consider a hunk adding a single line to the file).

Let C be the current index into the lists of candidate offsets. Set C = 1 (indexing the first candidate offset). Iterate over the list of hunks:

 - If the hunk is already marked applied or rejected, ignore it.

 - If the hunk's Cth candidate offset is not contained in any range described by the list of occupied lines, store this candidate offset and the length of the hunk's original text in the list of occupied line ranges, and mark the hunk as applied.

 - Else, if the hunk's Cth candidate offset falls within an occupied range, defer handling of this hunk to the next iteration.

 - If the hunk has no Cth candidate offset, mark the hunk as rejected.

Increment C and iterate. After this step, all hunks are marked either applied or rejected. If any hunks were rejected and rejection is not allowed, raise an error. Repeat the above process for the next patch target.

Once all patch targets have been processed as above, read each target file again from the top, copying lines not contained in any occupied range verbatim, and replacing lines occupied by a hunk with the lines of the hunk's modified text. (Note that the number of lines in a hunk's modified text may differ from the number of lines in its original text, so the patched file can have a different number of lines than the original file. All offsets are relative to the original file, however.) Write any rejected hunks for this target to a reject file. The patch is now applied.
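The sorting and placement steps above can be sketched as follows. This is an illustrative Python sketch only; the real implementation is C code in libsvn_client, and all names here (place_hunks, cand_off, 'state', etc.) are made up. Note that the sketch checks for overlap of the hunk's whole line range, per point 5) above, rather than only containment of the start offset:

```python
# Illustrative sketch of candidate sorting and hunk placement.
# Hunks are plain dicts here; all names are hypothetical.

def place_hunks(hunks):
    """Mark each hunk in `hunks` as 'applied' (recording the chosen
    offset in 'where') or 'rejected'. Each hunk is a dict with keys
    'orig_off', 'orig_len' and 'cand_off'."""
    # Sort each hunk's candidate offsets by distance from the offset
    # parsed from the hunk header.
    for h in hunks:
        h['cand_off'].sort(key=lambda off: abs(off - h['orig_off']))

    # Sort the hunks so that those whose best candidate is closest to
    # their original offset get to claim their lines first.
    hunks.sort(key=lambda h: abs(h['cand_off'][0] - h['orig_off'])
                             if h['cand_off'] else float('inf'))

    occupied = []  # (start, length) line ranges claimed by applied hunks

    def collides(start, length):
        # True if [start, start+length) overlaps any occupied range.
        return any(start < s + l and s < start + length
                   for s, l in occupied)

    c = 0  # index into the candidate lists (the 'C' above, zero-based)
    pending = list(hunks)
    while pending:
        deferred = []
        for h in pending:
            if c >= len(h['cand_off']):
                h['state'] = 'rejected'   # ran out of candidate offsets
            elif not collides(h['cand_off'][c], h['orig_len']):
                occupied.append((h['cand_off'][c], h['orig_len']))
                h['state'] = 'applied'
                h['where'] = h['cand_off'][c]
            else:
                deferred.append(h)        # collision; retry next round
        pending = deferred
        c += 1
    return hunks
```

Running this on the example hunks further below yields H2 applied at offset 20, H1 applied at offset 45, and H3 rejected.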
For example, consider the hunks H1, H2 and H3:

H1.orig_off = 40
H1.orig_len = 10
H2.orig_off = 19
H2.orig_len = 6
H3.orig_off = 2
H3.orig_len = 7

Imagine we found the following candidate offsets:

H1.cand_off = [ 15, 25, 45 ]
H2.cand_off = [ 10, 20, 30, 50 ]
H3.cand_off = [ 51 ]

Sorting the candidate lists according to their distance from the original offsets, we get:

H1.cand_off = [ 45, 25, 15 ]
H2.cand_off = [ 20, 10, 30, 50 ]
H3.cand_off = [ 51 ]

Sorting the list of hunks according to their first candidate's distance from their original offset, we get:

H2.orig_off = 19
H2.cand_off = [ 20, 10, 30, 50 ]
H1.orig_off = 40
H1.cand_off = [ 45, 25, 15 ]
H3.orig_off = 2
H3.cand_off = [ 51 ]

During the next step, H2 will be marked as applied in range [20, 26], H1 will be marked as applied in range [45, 55], and H3 will be rejected since it only matches within a range already occupied by H1.
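Once placement is done, the write-out pass described earlier copies unoccupied lines verbatim and substitutes each occupied range with the hunk's modified text. A minimal Python sketch, with hypothetical names; the real code streams each target file line by line rather than building lists of lines in memory, per goal 1):

```python
def write_patched(original_lines, applied):
    """Assemble the patched text. `applied` holds (offset, orig_len,
    modified_lines) triples, where offsets are 1-based line numbers
    into the original file and the occupied ranges do not overlap.
    All offsets are relative to the original file, so the output may
    contain a different number of lines."""
    out = []
    i = 0  # zero-based position in original_lines
    for off, orig_len, modified in sorted(applied):
        out.extend(original_lines[i:off - 1])  # copy untouched lines verbatim
        out.extend(modified)                   # substitute the hunk's modified text
        i = off - 1 + orig_len                 # skip the hunk's original text
    out.extend(original_lines[i:])             # copy the remainder
    return out
```

For instance, applying a hunk with offset 3, original length 2 and a 3-line modified text to a 10-line file yields an 11-line result, with lines 1-2 and 5-10 copied through unchanged.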