On Mon, Aug 14, 2017 at 4:57 PM, Jonathan Tan <[email protected]> wrote:
> The existing behavior of diff --color-moved=zebra does not define the
> minimum size of a block at all, instead relying on a heuristic applied
> later to filter out sets of adjacent moved lines that are shorter than 3
> lines long. This can be confusing, because a block could thus be colored
> as moved at the source but not at the destination (or vice versa),
> depending on its neighbors.
>
> Instead, teach diff that the minimum size of a block is 10
> non-whitespace characters. This allows diff to still exclude
> uninteresting lines appearing on their own (such as those solely
> consisting of one or a few closing braces), as was the intention of the
> adjacent-moved-line heuristic.
After some thought, I really like this heuristic, however allow me
a moment to bikeshed 10 as a number here.
One could think that 10 equals roughly 3 lines a 3 characters and
in C based languages the shortest meaningful lines have more than
3 characters ("i++;", "a();", "int i;" have 4 or 5 each), but I would still
think that 10 is too much, as we'd want to detect the closing braces
in their own lines.
> dimmed_zebra::
> Similar to 'zebra', but additional dimming of uninteresting parts
> of moved code is performed. The bordering lines of two adjacent
> diff --git a/diff.c b/diff.c
> index f598d8a3a..305ce4126 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -864,19 +864,28 @@ static int shrink_potential_moved_blocks(struct
> moved_entry **pmb,
> /*
> * If o->color_moved is COLOR_MOVED_PLAIN, this function does nothing.
> *
> - * Otherwise, if the last block has fewer lines than
> - * COLOR_MOVED_MIN_BLOCK_LENGTH, unset DIFF_SYMBOL_MOVED_LINE on all lines in
> - * that block.
> + * Otherwise, if the last block has fewer non-space characters than
> + * COLOR_MOVED_MIN_NON_SPACE_COUNT, unset DIFF_SYMBOL_MOVED_LINE on all lines
> + * in that block.
> *
> * The last block consists of the (n - block_length)'th line up to but not
> * including the nth line.
> */
> static void adjust_last_block(struct diff_options *o, int n, int
> block_length)
> {
> - int i;
> - if (block_length >= COLOR_MOVED_MIN_BLOCK_LENGTH ||
> - o->color_moved == COLOR_MOVED_PLAIN)
> + int i, non_space_count = 0;
> + if (o->color_moved == COLOR_MOVED_PLAIN)
> return;
> + for (i = 1; i < block_length + 1; i++) {
> + const char *c = o->emitted_symbols->buf[n - i].line;
> + for (; *c; c++) {
> + if (isspace(*c))
> + continue;
> + non_space_count++;
> + if (non_space_count >=
> COLOR_MOVED_MIN_NON_SPACE_COUNT)
> + return;
When we do this counting, we could count the lines ourselves here as well.
`n-block_count` should be equal to the line that has a different
(flags & (DIFF_SYMBOL_MOVED_LINE | DIFF_SYMBOL_MOVED_LINE_ALT))
pattern than those before. (although we'd also have to check for i > 0, too)
Your choice.
> + }
> + }
> for (i = 1; i < block_length + 1; i++)
> o->emitted_symbols->buf[n - i].flags &=
> ~DIFF_SYMBOL_MOVED_LINE;
> }
> @@ -923,7 +932,6 @@ static void mark_color_as_moved(struct diff_options *o,
> }
>
> l->flags |= DIFF_SYMBOL_MOVED_LINE;
> - block_length++;
>
> if (o->color_moved == COLOR_MOVED_PLAIN)
> continue;
> @@ -953,8 +961,13 @@ static void mark_color_as_moved(struct diff_options *o,
> }
>
> flipped_block = (flipped_block + 1) % 2;
> +
> + adjust_last_block(o, n, block_length);
> + block_length = 0;
> }
>
> + block_length++;
> +
> if (flipped_block)
> l->flags |= DIFF_SYMBOL_MOVED_LINE_ALT;
> }
> diff --git a/diff.h b/diff.h
> index 5755f465d..9e2fece5b 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -195,7 +195,7 @@ struct diff_options {
> COLOR_MOVED_ZEBRA_DIM = 3,
> } color_moved;
> #define COLOR_MOVED_DEFAULT COLOR_MOVED_ZEBRA
> - #define COLOR_MOVED_MIN_BLOCK_LENGTH 3
> + #define COLOR_MOVED_MIN_NON_SPACE_COUNT 10
> };
>
> void diff_emit_submodule_del(struct diff_options *o, const char *line);
> diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh
> index 6f7758e5c..d8e7b77b9 100755
> --- a/t/t4015-diff-whitespace.sh
> +++ b/t/t4015-diff-whitespace.sh
> @@ -1101,9 +1101,9 @@ test_expect_success 'detect malicious moved code,
> inside file' '
> <BRED>-{<RESET>
> <BLUE>-if (!u->is_allowed_foo)<RESET>
> <BLUE>-return;<RESET>
> - <BRED>-foo(u);<RESET>
> - <BLUE>-}<RESET>
> - <BLUE>-<RESET>
> + <RED>-foo(u);<RESET>
> + <RED>-}<RESET>
> + <RED>-<RESET>
Here we have 2 blocks, the first has 7 character,
which we may want to detect, the second has only 1 char.
The longest "uninteresting" line in C like languages might
be "\t } else {" which has 6 non-ws characters.
Thinking of other languages (shell "fi" is uninteresting, others are
interesting,
Latex \"custom" all bets are off), I think we may want to go lower and have
COLOR_MOVED_MIN_NON_SPACE_COUNT to be about 6
(~ 2 characters on a 3 line block).
That said this is all bikeshedding, feel free to ignore.
Acked-by: Stefan Beller <[email protected]>