On Sat, 2010-10-09, Johan Corveleyn wrote:
> Ok, third iteration of the patch in attachment. It passes make check.
>
> As discussed in [1], this version keeps 50 lines of the identical
> suffix around, to give the algorithm a good chance to generate a diff
> output of good quality (in all but the most extreme cases, this will
> be the same as with the original svn_diff algorithm).
>
> That's about the only difference with the previous iteration. So for
> now, I'm submitting this for review. Any feedback is very welcome :-).
Hi Johan.
I haven't reviewed it, but after seeing today's discussion I had just
scrolled quickly through the previous version of this patch. I noticed
that the two main functions - find_identical_suffix and
find_identical_suffix - are both quite similar (but not quite similar
enough to make them easily share implementation) and both quite long,
and I noticed you wrote in an earlier email that you were finding it
hard to make the code readable. I have a suggestion that may help.
I think the existing structure of the svn_diff__file_baton_t is
unhelpful:
{
const svn_diff_file_options_t *options;
const char *path[4];
apr_file_t *file[4];
apr_off_t size[4];
int chunk[4];
char *buffer[4];
char *curp[4];
char *endp[4];
/* List of free tokens that may be reused. */
svn_diff__file_token_t *tokens;
svn_diff__normalize_state_t normalize_state[4];
apr_pool_t *pool;
} svn_diff__file_baton_t;
All those array[4] fields are logically related, but this layout forces
the programmer to address them individually.
So I wrote a patch - attached - that refactors this into an array of 4
sub-structures, and simplifies all the code that uses them.
I think this will help you to get better code clarity because then your
increment_pointer_or_chunk() for example will be able to take a single
pointer to a file_info structure instead of a lot of pointers to
individual members of the same.
Would you take a look and let me know if you agree. If so, I can commit
the refactoring straight away.
Responding to some of the other points you mentioned in a much earlier
mail:
> 3) It's only implemented for 2 files. I'd like to generalize this for
> an array of datasources, so it can also be used for diff3 and diff4.
>
> 4) As a small hack, I had to add a flag "datasource_opened" to
> token.c#svn_diff__get_tokens, so I could have different behavior for
> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> no longer needed.
Yes, I'd like to see 3), and so hack 4) will go away.
> 5) I've struggled with making the code readable/understandable. It's
> likely that there is still a lot of room for improvement. I also
> probably need to document it some more.
You need to write a full doc string for datasources_open(), at least.
It needs especially to say how it relates to datasource_open() - why
should the caller call this function rather than that function, and are
they both necessary - and how the caller is supposed to use the
'prefix_lines' parameter. And doc strings for
increment/decrement_...().
But this makes me think, it looks to me like this whole
prefix-suffix-skipping functionality would fit better inside the
lower-level diff algorithm rather than inside the "datasources_open"
function. Do you agree?
It works as it is, but you were talking about wanting it to obey the
standard token-parsing rules such as "ignore white space", and so on.
It seems that things like this would be much better one level down.
> 6) I've not paid too much attention to low level optimizations, so
> here also there's probably room for improvement, which may be
> significant for the critical loops.
Maybe. Not important at this stage.
> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> should probably make all "chunks" of type apr_off_t instead of int
> (but I have not done that yet, because the original implementation
> also used int for the chunk in the svn_diff__file_baton_t struct).
> Should I do this (also in the baton struct)? Or should I use an
> explicit cast somewhere?
I recommend using an explicit cast where needed, in this patch.
Changing the 'chunk' data type everywhere could be done in a separate
patch, either before or after this patch.
- Julian
> I still consider this a WIP, because of the following remaining todo's
> (which may have a lot of impact on the current implementation):
> - Generalize for more than 2 datasources (for diff3 and diff4).
> - revv svn_diff_fns_t and maybe other stuff I've changed in public API.
> - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
> switch the implementation to read out entire lines before comparing
> (like datasources_get_next_token does).
>
> Log message:
> [[[
> Make svn_diff_diff skip identical prefix and suffix to make diff and blame
> faster.
>
> * subversion/include/svn_diff.h
> (svn_diff_fns_t): Added new function types datasources_open and
> get_prefix_lines to the vtable.
>
> * subversion/libsvn_diff/diff_memory.c
> (datasources_open): New function (does nothing).
> (get_prefix_lines): New function (does nothing).
> (svn_diff__mem_vtable): Added new functions datasources_open and
> get_prefix_lines.
>
> * subversion/libsvn_diff/diff_file.c
> (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
> and suffix_offset_in_chunk.
> (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
> (find_identical_prefix, find_identical_suffix): New functions.
> (datasources_open): New function, to open both datasources and find their
> identical prefix and suffix. From the identical suffix, 50 lines are kept
> to
> help the diff algorithm find the nicest possible diff representation
> in case of ambiguity.
> (get_prefix_lines): New function.
> (datasource_get_next_token): Stop at start of identical suffix.
> (svn_diff__file_vtable): Added new functions datasources_open and
> get_prefix_lines.
>
> * subversion/libsvn_diff/diff.h
> (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
> the datasource was already opened.
>
> * subversion/libsvn_diff/token.c
> (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
> datasource if datasource_opened is FALSE. Set the starting offset of the
> position list to the number of prefix lines.
>
> * subversion/libsvn_diff/lcs.c
> (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
> the offset of the sentinel position for EOF, even if one of the files
> became empty after eliminating the identical prefix.
>
> * subversion/libsvn_diff/diff.c
> (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
> (svn_diff_diff): Use new function datasources_open, to open original and
> modified at once, and find their identical prefix and suffix. Pass
> prefix_lines to svn_diff__lcs and to svn_diff__diff.
>
> * subversion/libsvn_diff/diff3.c
> (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
> Pass prefix_lines = 0 to svn_diff__lcs.
>
> * subversion/libsvn_diff/diff4.c
> (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
> Pass prefix_lines = 0 to svn_diff__lcs.
> ]]]
>
>
> Cheers,
Re-factor a group of array variables into an array of structures, and
thereby simplify their usage.
* subversion/libsvn_diff/diff_file.c
(svn_diff__file_baton_t): Move a set of related array variables into a new
sub-structure array.
(datasource_open, datasource_get_next_token, token_compare): By taking a
pointer to the appropriate element of the sub-structure, simplify all
references to these variables.
(svn_diff_file_diff_2, svn_diff_file_diff3_2, svn_diff_file_diff4_2):
Adjust references.
--This line, and those below, will be ignored--
Index: subversion/libsvn_diff/diff_file.c
===================================================================
--- subversion/libsvn_diff/diff_file.c (revision 1006000)
+++ subversion/libsvn_diff/diff_file.c (working copy)
@@ -67,21 +67,24 @@ typedef struct svn_diff__file_token_t
typedef struct svn_diff__file_baton_t
{
const svn_diff_file_options_t *options;
- const char *path[4];
- apr_file_t *file[4];
- apr_off_t size[4];
+ struct file_info {
+ const char *path;
- int chunk[4];
- char *buffer[4];
- char *curp[4];
- char *endp[4];
+ apr_file_t *file;
+ apr_off_t size;
+
+ int chunk;
+ char *buffer;
+ char *curp;
+ char *endp;
+
+ svn_diff__normalize_state_t normalize_state;
+ } files[4];
/* List of free tokens that may be reused. */
svn_diff__file_token_t *tokens;
- svn_diff__normalize_state_t normalize_state[4];
-
apr_pool_t *pool;
} svn_diff__file_baton_t;
@@ -203,21 +206,19 @@ static svn_error_t *
datasource_open(void *baton, svn_diff_datasource_e datasource)
{
svn_diff__file_baton_t *file_baton = baton;
- int idx;
+ struct file_info *file = &file_baton->files[datasource_to_index(datasource)];
apr_finfo_t finfo;
apr_off_t length;
char *curp;
char *endp;
- idx = datasource_to_index(datasource);
-
- SVN_ERR(svn_io_file_open(&file_baton->file[idx], file_baton->path[idx],
+ SVN_ERR(svn_io_file_open(&file->file, file->path,
APR_READ, APR_OS_DEFAULT, file_baton->pool));
SVN_ERR(svn_io_file_info_get(&finfo, APR_FINFO_SIZE,
- file_baton->file[idx], file_baton->pool));
+ file->file, file_baton->pool));
- file_baton->size[idx] = finfo.size;
+ file->size = finfo.size;
length = finfo.size > CHUNK_SIZE ? CHUNK_SIZE : finfo.size;
if (length == 0)
@@ -226,10 +227,10 @@ datasource_open(void *baton, svn_diff_da
endp = curp = apr_palloc(file_baton->pool, (apr_size_t) length);
endp += length;
- file_baton->buffer[idx] = file_baton->curp[idx] = curp;
- file_baton->endp[idx] = endp;
+ file->buffer = file->curp = curp;
+ file->endp = endp;
- return read_chunk(file_baton->file[idx], file_baton->path[idx],
+ return read_chunk(file->file, file->path,
curp, length, 0, file_baton->pool);
}
@@ -252,7 +253,7 @@ datasource_get_next_token(apr_uint32_t *
{
svn_diff__file_baton_t *file_baton = baton;
svn_diff__file_token_t *file_token;
- int idx;
+ struct file_info *file = &file_baton->files[datasource_to_index(datasource)];
char *endp;
char *curp;
char *eol;
@@ -264,15 +265,13 @@ datasource_get_next_token(apr_uint32_t *
*token = NULL;
- idx = datasource_to_index(datasource);
+ curp = file->curp;
+ endp = file->endp;
- curp = file_baton->curp[idx];
- endp = file_baton->endp[idx];
-
- last_chunk = offset_to_chunk(file_baton->size[idx]);
+ last_chunk = offset_to_chunk(file->size);
if (curp == endp
- && last_chunk == file_baton->chunk[idx])
+ && last_chunk == file->chunk)
{
return SVN_NO_ERROR;
}
@@ -289,8 +288,8 @@ datasource_get_next_token(apr_uint32_t *
}
file_token->datasource = datasource;
- file_token->offset = chunk_to_offset(file_baton->chunk[idx])
- + (curp - file_baton->buffer[idx]);
+ file_token->offset = chunk_to_offset(file->chunk)
+ + (curp - file->buffer);
file_token->raw_length = 0;
file_token->length = 0;
@@ -311,7 +310,7 @@ datasource_get_next_token(apr_uint32_t *
}
}
- if (file_baton->chunk[idx] == last_chunk)
+ if (file->chunk == last_chunk)
{
eol = endp;
break;
@@ -320,21 +319,21 @@ datasource_get_next_token(apr_uint32_t *
length = endp - curp;
file_token->raw_length += length;
svn_diff__normalize_buffer(&curp, &length,
- &file_baton->normalize_state[idx],
+ &file->normalize_state,
curp, file_baton->options);
file_token->length += length;
h = svn_diff__adler32(h, curp, length);
- curp = endp = file_baton->buffer[idx];
- file_baton->chunk[idx]++;
- length = file_baton->chunk[idx] == last_chunk ?
- offset_in_chunk(file_baton->size[idx]) : CHUNK_SIZE;
+ curp = endp = file->buffer;
+ file->chunk++;
+ length = file->chunk == last_chunk ?
+ offset_in_chunk(file->size) : CHUNK_SIZE;
endp += length;
- file_baton->endp[idx] = endp;
+ file->endp = endp;
- SVN_ERR(read_chunk(file_baton->file[idx], file_baton->path[idx],
+ SVN_ERR(read_chunk(file->file, file->path,
curp, length,
- chunk_to_offset(file_baton->chunk[idx]),
+ chunk_to_offset(file->chunk),
file_baton->pool));
/* If the last chunk ended in a CR, we're done. */
@@ -349,7 +348,7 @@ datasource_get_next_token(apr_uint32_t *
length = eol - curp;
file_token->raw_length += length;
- file_baton->curp[idx] = eol;
+ file->curp = eol;
/* If the file length is exactly a multiple of CHUNK_SIZE, we will end up
* with a spurious empty token. Avoid returning it.
@@ -360,7 +359,7 @@ datasource_get_next_token(apr_uint32_t *
{
char *c = curp;
svn_diff__normalize_buffer(&c, &length,
- &file_baton->normalize_state[idx],
+ &file->normalize_state,
curp, file_baton->options);
file_token->norm_offset = file_token->offset;
@@ -388,13 +387,12 @@ token_compare(void *baton, void *token1,
char buffer[2][COMPARE_CHUNK_SIZE];
char *bufp[2];
apr_off_t offset[2];
- int idx[2];
+ struct file_info *file[2];
apr_off_t length[2];
apr_off_t total_length;
/* How much is left to read of each token from the file. */
apr_off_t raw_length[2];
int i;
- int chunk[2];
svn_diff__normalize_state_t state[2];
file_token[0] = token1;
@@ -420,17 +418,18 @@ token_compare(void *baton, void *token1,
for (i = 0; i < 2; ++i)
{
- idx[i] = datasource_to_index(file_token[i]->datasource);
+ int idx = datasource_to_index(file_token[i]->datasource);
+
+ file[i] = &file_baton->files[idx];
offset[i] = file_token[i]->norm_offset;
- chunk[i] = file_baton->chunk[idx[i]];
state[i] = svn_diff__normalize_state_normal;
- if (offset_to_chunk(offset[i]) == chunk[i])
+ if (offset_to_chunk(offset[i]) == file[i]->chunk)
{
/* If the start of the token is in memory, the entire token is
* in memory.
*/
- bufp[i] = file_baton->buffer[idx[i]];
+ bufp[i] = file[i]->buffer;
bufp[i] += offset_in_chunk(offset[i]);
length[i] = total_length;
@@ -458,15 +457,15 @@ token_compare(void *baton, void *token1,
NULL,
_("The file '%s' changed unexpectedly"
" during diff"),
- file_baton->path[idx[i]]);
+ file[i]->path);
/* Read a chunk from disk into a buffer */
bufp[i] = buffer[i];
length[i] = raw_length[i] > COMPARE_CHUNK_SIZE ?
COMPARE_CHUNK_SIZE : raw_length[i];
- SVN_ERR(read_chunk(file_baton->file[idx[i]],
- file_baton->path[idx[i]],
+ SVN_ERR(read_chunk(file[i]->file,
+ file[i]->path,
bufp[i], length[i], offset[i],
file_baton->pool));
offset[i] += length[i];
@@ -623,8 +622,8 @@ svn_diff_file_diff_2(svn_diff_t **diff,
memset(&baton, 0, sizeof(baton));
baton.options = options;
- baton.path[0] = original;
- baton.path[1] = modified;
+ baton.files[0].path = original;
+ baton.files[1].path = modified;
baton.pool = svn_pool_create(pool);
SVN_ERR(svn_diff_diff(diff, &baton, &svn_diff__file_vtable, pool));
@@ -645,9 +644,9 @@ svn_diff_file_diff3_2(svn_diff_t **diff,
memset(&baton, 0, sizeof(baton));
baton.options = options;
- baton.path[0] = original;
- baton.path[1] = modified;
- baton.path[2] = latest;
+ baton.files[0].path = original;
+ baton.files[1].path = modified;
+ baton.files[2].path = latest;
baton.pool = svn_pool_create(pool);
SVN_ERR(svn_diff_diff3(diff, &baton, &svn_diff__file_vtable, pool));
@@ -669,10 +668,10 @@ svn_diff_file_diff4_2(svn_diff_t **diff,
memset(&baton, 0, sizeof(baton));
baton.options = options;
- baton.path[0] = original;
- baton.path[1] = modified;
- baton.path[2] = latest;
- baton.path[3] = ancestor;
+ baton.files[0].path = original;
+ baton.files[1].path = modified;
+ baton.files[2].path = latest;
+ baton.files[3].path = ancestor;
baton.pool = svn_pool_create(pool);
SVN_ERR(svn_diff_diff4(diff, &baton, &svn_diff__file_vtable, pool));