Log message: Speed-up of libsvn_diff using token counts By using indices, not node pointers, to refer to tokens, and counting the number of each token, the longest common subsequence (lcs) algorithm at the heart of libsvn_diff becomes much faster in many situations by skipping tokens that are unique to one source or the other.
* subversion/libsvn_diff/token.c (svn_diff__node_t): Added 'index' member. (svn_diff__tree_t): Added 'node_count' member. (svn_diff__get_token_count): New method. (tree_insert_token): Assigns indices to nodes. (svn_diff__get_tokens): Uses index, not pointer, to identify node in position struct. * subversion/libsvn_diff/lcs.c (svn_diff__snake): Takes token counts as additional argument. Skips tokens that are unique to one or the other file. (svn_diff__lcs): Takes token counts and number of different tokens as additional arguments. Calculates number of tokens unique to each source and uses this to compute effective length difference. * subversion/libsvn_diff/diff.h (svn_diff__position_t): Changed node member from being a pointer to the node to an integer index for the node. Updated method declarations. * subversion/libsvn_diff/diff.c * subversion/libsvn_diff/diff3.c * subversion/libsvn_diff/diff4.c (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the number of times each token appears in each source. (svn_diff__resolve_conflict): Takes number of different tokens as additional argument. Counts the number of times each token appears in each (partial) source. Comments: This patch can reduce the time required for a diff by a large factor when comparing long files with large differences. As an extreme case, splitting sqlite.c into two roughly equal parts and diffing them against each other took about a minute on my laptop with the original code, and 7-8 seconds with my patch. The speed-up is gained only if a large fraction of the changed tokens are unique to one source or the other. I can not see that the change would cause a significant performance degradation in any reasonable case. Counting the number of times each token appears is also useful for potential additional features in the future, such as identifying moved blocks outside of the longest common subsequence. I have tested only svn_diff_diff_2 (that it gives identical results as before), but the changes to svn_diff_diff3 and svn_diff_diff4 are completely analogous. Tested on a Windows 7 64-bit machine, TortoiseSVN build script. Morten Kloster (first patch)