Alfredo Braunstein wrote:
Amazing review, thanks. On a quick read, I've found that the variant that
perfectly fits our editing/CT capabilities is 1-degree, unit-cost edit
distance. Unfortunately the algorithm cited for this purpose [Sel77; I
haven't looked at it yet, though] seems to have time and space complexity
of O(|T1||T2|), which doesn't seem very practical. (By the way, this seems
roughly similar to what you get with string edit distance in which insets
are symbols of an extended alphabet, with the cost of substitution
recursively defined as edit distance, using the plain Levenshtein
algorithm.)
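For concreteness, here is a rough sketch of that extended-alphabet idea in
Python. All names here (Inset, subst_cost, edit_distance) are made up for
illustration, not existing LyX code; the substitution cost simply recurses
into inset contents:

# Plain Levenshtein over sequences whose symbols are either characters
# or insets, with inset-vs-inset substitution cost defined recursively
# as the edit distance of their contents.

class Inset:
    """Stand-in for a LyX inset: just a sequence of symbols."""
    def __init__(self, contents):
        self.contents = contents

def subst_cost(x, y):
    """Unit substitution cost, except inset-vs-inset recurses."""
    if isinstance(x, Inset) and isinstance(y, Inset):
        return edit_distance(x.contents, y.contents)
    return 0 if x == y else 1

def edit_distance(a, b):
    """Plain O(|a||b|) Levenshtein using the cost function above."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + subst_cost(a[i - 1], b[j - 1]))
    return d[n][m]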
Only X percent of insets will have changed when you compare two
documents. In a long document, X will be small.
Therefore, you can probably optimise most of the problem away so that the
complexity will not hurt you. Calculate a SHA-1 hash of the contents of
each inset, and use that for the initial comparisons, rather than the
complete list of characters.
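Something along these lines (just a sketch; serialize() is an assumed
helper that turns an inset's contents into bytes, not an existing call):

import hashlib

def inset_fingerprint(inset):
    # SHA-1 of the inset's serialized contents: two insets can then be
    # compared by a 20-byte digest instead of their full character lists.
    return hashlib.sha1(inset.serialize()).digest()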
Then the complexity is the product of the two documents' inset counts
(roughly the number of insets squared), which shouldn't be too bad.
After this, you get a reduced list of insets that have changed, and then
you can rerun the algorithm on the text of those only.
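Roughly like this (again only a sketch: doc.insets is an assumed attribute,
and I'm assuming a simple one-to-one pairing of insets between the two
documents):

def diff_changed_insets(doc1, doc2):
    # First pass: pair up corresponding insets and keep only the pairs
    # whose fingerprints differ (cheap digest comparison only).
    changed = [(a, b) for a, b in zip(doc1.insets, doc2.insets)
               if inset_fingerprint(a) != inset_fingerprint(b)]
    # Second pass: rerun the expensive character-level algorithm on the
    # text of the changed insets only.
    return [(a, b, edit_distance(a.contents, b.contents))
            for a, b in changed]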
Regards,
Asger