Alfredo Braunstein wrote:
Amazing review, thanks. On a quick read, I found that the variant that
best fits our editing/CT capabilities is 1-degree, unit-cost edit
distance. Unfortunately the algorithm cited for this purpose [Sel77; I
haven't looked at it yet, though] seems to have time and space complexity of

O(|T1||T2|)

which doesn't seem very practical. (By the way, this seems roughly
equivalent to what you get from string edit distance in which insets are
symbols of an extended alphabet whose substitution cost is recursively
defined as the edit distance of their contents, using the plain
Levenshtein algorithm.)
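For concreteness, a minimal sketch of that recursive Levenshtein variant. The representation is assumed: a paragraph is a plain Python list whose elements are either characters or nested insets (themselves lists); the function names are illustrative only.

def edit_distance(a, b):
    """Unit-cost edit distance between two sequences of chars/insets."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # delete a[i-1]
                          d[i][j - 1] + 1,      # insert b[j-1]
                          d[i - 1][j - 1] + subst_cost(a[i - 1], b[j - 1]))
    return d[m][n]

def subst_cost(x, y):
    """Substitution cost: 0 if equal, 1 for plain characters, and the
    (recursive) edit distance of the contents for two insets."""
    if x == y:
        return 0
    if isinstance(x, list) and isinstance(y, list):
        return edit_distance(x, y)
    return 1

# Example: "ab[cd]e" vs "ab[cxd]e", where [...] is an inset.
p1 = ['a', 'b', ['c', 'd'], 'e']
p2 = ['a', 'b', ['c', 'x', 'd'], 'e']
print(edit_distance(p1, p2))   # -> 1 (one insertion inside the inset)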

Only X percent of the insets will have changed when you compare two versions of a document. In a long document, X will be small.

Therefore, you can probably optimise most of the problem away so that complexity will not hurt you. Calculate a SHA-1 of the contents of each inset, and use that for initial comparisons, rather than the complete list of characters.
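Something along these lines (a sketch only; how inset contents are serialized is assumed, and hashlib's standard SHA-1 does the fingerprinting):

import hashlib

def inset_fingerprint(serialized_contents):
    """SHA-1 digest of an inset's serialized contents. Any stable byte
    representation of the inset will do."""
    return hashlib.sha1(serialized_contents.encode('utf-8')).hexdigest()

# Two insets with equal fingerprints can be assumed unchanged and skipped;
# only insets whose fingerprints differ need the full, character-level
# comparison.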

Then the complexity of this initial comparison is the product of the numbers of insets in the two documents (roughly the number of insets squared), which shouldn't be too bad.

After this, you are left with a reduced list of insets that have changed, and you can rerun the algorithm on the text of those only.
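Roughly like this, reusing the hypothetical inset_fingerprint and edit_distance helpers from the sketches above (the representation of the inset lists is again an assumption):

def changed_insets(old_insets, new_insets):
    """Return the insets from each document whose fingerprint has no match
    in the other document; only these need the full comparison.
    old_insets / new_insets: lists of serialized inset contents."""
    old_fps = [inset_fingerprint(c) for c in old_insets]
    new_fps = [inset_fingerprint(c) for c in new_insets]
    # All-pairs fingerprint comparison, quadratic in the number of insets
    # as described above; a set of fingerprints would make it linear.
    old_changed = [c for c, fp in zip(old_insets, old_fps) if fp not in new_fps]
    new_changed = [c for c, fp in zip(new_insets, new_fps) if fp not in old_fps]
    return old_changed, new_changed

# The expensive edit distance is then run only on the changed insets,
# e.g. pairing them up by position and calling edit_distance on each pair.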

Regards,
Asger
