Alfredo Braunstein wrote:

> Asger Ottar Alstrup wrote:
> 
>> Alfredo Braunstein wrote:
>>> Amazing review, thanks. On a fast read, I've found that the variant that
>>> perfectly fits our edition/CT capabilities is 1-degree, unit cost edit
>>> distance. Unfortunately the algorithm cited for this purpose [Sel77,
>>> Didn't look at it yet, though.] seems to have time and space complexity
>>> of
>>> 
>>> O(|T1||T2|)
>>> 
>>> which doesn't seem very practical (and by the way, this seems roughly
>>> similar to what you get with string edit distance in which insets are
>>> symbols of an extended alphabet with cost of substitution recursively
>>> defined as edit distance, using plain Levenshtein algorithm)
>> 
>> Only X percent of insets will have changed when you compare two
>> documents. In a long document, X will be small.
>> 
>> Therefore, you can probably optimise most of the problem away so that
>> complexity will not hurt you. Calculate a SHA-1 of the contents of each
>> inset, and use that for initial comparisons, rather than the complete
>> list of characters.
>> 
>> Then complexity is the product of the number of insets squared, which
>> shouldn't be to bad.
>> 
>> After this, you get a reduced list of insets which are changed, and then
>> you can rerun the algorithm on the text of those only.
> 
> While this seems a nice optimization, I don't think that this really
> helps. I don't think that it's comparison between insets that hurts

Moreover, this optimization only helps on the diagonal of the matrix of
distance(I1, I2) (assuming one is comparing UserGuide with
UserGuide+epsilon). So I would say that it won't help noticeably if one is
computing this full matrix. 

This is not to say that this couldn't be helpful nevertheless. In particular
many optimizations of the original dynamic-programming algorithm avoid
exploring regions of the distance matrix based on certain bounds. In this
context having a fast decision d(I1,I2) == 0 vs. d(I1,I2) >= 1 may be
crucial. Still, details on how to use this are lacking for the
moment... ;-)

A/


Reply via email to