Asger Ottar Alstrup wrote:

> Alfredo Braunstein wrote:
>> Amazing review, thanks. On a fast read, I've found that the variant that
>> perfectly fits our editing/CT capabilities is 1-degree, unit-cost edit
>> distance. Unfortunately, the algorithm cited for this purpose [Sel77; I
>> haven't looked at it yet, though] seems to have time and space complexity
>> of
>> 
>> O(|T1||T2|)
>> 
>> which doesn't seem very practical (and, by the way, this is roughly
>> equivalent to string edit distance over an extended alphabet in which
>> insets are single symbols, with the substitution cost between insets
>> recursively defined as their edit distance, computed with the plain
>> Levenshtein algorithm)
> 
> Only X percent of insets will have changed when you compare two
> documents. In a long document, X will be small.
> 
> Therefore, you can probably optimise most of the problem away so that
> complexity will not hurt you. Calculate a SHA-1 of the contents of each
> inset, and use that for initial comparisons, rather than the complete
> list of characters.
> 
> Then the complexity is proportional to the product of the two inset
> counts (roughly the number of insets squared), which shouldn't be too
> bad.
> 
> After this, you get a reduced list of insets which have changed, and then
> you can rerun the algorithm on the text of those insets only.

While this seems like a nice optimization, I don't think it really helps.
It isn't the comparison between insets that hurts performance: assume O(1)
inset length and O(N) main-text length (a fairly reasonable assumption for
the UserGuide, I'd say). Building the full pairwise comparison table of
insets is probably not hard.
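
Just to fix ideas, here is a minimal Python sketch of such a table; the
inset lists of plain strings are hypothetical stand-ins for whatever we
actually store:

import hashlib

def inset_digest(contents):
    # SHA-1 of an inset's contents, used as a cheap proxy for a full
    # character-level comparison.
    return hashlib.sha1(contents.encode("utf-8")).hexdigest()

def changed_insets(insets1, insets2):
    # Full pairwise table of digest comparisons: O(#insets^2) cheap
    # equality checks instead of character-level comparisons.
    d1 = [inset_digest(c) for c in insets1]
    d2 = [inset_digest(c) for c in insets2]
    table = [[a == b for b in d2] for a in d1]
    # Insets with no identical counterpart in the other document are the
    # only ones that would need the expensive treatment afterwards.
    changed1 = [i for i, row in enumerate(table) if not any(row)]
    changed2 = [j for j in range(len(d2)) if not any(row[j] for row in table)]
    return changed1, changed2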

For the record, the UserGuide has

~215113 chars (wc -c on the .txt export)
~978 insets (grep -c "begin_inset [^Q]" on the UserGuide)

IMO, the problem is the length of the main text, combined with the fact
that the alphabet is not fixed (because of insets) and that the
substitution cost is not unit. I haven't yet found an algorithm whose
complexity is substantially less than O(NN') in this situation.
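
To make the O(NN') point concrete, this is the kind of straightforward DP I
mean; again a rough Python sketch, where a document is hypothetically a
sequence whose elements are either plain characters or nested lists standing
in for insets:

def edit_distance(s1, s2):
    # Plain Levenshtein DP, except that the substitution cost between two
    # insets is recursively their own edit distance (insertion/deletion is
    # kept at unit cost here just to keep the sketch short).
    def subst_cost(a, b):
        if isinstance(a, list) and isinstance(b, list):
            return edit_distance(a, b)       # recurse into insets
        return 0 if a == b else 1            # plain characters: unit cost
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]    # d[i][j]: s1[:i] vs s2[:j]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                     # deletion
                          d[i][j - 1] + 1,                     # insertion
                          d[i - 1][j - 1] + subst_cost(s1[i - 1], s2[j - 1]))
    return d[n][m]

Even with O(1) inset length, the table alone has N*N' cells, which is
exactly the part I don't see how to avoid.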

A/

