Johan / Stefan - any progress on the reviewing part? Only possible error I'm aware of is that I made the indices/counts 32 bit signed/unsigned integers, which means it'll run into trouble if there are more than ~2 billion different tokens (lines) or more than ~4 billion occurrences of the same token. If the library should be able to handle such files, they should be made 64 bit (for the indices, it should be safe to keep them 32 bit on 32 bit machines, as those could never have the memory to run such diffs anyway).
I have a couple of other changes ready: One is a minor simplification, by getting rid of the idx parameter in lcs.c - there is a simpler way to accomplish what it's for. The other change is a partial implementation of the other possible enhancement I mentioned earlier, which takes advantage of unique matches to "restart" the lcs algorithm whenever possible, thus reducing the effect of the square computational complexity. The "partial" part is that this version is only really effective if the two files have roughly the same number of (non-unique) lines; it will not reduce the (P * length difference) part of the complexity. For good cases, though, it can reduce the time required by the lcs algorithm by large factors (roughly the number of well-separated sections of difference, if they are all about the same size). The question is how I should post these changes. The idx change is relatively independent of the others; should I post a patch from current HEAD revision for that change alone, should I post it in this thread or a new one, and should I specify how that patch interacts with the other patches (and if so, where)? The other patch is based on my original patch; should I post a patch of the additional changes or a cumulative patch from current HEAD revision?