Edit distance is a measure of the number of edits required to turn one string into another, basically the size of a character-by-character "diff". For example, the edit distance between "aches" and "access" is one change and one insert:
aches acCes accesS Depending on how you define the costs of edit operations, this could either cost 2 (one for the change and one for the insert) or 3 (the change counts as a delete followed by an insert). Edit distance can be used for approximate string matching, by saying that one string "equals" another if the edit distance between the two strings is less than some threshhold. Approximate matching cannot be done with regular expressions short of making an alternation consisting of all possible alternatives withing a given distance. Using an edit distance algorithm, it can be done in O(nm) time and space, where n and m are the lengths of the two strings to be compared. When matching a pattern against a substring of a text this becomes O(n^2), where n is the pattern length. Since it solves a common class of pattern-matching problems not addressed by regexes or grammars, I think this would make a valuable addition to the Parrot regex engine. I propose something along the following lines: rx_amatch Sx, Ix, Sy, Nx Do an approximate match of Sy against Sx at Ix with maximum edit distance floor(Nx * length(Sy)). Push a mark, then the end positions of the approximate matches, onto the regex stack, with the closest match on top. Using this op, matching "/[:approx(0.3) aches] s/" against a substring of "access" at position 0 would be turned into Parrot bytecode like so: set S0, "access" set S1, "aches" set I0, 0 rx_amatch S0, I0, S1, 0.3 next_amatch: rx_popindex I0, fail rx_literal S0, I0, "s", next_amatch branch succ fail: print "no " succ: print "match\n" I attached an implementation of this op. Comments welcome. /s
amatch.tgz
Description: Binary data