approximate string matching in regexes

Sean O'Rourke Mon, 02 Sep 2002 08:27:20 -0700

Edit distance is a measure of the number of edits required to
turn one string into another, basically the size of a
character-by-character "diff".  For example, the edit distance
between "aches" and "access" is one change and one insert:


aches
acCes
accesS

Depending on how you define the costs of edit operations, this
could either cost 2 (one for the change and one for the insert)
or 3 (the change counts as a delete followed by an insert).  Edit
distance can be used for approximate string matching, by saying
that one string "equals" another if the edit distance between the
two strings is less than some threshhold.

Approximate matching cannot be done with regular expressions
short of making an alternation consisting of all possible
alternatives withing a given distance.  Using an edit distance
algorithm, it can be done in O(nm) time and space, where n and m
are the lengths of the two strings to be compared.  When matching
a pattern against a substring of a text this becomes O(n^2),
where n is the pattern length.

Since it solves a common class of pattern-matching problems not
addressed by regexes or grammars, I think this would make a
valuable addition to the Parrot regex engine.  I propose
something along the following lines:

rx_amatch Sx, Ix, Sy, Nx

    Do an approximate match of Sy against Sx at Ix with maximum
    edit distance floor(Nx * length(Sy)).  Push a mark, then the
    end positions of the approximate matches, onto the regex
    stack, with the closest match on top.

Using this op, matching "/[:approx(0.3) aches] s/" against a
substring of "access" at position 0 would be turned into Parrot
bytecode like so:

    set S0, "access"
    set S1, "aches"
    set I0, 0

    rx_amatch S0, I0, S1, 0.3
next_amatch:
    rx_popindex I0, fail
    rx_literal S0, I0, "s", next_amatch
    branch succ
fail:
    print "no "
succ:
    print "match\n"

I attached an implementation of this op.  Comments welcome.

/s

amatch.tgz
Description: Binary data

approximate string matching in regexes

Reply via email to