Wow. Since you went to the trouble of writing all this up, it really ought to go in a FAQ somewhere.
On Thu, Aug 08, 2002 at 12:05:00AM -0400, Mark J. Reed wrote: > Finite state machines can match regular expressions whose only operations > are closure (*), alternation (|), and grouping. Some of the other things Don't forget optional subexpressions (?). Not sure what the official name is. They can also handle negation, but nobody ever seems to put that into the regex syntax. Probably because the exact semantics get fuzzy when you're not anchored at the beginning and end, and any particular semantics you might pick are deceptive and/or useless. > You can turn an nondeterministic FA into a deterministic one (a DFA) > by a fairly straightforward algorithm. Basically, each state in the > DFA corresponds to a set of all the possible states the NFA *could* > be in at a given point in a string. So the number of states in > a DFA built from an NFA can be huge; but the machine never has to > backtrack and try again because it knows exactly which transition > to make at each step. So in general, NFAs take less memory but > run slower, while DFAs run faster but take more memory. And if you implement the NFA the way you're implying (as a traditional NFA), you'll always get the same answer to whether or not the expression matches, but assuming the answer is yes the NFA and DFA might match different parts of the string. Which doesn't contradict anything you're saying, but it's good to be aware that the theoretical purpose for these things (deciding whether a given string matches an expression or not) is not the same as the common practical purpose (finding the exact substring within a given string that matches the expression). > (Historically, the grep program has used an NFA, while egrep has > used a DFA.) That sounds backwards. You should probably also mention somewhere that once you break free of the bounds of regularity, it's much easier to implement many features and irregular constructs with an NFA rather than a DFA. Like capturing parentheses, alternation that prefers the choice the expression gives first, etc. > A step up from regular languages are context-free languages, which can > count. C-F languages can be recognized by a variety of different > mechanisms, but not by FSMs. Anyone happen to know where pushdown automata fit in this list? Can they handle context-sensitive, just context-free, or some other subset?