OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.
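Just to show the trick in isolation, here's roughly what I mean (the
little pattern and lookup table below are made up purely for
illustration):

import re

# One capturing group per alternative; lastindex tells us directly
# which alternative matched, no need to walk groups().
pattern = re.compile( r"(\d+)|([a-zA-Z_]\w*)|(\s+)" )
kinds = { 1: "integer", 2: "identifier", 3: "whitespace" }

m = pattern.match( "hello" )
if m:
    print kinds[m.lastindex]    # prints: identifier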
In fact, I was able to use your idea to cobble together a poor man's
lexer, which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:
# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )
# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( "\s+" )
scanner.rule( "/\*", reflex.set_state( State.Comment ) )
scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
    "(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+",
    token=TokenType.Real )
# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( "\*/", reflex.set_state( State.Default ) )
scanner.rule( "(?:[^*]|\*(?!/))+" )
# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
    print token
Internally, it creates an array of patterns and actions for each state.
Then, when you ask it to create a scanner instance, it combines all of
the patterns into one large regular expression. Input lines are matched
against this regex, and if a match succeeds, the match object's
'lastindex' property is used to look up the action to perform in the
array.
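To give a rough idea of the mechanism, here is a bare-bones sketch
(not the actual reflex internals; the names, single-state rule table,
and error handling are all invented for the example):

import re

# Hypothetical single-state rule table: (pattern, token kind) pairs.
rules = [
    ( r"\s+",          None ),          # skip whitespace
    ( r"[a-zA-Z_]\w*", "identifier" ),
    ( r"\d+",          "integer" ),
]

# Wrap each pattern in its own capturing group and join them into one
# big alternation.  (Rule patterns must use (?:...) internally,
# otherwise lastindex could point at an inner group.)
combined = re.compile( "|".join( "(%s)" % pat for pat, kind in rules ) )

def scan( line ):
    pos = 0
    while pos < len( line ):
        m = combined.match( line, pos )
        if not m:
            raise SyntaxError( "bad input at column %d" % pos )
        # lastindex is 1-based, so the rule number is lastindex - 1.
        kind = rules[m.lastindex - 1][1]
        if kind is not None:
            yield kind, m.group()
        pos = m.end()

for tok in scan( "foo 42 bar" ):
    print tok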