On Nov 9, 7:55 am, Thomas Mlynarczyk <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I started to write a lexer in Python -- my first attempt to do something
> useful with Python (rather than trying out snippets from tutorials). It
> is not complete yet, but I would like some feedback -- I'm a Python
> newbie and it seems that, with Python, there is always a simpler and
> better way to do it than you think.
>
> ### Begin ###
>
> import re
>
> class Lexer(object):
So far, so good.

>     def __init__( self, source, tokens ):

Be consistent with your punctuation style. I'd suggest *not* having a
space after ( and before ), as in the previous line. Read
http://www.python.org/dev/peps/pep-0008/

>         self.source = re.sub( r"\r?\n|\r\n", "\n", source )

Firstly, would you not expect to be getting your text from a text file
(perhaps even one opened with the universal newlines option), i.e. by the
time it's arrived here, source has already had \r\n changed to \n? (See
the first sketch appended below.)

Secondly, that's equivalent to

    re.sub(r"\n|\r\n|\r\n", "\n", source)

What's wrong with

    re.sub(r"\r\n", "\n", source)

?

Thirdly, if source does contain \r\n and there is an error, the reported
value of offset will be incorrect. Consider retaining the offset of the
last newline seen, so that your error reporting can include the line
number and (include or use) the column position in the line.

>         self.tokens = tokens
>         self.offset = 0
>         self.result = []
>         self.line = 1
>         self._compile()
>         self._tokenize()
>
>     def _compile( self ):
>         for name, regex in self.tokens.iteritems():
>             self.tokens[name] = re.compile( regex, re.M )
>
>     def _tokenize( self ):

Unless you have other plans for it, offset could be local to this method.

>         while self.offset < len( self.source ):

You may like to avoid getting len(self.source) for each token.

>             for name, regex in self.tokens.iteritems():

dict.iter<anything>() will return its results in essentially random
order. It doesn't matter with your example, but you will rapidly come
across real-world cases where the order matters. One such case is
distinguishing between real constants (1.23, .123, 123.) and integer
constants (123). (See the second sketch appended below for one way of
keeping the order explicit.)

>                 match = regex.match( self.source, self.offset )
>                 if not match: continue
>                 self.offset += len( match.group(0) )
>                 self.result.append( ( name, match, self.line ) )
>                 self.line += match.group(0).count( "\n" )
>                 break
>             else:
>                 raise Exception(
>                     'Syntax error in source at offset %s' %
>                     str( self.offset ) )

Using str() here and below is redundant ... "%s" % obj is documented to
produce str(obj).

>
>     def __str__( self ):
>         return "\n".join(
>             [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %

For avoidance of ambiguity, you may like to change that '%s' to %r.

>                 ( str( line ), str( match.pos ), name, match.group(0) )
>                     for name, match, line in self.result ] )
>

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
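
To make the universal-newlines point concrete, here is a minimal sketch;
the filename is just a placeholder. Opening the file in Python 2's "rU"
mode makes read() translate \r\n and \r to \n, so the re.sub call in
__init__ becomes unnecessary:

    # Sketch only: "source.txt" is a placeholder filename.
    f = open("source.txt", "rU")   # "U" = universal newlines (Python 2)
    try:
        source = f.read()          # \r\n and \r come back as \n
    finally:
        f.close()
    assert "\r" not in source      # line endings already normalized

The resulting string can then be handed to the Lexer as is.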
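
To make the ordering and line/column points concrete, here is a second,
untested sketch of the same idea with the token table kept as a list of
(name, pattern) pairs (so FLOAT is always tried before INT) and with the
error message reporting line and column instead of a raw offset. The
token names and patterns are only illustrative, not taken from the
original post:

    import re

    # Order matters: FLOAT must be tried before INT, otherwise "1.23"
    # would come out as INT(1) followed by a syntax error at ".".
    TOKEN_SPECS = [
        ("FLOAT", r"\d+\.\d*|\.\d+"),
        ("INT",   r"\d+"),
        ("NAME",  r"[A-Za-z_]\w*"),
        ("OP",    r"[=+\-*/()]"),
        ("WS",    r"[ \t\n]+"),
    ]

    class Lexer(object):
        def __init__(self, source, token_specs):
            self.source = source
            # A list of pairs preserves the order given by the caller.
            self.tokens = [(name, re.compile(pattern, re.M))
                           for name, pattern in token_specs]
            self.result = []
            self._tokenize()

        def _tokenize(self):
            offset = 0                 # local: nothing else needs it
            line = 1
            line_start = 0             # offset just after the last newline
            length = len(self.source)  # computed once, not per token
            while offset < length:
                for name, regex in self.tokens:
                    match = regex.match(self.source, offset)
                    if not match:
                        continue
                    text = match.group(0)
                    self.result.append((name, text, line))
                    line += text.count("\n")
                    if "\n" in text:
                        line_start = offset + text.rindex("\n") + 1
                    offset += len(text)
                    break
                else:
                    column = offset - line_start + 1
                    raise SyntaxError("Syntax error at line %d, column %d"
                                      % (line, column))

        def __str__(self):
            # %r shows the matched text unambiguously (quotes, escapes).
            return "\n".join("[L:%d]\t[%s]\t%r" % (line, name, text)
                             for name, text, line in self.result)

For example, print Lexer("x = 3.14\ny = 42", TOKEN_SPECS) lists one token
per line with its line number, and an illegal character such as $ is
reported with its line and column rather than a bare offset.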