John Nagle <na...@animats.com> writes:
> A dictionary lookup (actually, several of them) for every
> input character is rather expensive. Tokenizers usually index into
> a table of character classes, then use the character class index in
> a switch statement.

Maybe you could use a regexp (and then have -two- problems...) to find the token boundaries, then a dict to identify the actual token. Tables of character classes seem a bit less attractive in the Unicode era than in the old days.
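
Something along these lines, as a rough sketch only. The token names, the KEYWORDS and OPERATORS tables, and the tokenize() function are all made up for illustration and are not from any real tokenizer. The regexp finds where each token ends, and a dict lookup happens once per token rather than once per character. For str patterns, \w and \d in Python 3 are Unicode-aware, so no hand-built character class table is needed.

```python
import re

# Hypothetical token tables, invented for this example.
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE"}
OPERATORS = {
    "==": "EQ", "=": "ASSIGN", "+": "PLUS", "-": "MINUS",
    "*": "STAR", "/": "SLASH", "(": "LPAREN", ")": "RPAREN",
}

# One alternative per kind of token; the regexp only finds the boundaries.
TOKEN_RE = re.compile(r"""
    (?P<name>[^\W\d]\w*)      # identifier or keyword, any Unicode letters
  | (?P<number>\d+)           # run of digits
  | (?P<op>==|[-+*/=()])      # operators, longest alternative first
  | (?P<space>\s+)            # whitespace, skipped below
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "space":
            continue
        if kind == "name":
            # One dict lookup per token, not per input character.
            tokens.append((KEYWORDS.get(lexeme, "NAME"), lexeme))
        elif kind == "op":
            tokens.append((OPERATORS[lexeme], lexeme))
        else:
            tokens.append(("NUMBER", lexeme))
    return tokens


if __name__ == "__main__":
    print(tokenize("if größe == 42"))
```

Whether this actually beats a per-character class table would have to be measured; the point is only that the per-character work moves into the regexp engine and the dict is consulted once per token.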