John Nagle <na...@animats.com> writes:
>     A dictionary lookup (actually, several of them) for every
> input character is rather expensive. Tokenizers usually index into
> a table of character classes, then use the character class index in
> a switch statement.

Maybe you could use a regexp (and then have -two- problems...) to
find the token boundaries, then a dict to identify the actual token.
Tables of character classes seem a bit less attractive in the Unicode
era than in the old days.
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to