Bruno Desthuilliers wrote:
> Is the array of lines the appropriate data structure here ?

I've done tokenizers both as an array of lines and as one long string. The
former has seemed easier when the language treats EOL as a statement
separator.

Re not letting literal strings in code terminate blocks: I think it's the
tokenizer-writer's job to be nice to the tokenizer's users, the first of
whom will be me, and I'll definitely have string literals that enclose what
would otherwise be a block-end marker.

> While we're at it, you may not know but there are already a couple
> Python packages for building tokenizers/parsers

The tokenizer in the Python library is pretty close to what I want, but it
returns tuples, where I want an array of Token objects. It also reads the
source a line at a time, which seems a bit out of date -- maybe two or
three decades out of date.

Actually, it takes about a day to write a reasonable tokenizer (that is, if
you are writing in a language you already know). Since I know the problem
thoroughly, it seemed like a good starting point for learning Python.

There's a tokenizer I wrote in Java at
http://www.MartinRinehart.com/src/language/Tokenizer.html . Actually,
that's an HTML page written by my "javasrc" (parallel to Sun's javadoc),
based on the Tokenizer's tokenizing of its own source.

Have I got those quotes right?

--
http://mail.python.org/mailman/listinfo/python-list
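For what it's worth, wrapping the standard library's tokenize module to get
Token objects instead of tuples is only a few lines. A minimal sketch (the
Token class and its attribute names here are my own illustration, not
anyone's actual code):

```python
# Sketch: wrap the stdlib tokenize module so it yields Token objects
# instead of tuples. Token is a hypothetical class for illustration.
import io
import tokenize

class Token:
    """Illustrative Token class; attribute names are assumptions."""
    def __init__(self, kind, text, start, end):
        self.kind = kind      # token type name, e.g. 'NAME', 'STRING'
        self.text = text      # the token's source text
        self.start = start    # (row, col) where the token begins
        self.end = end        # (row, col) just past where it ends

    def __repr__(self):
        return f"Token({self.kind}, {self.text!r})"

def tokenize_source(source):
    """Return a list of Token objects for a string of Python source."""
    return [
        Token(tokenize.tok_name[tok.type], tok.string, tok.start, tok.end)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    ]
```

Note that `tokenize_source("x = 'a } b'\n")` keeps the literal `'a } b'` as
a single STRING token, so a brace inside a string can never be mistaken for
a block-end marker -- which is the "be nice to the users" behavior above.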
I've done tokenizers both as an array of lines and as a long string. The former has seemed easier when the language treats EOL as a statement separator. re not letting literal strings in code terminate blocks, I think its the tokenizer-writer's job to be nice to the tokenizer users, the first one of which will be me, and I'll definitely have string literals that enclose what would otherwise be a block end marker. > While we're at it, you may not know but there are already a couple > Python packages for building tokenizers/parsers The tokenizer in the Python library is pretty close to what I want, but it returns tuples, where I want an array of Token objects. It also reads the source a line at a time, which seems a bit out of date. Maybe two or three decades out of date. Actually, it takes about a day to write a reasonable tokenizer. (That is, if you are writing using a language that you know.) Since I know the problem thoroughly, it seemed like a good starting point for learning Python. There's a tokenizer I wrote in java at http://www.MartinRinehart.com/src/language/Tokenizer.html . Actually, that's an HTML page written by my "javasrc" (parallel to Sun's javadoc) based on the Tokenizer's tokenizing of its own source. Have I got those quotes right? -- http://mail.python.org/mailman/listinfo/python-list