Chris Angelico <ros...@gmail.com> writes:

> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.
If it's pure ASCII, you can use the bytes or bytearray type.

> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.

I don't understand how this is supposed to work. You're going to read a
large Unicode text file (let's say it's UTF-8) into a single big string?
Then the runtime has to scan the encoded contents to find the widest
codepoint (say the file is mostly ASCII but has a few characters outside
the BMP), expand the whole thing (in this case) to UCS-4, giving 4x
memory bloat and requiring a decode of all the UTF-8 regardless -- and
now we're supposed to worry about the efficiency of skipping n
characters? Since decoding those n characters is unavoidable, I'd think
the skipping itself only matters if you do it a great many times.
--
http://mail.python.org/mailman/listinfo/python-list
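To illustrate the bytes/bytearray suggestion above: with pure-ASCII
input, "skip a literal of n characters" is plain index arithmetic on a
bytes object, with no decoding step at all. The wire format here is
made up for the sketch:

```python
# Hypothetical format for illustration: a 3-byte tag, a 2-digit ASCII
# length field, an n-byte literal we don't care about, then the rest.
data = b"HDR05helloREST"

pos = 3                       # skip the 3-byte tag
n = int(data[pos:pos + 2])    # read the 2-digit length field (here 5)
pos += 2
pos += n                      # skip the n-byte literal: pure arithmetic
rest = data[pos:]             # keep parsing from here

print(rest)                   # b'REST'
```

Since bytes objects are fixed-width by definition, the skip costs the
same whether n is 5 or 5 million.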