On Thu, Jun 6, 2013 at 4:22 PM, Nobody <nob...@nowhere.com> wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you stop parsing and go back to the
>> top, decoding in the new way.
>
> Provided that the meta tag indicates an ASCII-compatible encoding, and you
> haven't encountered any decode errors due to 8-bit characters, then
> there's no need to go back to the top.

Technically and conceptually, you go back to the start and re-parse.
Sure, you might optimize that if you can, but not every parser will,
hence it's advisable to put the content-type as early as possible.

>> "ASCII-compatible" covers a huge number of
>> encodings, so it's not actually much of a problem to do this.
>
> With slight modifications, you can also handle some
> almost-ASCII-compatible encodings such as shift-JIS.
>
> Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
> have actually been seen, and only re-start parsing from the top if the
> encoding change actually affects the interpretation of any of those bytes.

Hrm, it'd be equally valid to guess UTF-8. But as long as you're
prepared to re-parse after finding the content-type, that's just a
choice of optimization and has no real impact.

ChrisA
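
For illustration only, here is a rough Python sketch of the "assume ASCII, find the meta tag, go back to the top" approach discussed above. It is not how any particular browser or parser does it. The function name `decode_html`, the `header_charset` and `provisional` parameters, and the `META_CHARSET` regex are all made up for this example, and the regex is a crude stand-in for real HTML parsing.

```python
import codecs
import re

# Hypothetical pattern: finds "charset=..." inside a <meta ...> tag.
# It covers both <meta charset="..."> and the http-equiv form, but it
# is a crude stand-in for real HTML parsing.
META_CHARSET = re.compile(
    r'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def decode_html(raw, header_charset=None, provisional="ascii"):
    """Return (text, encoding_used) for a bytes document."""
    # Out-of-band information (the HTTP header) wins when present.
    if header_charset is not None:
        return raw.decode(header_charset, errors="replace"), header_charset

    # First pass: decode with the provisional guess. Bytes that do not
    # fit the guess become U+FFFD, which is harmless for locating the tag.
    first_pass = raw.decode(provisional, errors="replace")
    match = META_CHARSET.search(first_pass)
    if match is None:
        return first_pass, provisional

    declared = match.group(1)
    try:
        codecs.lookup(declared)  # raises LookupError for unknown names
    except LookupError:
        return first_pass, provisional

    # Conceptually "go back to the top": decode all of the bytes again
    # using the declared encoding.
    return raw.decode(declared, errors="replace"), declared
```

Passing `provisional="iso-8859-1"` or `provisional="utf-8"` changes only the first-pass guess. As long as the declared encoding is ASCII-compatible, the meta tag is still found and the final result is the same, which is the sense in which the initial guess is just an optimization choice.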