Hello all,

I'm handling some text files where I don't (necessarily) know the encoding beforehand. Because I use regular expressions to parse the text, I *must* decode UTF16 encoded text (otherwise the regexes split on byte boundaries).
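
To show what I mean about byte boundaries, here is a minimal sketch. It assumes Python 3 syntax, and the sample string is made up purely for illustration:

```python
import re

# Made-up sample text, encoded the way a UTF16 file would arrive from disk.
text = "alpha beta\r\ngamma"
raw = text.encode("utf-16")  # BOM followed by two bytes per character

# On the raw bytes, every ASCII letter is separated from the next by a NUL
# byte, so a byte-level \w+ can never match a whole word.
byte_words = re.findall(rb"\w+", raw)

# Decoding first gives the regex real characters to work with.
decoded = raw.decode("utf-16")  # the generic utf-16 codec consumes the BOM
words = re.findall(r"\w+", decoded)
```

Here `words` comes out as the three whole words, while `byte_words` is only a list of single-byte fragments.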

I can recognise a UTF8 BOM and remove it (but not necessarily decode the text). For UTF16 it seems that the Python codec will automatically remove the BOM. Having detected it (to trigger a decode), is it considered *invalid* to remove it? The codec certainly handles the text without a BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM, I have to read in binary mode. Similarly, I have to write in binary mode. How should I handle line endings for UTF16? Is it possible that other programs (on Windows) will have line endings as u'\r\n'? When saving files for that platform, should I make the line endings u'\r\n'? (This sequence obviously encodes to four bytes in UTF16.) I would only do this to ensure compatibility with other programs the user may use to create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
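
P.S. For concreteness, here is a rough sketch of the kind of handling I have in mind, not a finished implementation. It assumes Python 3 syntax. The names `sniff_and_decode` and `encode_for_windows` are made up for illustration, and the latin-1 fallback is just a placeholder assumption for files without a BOM. It ignores UTF-32, whose little-endian BOM starts with the same two bytes as the UTF16 one. Note that the explicit "utf-16-le" and "utf-16-be" codecs neither strip nor add a BOM, so the sketch does both by hand.

```python
import codecs

# BOMs this sketch recognises, paired with a codec that leaves the BOM alone.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def sniff_and_decode(data, fallback="latin-1"):
    """Return (text, encoding) for bytes read in binary mode.

    encoding is None when no BOM was found and the fallback
    (an assumption, not a detection) was used instead.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM by hand, then decode the remainder.
            return data[len(bom):].decode(encoding), encoding
    return data.decode(fallback), None

def encode_for_windows(text, encoding="utf-16-le", bom=codecs.BOM_UTF16_LE):
    """Encode text with '\\r\\n' line endings and a leading BOM."""
    # Normalise to '\n' first so existing '\r\n' pairs are not doubled.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return bom + normalised.replace("\n", "\r\n").encode(encoding)
```

The result of `encode_for_windows` is bytes, so it would be written in binary mode, which keeps the platform from translating the line endings a second time.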