Re: UTF16, BOM, and Windows Line endings

2006-02-07 Thread Fuzzyman
Neil Hodgson wrote:
> Fuzzyman:
>
> > Thanks - so I need to decode to unicode and *then* split on line
> > endings. Problem is, that means I can't use Python to handle line
> > endings where I don't know the encoding in advance.
> >
> > In another thread I've posted a small function that *guesses*

Re: UTF16, BOM, and Windows Line endings

2006-02-06 Thread Neil Hodgson
Fuzzyman:
> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.

You

Re: UTF16, BOM, and Windows Line endings

2006-02-06 Thread Fuzzyman
Neil Hodgson wrote:
> Fuzzyman:
>
> > How should I handle line-endings for UTF16 ? Is it possible that other
> > programs (on windows) will have line endings as u'\r\n' ?
>
> Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
> >>> contents = open("C:\\fuzzy.txt", "

Re: UTF16, BOM, and Windows Line endings

2006-02-06 Thread Neil Hodgson
Fuzzyman:
> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines

>>> contents = open("C:\\fuzzy.txt", "rb").read()
>>> contents
'\xff\xfeF\x
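The point of the Notepad experiment can be reproduced without a file. The following is only a sketch in present-day Python 3 syntax (the thread itself predates it); the byte string `raw` is a hypothetical stand-in built in memory, not the actual contents of `C:\fuzzy.txt`. It shows why splitting the undecoded bytes on a newline byte goes wrong, and that decoding first gives clean lines.

```python
import codecs

# Hypothetical stand-in for a file saved from Notepad as "Unicode":
# a little-endian BOM followed by UTF-16-LE text with CRLF line endings.
text = "Fuzzy\r\nEnd of lines\r\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting the raw bytes on b"\n" cuts each two-byte code unit in half:
# the newline is stored as b"\n\x00", so the stray b"\x00" ends up at the
# start of the following piece.
broken = raw.split(b"\n")

# Decoding first avoids this. The "utf-16" codec reads the BOM to choose
# the byte order and drops it from the result.
decoded = raw.decode("utf-16")
lines = decoded.splitlines()  # treats "\r\n", "\r" and "\n" as line breaks

print(broken[1][:3])
print(lines)
```

Once the text is decoded, `str.splitlines()` copes with the `\r\n` endings directly, so no separate line-ending guess is needed for this case.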

UTF16, BOM, and Windows Line endings

2006-02-06 Thread Fuzzyman
Hello all,

I'm handling some text files where I don't (necessarily) know the encoding beforehand. Because I use regular expressions to parse the text I *must* decode UTF16 encoded text (otherwise the regexes split on byte boundaries). I can recognise UTF8 and BOM and remove (but not necessarily d
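One way to approach the problem described here is to sniff the BOM and decode before any regular expression runs. This is a minimal sketch in Python 3, not the function from the thread: the names `decode_with_bom` and `_BOMS` are hypothetical, only UTF-8 and UTF-16 BOMs are checked (UTF-32 is ignored), and falling back to UTF-8 when no BOM is present is an assumption that will not suit every file.

```python
import codecs
import re

# BOMs this sketch recognises, paired with the codec for the bytes after them.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(raw, default="utf-8"):
    """Return (text, encoding) for raw bytes, using a leading BOM if present.

    Without a BOM the bytes are decoded with `default`, which is a guess.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM so it does not appear in the decoded text.
            return raw[len(bom):].decode(encoding), encoding
    return raw.decode(default), default

# Usage: decode first, then let the regex work on characters, not bytes.
sample = codecs.BOM_UTF16_LE + "key = value\r\nother = 1\r\n".encode("utf-16-le")
text, encoding = decode_with_bom(sample)
fields = re.findall(r"^(\w+) = (\w+)\r?$", text, flags=re.MULTILINE)
print(encoding, fields)
```

Because the pattern is applied to decoded text, it matches whole characters and never lands between the two bytes of a UTF-16 code unit.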