On Jun 6, 12:14 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Paulo da Silva wrote:
> > On 06-06-2010 00:41, Chris Rebert wrote:
> >> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
> >> <psdasilva.nos...@netcabonospam.pt> wrote:
> > ...
>
> >> Specify the encoding of the text when opening the file using the
> >> `encoding` parameter. For Windows-1252 for example:
>
> >> your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>
> > OK! This fixes my current problem. I used encoding="iso-8859-15". This
> > is how my text files are encoded.
> > But what about a more general case where the encoding of the text file
> > is unknown? Is there anything like "autodetect"?
>
> An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
> How could you tell which was the correct encoding?
>
> Well, if the file contained words in a certain language and some of the
> characters were wrong, then you'd know that the encoding was wrong. This
> does imply, though, that you'd need to know what the language should
> look like!
>
> You could try different encodings, and for each one try to identify what
> could be words, then look them up in dictionaries for various languages
> to see whether they are real words...
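
To make the "try different encodings and look the words up" idea concrete, here is a rough sketch using only the standard library. The word list, the candidate list, and the function names (`KNOWN_WORDS`, `CANDIDATES`, `score`, `rank_encodings`) are all made up for illustration; a real attempt would need proper dictionaries for each language you expect.

```python
import re

# Hypothetical, tiny stand-in for a real Portuguese dictionary.
KNOWN_WORDS = {"não", "está", "amanhã"}

# Candidate single-byte encodings to try (an assumption; extend as needed).
CANDIDATES = ["cp1252", "cp1250"]


def score(text, words=KNOWN_WORDS):
    """Return the fraction of word-like tokens found in the word list."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)


def rank_encodings(raw, candidates=CANDIDATES):
    """Decode raw bytes with each candidate and sort by score, best first."""
    results = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # Some byte values are undefined in some code pages; skip those.
            continue
        results.append((score(text), name, text))
    return sorted(results, reverse=True)


# Bytes as they would sit in a cp1252-encoded file.
raw = "não está".encode("cp1252")
for value, name, text in rank_encodings(raw):
    print(name, value, text)
```

Both code pages decode these bytes without raising an error, which is exactly the problem: byte 0xE3 is "ã" in cp1252 but "ă" in cp1250, so only the word lookup tells the two apart. With the toy word list above, cp1252 ranks first.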
This has been automated (semi-successfully, with caveats) by the chardet package ... see http://chardet.feedparser.org/
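
For reference, a minimal sketch of how chardet is typically called. It is a third-party package, so the import is guarded here; `guess_encoding` is a hypothetical helper name, not part of chardet. `chardet.detect()` takes bytes and returns a dict that includes `encoding` and `confidence` keys, and the answer is only a statistical guess, so treat a low confidence (or `None`) as "don't know".

```python
try:
    import chardet  # third-party; may not be installed
except ImportError:
    chardet = None


def guess_encoding(raw):
    """Return (encoding, confidence) as guessed by chardet.

    Returns (None, 0.0) when chardet is unavailable. The result is a
    guess, not a guarantee, especially for short inputs.
    """
    if chardet is None:
        return None, 0.0
    result = chardet.detect(raw)
    return result["encoding"], result["confidence"]


# Read the file as bytes first, then decode once an encoding is chosen.
sample = "Não sei qual é a codificação deste ficheiro.".encode("iso-8859-15")
encoding, confidence = guess_encoding(sample)
print(encoding, confidence)
```

In practice you would open the file in binary mode ('rb'), pass the bytes to the helper, and fall back to an encoding you already know (iso-8859-15 in the case above) when the guess looks unreliable.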