On Sun, 8 May 2022 at 07:19, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
>
> MRAB <pyt...@mrabarnett.plus.com> writes:
> >On 2022-05-07 19:47, Stefan Ram wrote:
> ...
> >>def encoding( name ):
> >>    path = pathlib.Path( name )
> >>    for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>        try:
> >>            with path.open( encoding=encoding, errors="strict" )as file:
> >>                text = file.read()
> >>            return encoding
> >>        except UnicodeDecodeError:
> >>            pass
> >>    return "ascii"
> >>Yes, it's potentially slow and might be wrong.
> >>The result "ascii" might mean it's a binary file.
> >"latin-1" will decode any sequence of bytes, so it'll never try
> >"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >anyway because the file could contain 0x80..0xFF, which aren't supported
> >by that encoding.
>
>   Thank you! It's working for my specific application where
>   I'm reading from a collection of text files that should be
>   encoded in either utf_8, latin_1, or ascii.
>

In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII file will decode
correctly as UTF-8, and any file will decode as Latin-1.

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
ideal, but it's about as good as you'll get with a lot of US-based
servers.

(Depending on context, you might use CP-1252 instead of Latin-1, but
you might need errors="replace" there, since Windows-1252 leaves some
byte values undefined: 0x81, 0x8D, 0x8F, 0x90 and 0x9D in Python's
cp1252 codec.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
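
For reference, here is a minimal sketch of the "decode as UTF-8 if possible, otherwise decode as Latin-1" approach described above, applied one line at a time. The function names decode_with_fallback and decode_lines and the sample data are made up for illustration, and it assumes a newline is a usable unit boundary in your stream, which may not hold for every protocol.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode raw as UTF-8 if it is valid UTF-8, otherwise as Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00..0xFF to a code point,
        # so this second attempt cannot raise.
        return raw.decode("latin-1")


def decode_lines(stream: bytes) -> list[str]:
    """Treat each line as one unit and decode it independently.

    Assumption: b"\\n" separates units in this (hypothetical) stream.
    """
    return [decode_with_fallback(line) for line in stream.split(b"\n")]


# Hypothetical mixed stream: the same word, once as UTF-8, once as Latin-1.
mixed = "caf\u00e9".encode("utf-8") + b"\n" + "caf\u00e9".encode("latin-1")
print(decode_lines(mixed))
```

Pure ASCII input needs no special case, because it takes the UTF-8 path. If CP-1252 suits your data better, the fallback line could instead be raw.decode("cp1252", errors="replace"), which substitutes U+FFFD for the undefined byte values rather than raising.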