On Mon, 9 May 2022 at 04:15, Barry Scott <ba...@barrys-emacs.org> wrote:
>
> On 7 May 2022, at 22:31, Chris Angelico <ros...@gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
> >>
> >> MRAB <pyt...@mrabarnett.plus.com> writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> >>>> def encoding( name ):
> >>>>     path = pathlib.Path( name )
> >>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>>         try:
> >>>>             with path.open( encoding=encoding, errors="strict" )as file:
> >>>>                 text = file.read()
> >>>>             return encoding
> >>>>         except UnicodeDecodeError:
> >>>>             pass
> >>>>     return "ascii"
> >>>> Yes, it's potentially slow and might be wrong.
> >>>> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >> Thank you! It's working for my specific application where
> >> I'm reading from a collection of text files that should be
> >> encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files and especially web
> pages that claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and if it fails fall
> back to CP-1252.
>
> Its usually the left and "smart" quote chars that cause the issue as
> they code as an invalid utf-8.
>

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some
sort of straight-up lie in the form of a meta tag. It's annoying. But
the same logic still applies: attempt one decode (UTF-8) and if it
fails, there's one fallback. Fairly simple.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
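
For illustration, the "attempt one decode (UTF-8) and if it fails, there's
one fallback" logic discussed in this thread could be sketched roughly as
below. This is only a sketch, not code posted by anyone in the thread: the
function names decode_with_fallback and decode_lines are made up here, and
the choice of a line as the "unit" is an assumption.

```python
def decode_with_fallback(data: bytes, fallback: str = "latin_1") -> str:
    """Decode as UTF-8 if possible, otherwise use one fallback encoding.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode("utf_8")  # strict by default
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a character, so it cannot fail.
        # CP-1252 leaves a few byte values undefined, so errors="replace"
        # turns those into U+FFFD instead of raising.
        return data.decode(fallback, errors="replace")


def decode_lines(raw: bytes, fallback: str = "latin_1") -> list[str]:
    """Decode a byte stream one line at a time (assumed unit: a line).

    Each line gets its own UTF-8 attempt, so a stream that mixes
    encodings is handled line by line.
    """
    return [decode_with_fallback(line, fallback) for line in raw.split(b"\n")]


if __name__ == "__main__":
    # One UTF-8 line followed by one Latin-1 line in the same stream.
    mixed = "na\u00efve".encode("utf_8") + b"\n" + "caf\u00e9".encode("latin_1")
    print(decode_lines(mixed))

    # CP-1252 "smart" quotes are not valid UTF-8, so the fallback is used.
    print(decode_with_fallback(b"\x93quoted\x94", fallback="cp1252"))
```

Reading the data as bytes and decoding per unit keeps one badly encoded
line from affecting its neighbours. Note that the fallback always
"succeeds", so a binary file will also come back as text.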