Chris Angelico <ros...@gmail.com>:

> On Thu, Mar 30, 2017 at 4:43 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> The input is not in my control, and bailing out may not be an option:
>>
>>    $ echo $'aa\n\xdd\naa' | grep aa
>>    aa
>>    aa
>>    $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
>>    $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
>>    Traceback (most recent call last):
>>      File "<string>", line 1, in <module>
>>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0:
>>    invalid continuation byte
>>
>> Note that "grep" is also locale-aware.
>
> So what exactly does byte value 0xDD mean in your stream?
>
> And if you say "it doesn't matter", then why are you assigning meaning
> to byte value 0x0A in your first example? Truly binary data doesn't
> give any meaning to 0x0A.

What I'm saying is that every program must behave in a minimally
controlled manner regardless of its inputs (which are not in its
control). With UTF-8, it is dangerously easy to write programs that
explode surprisingly. What's more, resyncing after such exceptions is
not at all easy. I would venture to guess that few Python programs even
try to do that.


Marko
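
For illustration only, here is a minimal sketch of one way a Python 3 program could stay in control when fed the bytes from the quoted example. It assumes the "surrogateescape" error handler (PEP 383) is acceptable for the application. The names grep_stream, to_bytes and sample are hypothetical, made up for this sketch, and not taken from any program discussed in the thread.

```python
import io


def grep_stream(binary_stream, needle):
    """Yield the lines of a binary stream that contain `needle`.

    Bytes that are not valid UTF-8 do not raise; the "surrogateescape"
    handler maps each of them to a lone surrogate code point instead.
    """
    text = io.TextIOWrapper(
        binary_stream, encoding="utf-8", errors="surrogateescape"
    )
    for line in text:
        if needle in line:
            yield line.rstrip("\n")


def to_bytes(line):
    """Turn a decoded line back into the exact bytes it came from."""
    return line.encode("utf-8", errors="surrogateescape")


# Same bytes as in the quoted shell example: "aa", a stray 0xDD, "aa".
sample = b"aa\n\xdd\naa\n"

matches = list(grep_stream(io.BytesIO(sample), "aa"))
print(matches)

# The strict default is what produced the quoted traceback.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed at byte offset", exc.start)
```

With real input, the binary stream would be sys.stdin.buffer rather than an io.BytesIO object. Escaped bytes survive a round trip through to_bytes, so the program can pass through data it does not understand instead of exploding on it. Whether that is the right policy still depends on what the stray bytes are supposed to mean.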