On 7/15/2018 7:37 AM, Marko Rauhamaa wrote:
> One of the classic Unix and Internet tenets is that text is bytes is
> text.
Tenets of a faith may be wrong ;-). An informatic paradigm from more
than 45 years ago may be outdated and in need of revision.
On byte storage and on the Internet, **everything** is (encoded) bytes,
so saying 'text is bytes' says nothing because it is trivially true. On
the other hand, 'bytes is text' is wrong unless one uses a character
encoding that assigns a visible character (including <space>) to every
byte. I believe both PCs and Macs had 1 or more such encodings. (I am
only uncertain as to whether b'\x00' was mapped.)
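As a rough illustration in Python (a sketch, not a claim about the old PC or Mac code pages): the 'latin-1' codec accepts all 256 byte values, but not every resulting character is visible. The helper name below is made up for this example.

```python
def latin1_unprintable_bytes():
    """Return the byte values whose Latin-1 character is not printable.

    Hypothetical helper for illustration only; 'printable' here means
    whatever str.isprintable() reports, which is an assumption about
    what counts as a visible character.
    """
    all_bytes = bytes(range(256))
    # Python's 'latin-1' codec maps byte n to code point n, so this never fails.
    text = all_bytes.decode("latin-1")
    assert len(text) == 256
    assert text.encode("latin-1") == all_bytes  # lossless round trip
    return [b for b, ch in zip(all_bytes, text) if not ch.isprintable()]


unprintable = latin1_unprintable_bytes()
print(len(unprintable), "of 256 byte values decode to non-printable characters")
```

So every byte decodes, but control characters such as b'\x00' still do not become visible text under this particular encoding.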
Images are bytes as much as text is. I suggest that 'bytes is image' is
more true than 'bytes is text'. Every byte can be mapped, for instance,
into an 8 x 1 or 1 x 8 pixel image after deciding which end gets the
high and low bits. Bit mapping is likely older than Unix. Bar codes
and QR codes are commonplace as international machine-readable images of
bytes.
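A minimal sketch of the 8 x 1 mapping in Python; the function names and the '#'/'.' rendering are assumptions made for this example only.

```python
def byte_to_pixels(value, msb_first=True):
    """Map one byte (0..255) to a row of 8 pixels, each 0 or 1.

    msb_first chooses which end of the row gets the high bit.
    """
    if not 0 <= value <= 255:
        raise ValueError("not a byte value")
    bits = [(value >> shift) & 1 for shift in range(7, -1, -1)]  # high bit first
    return bits if msb_first else bits[::-1]


def render(value, msb_first=True):
    """Draw the pixel row as text: '#' for a set bit, '.' for a clear bit."""
    return "".join("#" if bit else "." for bit in byte_to_pixels(value, msb_first))


for b in b"Hi":
    print(render(b))
```

Every one of the 256 byte values yields a distinct picture, and no byte is rejected.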
In a context where 'everything is bytes', the proper converses are
'bytes is everything' or 'bytes can be anything'.
> Of course, much of it was naïve, but UTF-8 has miraculously given
> it a new life.
UTF-8 makes 'bytes is text' even less true. Not only are some leading
bytes not text, but some byte sequences are illegal. Bytes are not
UTF-8 text. As n increases, the probability that a string of n random
bytes is valid UTF-8 text approaches 0, and it does so faster than the
probability that the same bytes are valid Latin-1 text.
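This can be checked for small n by brute force. The sketch below assumes Python's strict 'utf-8' and 'latin-1' codecs as the definition of "valid"; the function names are invented for the example. Python's 'latin-1' accepts every byte, so its count never drops.

```python
from itertools import product


def decodes(data, encoding):
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def count_valid(n, encoding):
    """Count how many of the 256**n byte strings of length n decode cleanly.

    Exhaustive, so only practical for n = 1 or 2.
    """
    return sum(decodes(bytes(combo), encoding)
               for combo in product(range(256), repeat=n))


for n in (1, 2):
    total = 256 ** n
    print(n, count_valid(n, "utf-8") / total, count_valid(n, "latin-1") / total)
```

For example, decodes(b'\xff', 'utf-8') and decodes(b'\x80', 'utf-8') are both False: the first byte never occurs in UTF-8, and the second is a continuation byte with nothing to continue.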
--
Terry Jan Reedy