On Sep 3, 2019, at 06:17, Rhodri James <[email protected]> wrote:
> 
>> On 03/09/2019 13:31, Chris Angelico wrote:
>>> On Tue, Sep 3, 2019 at 10:27 PM Rhodri James <[email protected]> wrote:
>>> 
>>>> On 31/08/2019 12:31, Chris Angelico wrote:
>>>> We call it a string, but a bytes object has as much in common with
>>>> bytearray and with a list of integers as it does with a text string.
>>> 
>>> You say that as if text strings aren't sequences of bytes.  Complicated
>>> and restricted sequences, I grant you, but no more so than a packet for
>>> a given network protocol.
>>> 
>> A text string is a sequence of characters. By "byte", I really mean
>> "octet", but Python prefers to say "byte".
> 
> And a character is a byte or sequence of bytes. (Odd-sized bytes are pretty 
> much history now, so for non-pedantic usages "byte" is good enough.)

Forget about bytes vs. octets; this still isn’t a useful perspective.

A character is a grapheme cluster, a sequence of one or more code points. A 
code point is an integer between 0 and 0x10FFFF (about 1.1M). A string is a 
flattened sequence of grapheme clusters, that is, a sequence of code points. 
(Python ignores the cluster part, pretending code points are characters, at the 
cost of requiring every application to handle normalization manually. That is 
normally a good tradeoff, but it does mean that you can’t even say whether two 
sequences of code points are the same string without calling a function.)
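
To make the normalization point concrete, here is a minimal sketch using the 
stdlib unicodedata module. The helper name same_text is just something I made 
up for illustration, and NFC is only one of the normalization forms you might 
pick:

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT: two code points

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(composed), len(decomposed))    # 1 2
print(composed == decomposed)            # False: different code point sequences
print(same_text(composed, decomposed))   # True: same text once normalized
```

Both values render as the same character, but == compares code points, so you 
only find out they are the same text by calling a function.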

Meanwhile, there are multiple ways to store those code points as bytes. Python 
does whatever it wants under the covers, hiding it from the user. Obviously 
there is _some_ array of bytes somewhere in memory that represents the 
characters of the string in some way (I say “obviously”, but that isn’t always 
true in Swift, and isn’t even frequently true in Haskell…), but you don’t have 
access to that. If you want a sequence of bytes, you have to ask for a sequence 
in some specific representation, like UTF-8 or UTF-16-BE or Shift-JIS, which it 
creates for you on the fly (albeit cached in a few special cases).
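
As a rough illustration (the sample text is my own choice, nothing more), the 
same three code points come out as three different byte sequences depending on 
which representation you ask for:

```python
text = "日本語"   # three code points

# Ask for the same string in three different byte representations.
encoded = {enc: text.encode(enc) for enc in ("utf-8", "utf-16-be", "shift_jis")}

for enc, data in encoded.items():
    # Each encoding yields different bytes, and possibly a different length.
    print(enc, len(data), data.hex())
    # Decoding with the matching codec recovers the original code points.
    assert data.decode(enc) == text
```

None of these is "the" bytes of the string; each one is built on request.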

So, from your system programmer’s perspective, in what useful sense is a 
character, or a string, a sequence of bytes?

And this is all still ignoring the fact that in Python, all values are “boxed” 
in an opaque structure that you can’t access from within the language, and even 
from the C API of CPython the box structure isn’t part of the API. So even 
something simpler like, say, an int isn’t usefully a sequence of 30-bit digits 
from the system programmer’s perspective; it’s an opaque handle that you can 
pass to functions to _obtain_ a sequence of 30-bit digits. (In the case of 
strings, you have to first pass the opaque handle to one function to see what 
format to ask for, then pass it to another to obtain a sequence of 1-, 2-, or 
4-byte integers representing the code points in native-endian Latin-1, UCS2, or 
UCS4. Normally you don’t do that; you ask for a UTF-8 string or a UTF-32 string 
that may get constructed on the fly. But if you really do want the actual 
storage, this is the way to get it.)
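
You can't see the storage from inside the language, but you can infer a bit 
about it. The sketch below assumes CPython 3.3 or later and leans on 
sys.getsizeof, which reports an implementation detail rather than anything the 
language promises; bytes_per_code_point is a name I invented for the example:

```python
import sys

def bytes_per_code_point(ch: str) -> int:
    """Hypothetical helper: estimate CPython's per-code-point storage width.

    Builds two strings of `ch` that differ by 1000 code points and divides
    the difference in reported size, so the fixed object header cancels out.
    """
    small, large = ch * 1000, ch * 2000
    return (sys.getsizeof(large) - sys.getsizeof(small)) // 1000

# Width of the internal "digits" of an int (30 on typical 64-bit builds).
print(sys.int_info.bits_per_digit)

# ASCII, Latin-1, BMP, and astral samples; CPython picks the width per string.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", bytes_per_code_point(ch))
```

Even here you are only measuring the box from the outside; there is still no 
way to index into those bytes from Python code.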

And most of this is not peculiar to Python. In Swift, a string is a sequence of 
grapheme clusters. In Java, it’s a sequence of UTF-16 code units. In Go, it’s a 
sequence of UTF-8 code units. In Haskell, it’s a lazy linked list of code 
points. And so on. In some of those cases, a character does happen to be 
represented as a string of bytes within a larger representation, but even when 
it is, that still doesn’t mean you can usefully access it that way.
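
A quick way to see how much those definitions disagree, staying in Python and 
using an example string of my own: count the same text in code points, UTF-16 
code units, and UTF-8 code units. (The stdlib has no grapheme-cluster 
segmentation, so the Swift-style count isn't computed here; for this string it 
would be 2.)

```python
s = "e\u0301\U0001F600"   # "e" + COMBINING ACUTE ACCENT + GRINNING FACE

code_points = len(s)                            # what Python's len() counts
utf16_units = len(s.encode("utf-16-le")) // 2   # what a Java String would count
utf8_units = len(s.encode("utf-8"))             # what a Go string would count

print(code_points, utf16_units, utf8_units)     # 3 4 7
```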

Of course a text file on disk is a sequence of bytes, and (if you know the 
encoding and normalization) you could operate directly on those. But you don’t; 
you pass the byte strings to a function that decodes them (and then sometimes 
to a second function that normalizes them into a canonical form) and then use 
your language’s string functions on the result. In fact, you probably don’t 
even do that; you let the file object buffer the byte strings however it wants 
to and just hand you decoded text objects, so you don’t even know which byte 
substrings exist in memory at any given time. (Languages with powerful 
optimizers or macro systems like Haskell or Rust might actually do that by 
translating all your string-function calls into calls directly on the stream of 
bytes, but from your perspective that’s entirely under the covers, and you’re 
doing the same thing you do in Python.)
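
Here is a small sketch of the two routes in Python, assuming the file is UTF-8 
and that NFC is the form you want; the file name and contents are made up for 
the example:

```python
import os
import tempfile
import unicodedata

raw = "cafe\u0301\n".encode("utf-8")   # decomposed "café" as UTF-8 bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Manual route: read the bytes, decode them, then normalize.
    with open(path, "rb") as f:
        manual = unicodedata.normalize("NFC", f.read().decode("utf-8"))

    # Usual route: the text-mode file object buffers and decodes for you,
    # and you never see which byte substrings it held along the way.
    with open(path, encoding="utf-8") as f:
        lines = [unicodedata.normalize("NFC", line) for line in f]

print(manual == "".join(lines))   # both routes end up with the same text
```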

_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/WUOPKW5KCTEJVC6APXRBJYKWVLB5ISHQ/
Code of Conduct: http://python.org/psf/codeofconduct/
