On Sep 3, 2019, at 06:17, Rhodri James <[email protected]> wrote:
>
>> On 03/09/2019 13:31, Chris Angelico wrote:
>>> On Tue, Sep 3, 2019 at 10:27 PM Rhodri James <[email protected]> wrote:
>>>
>>>> On 31/08/2019 12:31, Chris Angelico wrote:
>>>> We call it a string, but a bytes object has as much in common with
>>>> bytearray and with a list of integers as it does with a text string.
>>>
>>> You say that as if text strings aren't sequences of bytes. Complicated
>>> and restricted sequences, I grant you, but no more so than a packet for
>>> a given network protocol.
>>>
>> A text string is a sequence of characters. By "byte", I really mean
>> "octet", but Python prefers to say "byte".
>
> And a character is a byte or sequence of bytes. (Odd-sized bytes are pretty
> much history now, so for non-pendantic usages "byte" is good enough.)

Forget about bytes vs. octets; this still isn’t a useful perspective.

A character is a grapheme cluster, a sequence of one or more code points. A code point is an integer between 0 and 0x10FFFF (about 1.1M). A string is a flattened sequence of grapheme clusters, that is, a sequence of code points. (Python ignores the cluster part, pretending code points are characters, at the cost of requiring every application to handle normalization manually. Which is normally a good tradeoff, but it does mean that you can’t even say whether two sequences of code points are the same string without calling a function.)

Meanwhile, there are multiple ways to store those code points as bytes. Python does whatever it wants under the covers, hiding it from the user. Obviously there is _some_ array of bytes somewhere in memory that represents the characters of the string in some way (I say “obviously”, but that isn’t always true in Swift, and isn’t even frequently true in Haskell…), but you don’t have access to that. If you want a sequence of bytes, you have to ask for a sequence in some specific representation, like UTF-8 or UTF-16-BE or Shift-JIS, which it creates for you on the fly (albeit cached in a few special cases). So, from your system programmer’s perspective, in what useful sense is a character, or a string, a sequence of bytes?

And this is all still ignoring the fact that in Python, all values are “boxed” in an opaque structure that you can’t access from within the language, and even from the C API of CPython the box structure isn’t part of the API. So even something simpler like, say, an int isn’t usefully a sequence of 30-bit digits from the system programmer’s perspective; it’s an opaque handle that you can pass to functions to _obtain_ a sequence of 30-bit digits.

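To make the code point, normalization, and encoding points above concrete, here is a minimal sketch using only the standard library. The sample strings and variable names are my own choices for illustration, not anything from the thread.

```python
import unicodedata

# Two spellings of "é": one precomposed code point vs. "e" plus a combining accent.
composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# As sequences of code points they differ, even though a reader sees one character.
same_code_points = composed == decomposed
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))

# len() counts code points, not grapheme clusters.
lengths = (len(composed), len(decomposed))

# Bytes only exist once you ask for a specific representation,
# and each representation gives you a different byte sequence.
text = "日本"
encoded = {name: text.encode(name)
           for name in ("utf-8", "utf-16-be", "shift_jis")}

print(same_code_points, same_after_nfc, lengths)
for name, data in encoded.items():
    print(name, data.hex(" "))
```

None of the three byte sequences is "the" bytes of the string; each one is built on request from the same code points.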
(In the case of strings, you have to first pass the opaque handle to one function to see what format to ask for, then pass it to another to obtain a sequence of 1-, 2-, or 4-byte integers representing the code points in native-endian Latin-1 (or pure ASCII), UCS2, or UCS4. Normally you don’t do that; you ask for a UTF-8 string or a UTF-32 string that may get constructed on the fly. But if you really do want the actual storage, this is the way to get it.)

And most of this is not peculiar to Python. In Swift, a string is a sequence of grapheme clusters. In Java, it’s a sequence of UTF-16 code units. In Go, it’s a sequence of UTF-8 code units. In Haskell, it’s a lazy linked list of code points. And so on. In some of those cases, a character does happen to be represented as a string of bytes within a larger representation, but even when it is, that still doesn’t mean you can usefully access it that way.

Of course a text file on disk is a sequence of bytes, and (if you know the encoding and normalization) you could operate directly on those. But you don’t; you pass the byte strings to a function that decodes them (and then sometimes to a second function that normalizes them into a canonical form) and then use your language’s string functions on the result. In fact, you probably don’t even do that; you let the file object buffer the byte strings however it wants to and just hand you decoded text objects, so you don’t even know which byte substrings exist in memory at any given time. (Languages with powerful optimizers or macro systems like Haskell or Rust might actually do that by translating all your string-function calls into calls directly on the stream of bytes, but from your perspective that’s entirely under the covers, and you’re doing the same thing you do in Python.)

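As a rough sketch of that decode-then-normalize pipeline in Python: `read_text_normalized` is a hypothetical helper I made up for illustration, and the temporary file and its contents are likewise just an example.

```python
import os
import tempfile
import unicodedata


def read_text_normalized(path, encoding="utf-8", form="NFC"):
    """Hypothetical helper: decode a file's bytes, then normalize the text."""
    with open(path, "rb") as f:
        raw = f.read()                        # the sequence of bytes on disk
    text = raw.decode(encoding)               # step 1: bytes -> code points
    return unicodedata.normalize(form, text)  # step 2: canonical form


# Write "café" with a decomposed "é" (e + combining acute) as UTF-8 bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("cafe\u0301".encode("utf-8"))

# Doing the two steps by hand.
manual = read_text_normalized(path)

# The usual route: let the file object buffer and decode however it likes.
with open(path, encoding="utf-8") as f:
    buffered = unicodedata.normalize("NFC", f.read())

os.remove(path)
print(manual == buffered, len(manual))
```

Either way, the string functions only ever see decoded, normalized text; which byte substrings were sitting in a buffer at any moment never shows up in the program.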
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/[email protected]/message/WUOPKW5KCTEJVC6APXRBJYKWVLB5ISHQ/
Code of Conduct: http://python.org/psf/codeofconduct/
