On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:

> but I'm hard-pressed to come up with any use case where direct
> indexing into a (non-byte)string makes sense unless you've already
> processed/searched up to that point and can use a recorded index
> from that processing/search.
Let's take a simple example: you do a find to get an offset, and then
slice from that offset.

py> text = "αβγдлфxx"
py> offset = text.find("ф")
py> stuff = text[offset:]
py> assert stuff == "фxx"

That works fine whether indexing refers to code points or bytes:

py> "αβγдлфxx".find("ф")
5
py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
10

Either way, you get the expected result. However:

py> stuff = text[offset + 1:]
py> assert stuff == "xx"

That requires indexes to point to the beginning of *code points*, not
bytes: taking byte 11 of "αβγдлфxx".encode('utf-8') drops you into the
middle of the two-byte representation of ф:

py> "αβγдлфxx".encode('utf-8')[11:]
b'\x84xx'

and that isn't a valid UTF-8 substring. Slicing would have to raise an
exception unless you happened to slice right at the start of a code
point. It's like seek() and tell() on text files: you cannot seek to
arbitrary positions, but only to the opaque positions returned by
tell(). That's unacceptable for strings.

You could avoid that error by increasing the offset by the right
amount:

stuff = text[offset + len("ф".encode('utf-8')):]

which is awful. I believe that's what Go and Julia expect you to do.
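A minimal sketch of that byte-arithmetic style, using Python bytes
objects to stand in for Go/Julia strings (the helper skip_char is my
own name, not something either language actually provides):

def skip_char(data, offset):
    # Return the byte offset just past the UTF-8 character starting
    # at `offset`. Assumes `offset` is the start of a character, not
    # a continuation byte.
    first = data[offset]
    if first < 0x80:
        return offset + 1   # one-byte (ASCII) character
    elif first < 0xE0:
        return offset + 2   # two-byte character (0b110xxxxx lead byte)
    elif first < 0xF0:
        return offset + 3   # three-byte character (0b1110xxxx lead byte)
    else:
        return offset + 4   # four-byte character (0b11110xxx lead byte)

data = "αβγдлфxx".encode('utf-8')
offset = data.find("ф".encode('utf-8'))   # 10
assert data[skip_char(data, offset):] == b'xx'

So every "move forward one character" becomes a function call that
inspects the bytes, and forgetting to call it is a silent bug.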
Another solution would be to have the string slicing method
automatically scan forward to the start of the next valid UTF-8 code
point. That would be the "Do What I Mean" solution.
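The scan itself is easy enough to write, which is part of what makes
it tempting. A rough sketch (mine, not taken from any real
implementation):

def dwim_slice(data, start):
    # Scan forward past UTF-8 continuation bytes (0b10xxxxxx) until
    # we reach the start of the next code point, then slice there.
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    return data[start:]

data = "αβγдлфxx".encode('utf-8')
# ф occupies bytes 10 and 11, so offsets 11 and 12 give the same slice:
assert dwim_slice(data, 11) == dwim_slice(data, 12) == b'xx'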
The problem with the DWIM solution is that not only does it add
complexity, but it's frankly *weird*. It would mean:

- if the character at position `offset` fits in 2 bytes:

  text[offset+1:] == text[offset+2:]

- if it fits in 3 bytes:

  text[offset+1:] == text[offset+2:] == text[offset+3:]

- and if it fits in 4 bytes:

  text[offset+1:] == text[offset+2:] == text[offset+3:] == text[offset+4:]

Having the string slicing method Do What I Mean would actually be The
Wrong Thing: it would make slicing awful to reason about.

You can avoid all of this by having the interpreter treat the
Python-level indexes as opaque "code point offsets" and converting
them to and from "byte offsets" as needed. That's not even very hard.
But it either turns every indexing operation into O(N) (since you
have to walk the string to count which byte represents the nth code
point), or it requires keeping an auxiliary table with every string.
The table lets you convert between byte offsets and code point
offsets quickly, but it significantly increases the memory size of
every string, blowing out the advantage of using UTF-8 in the first
place.
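To make the O(N) cost concrete, here's a sketch (assumptions mine) of
the walk needed to turn the nth code point offset into a byte offset:

def byte_offset(data, n):
    # Walk the bytes, counting lead bytes (anything that is not a
    # 0b10xxxxxx continuation byte), until we reach code point n.
    count = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if count == n:
                return i
            count += 1
    return len(data)

data = "αβγдлфxx".encode('utf-8')
assert byte_offset(data, 5) == 10   # ф is code point 5, byte 10

The auxiliary-table alternative amounts to (roughly) precomputing
[i for i, b in enumerate(data) if (b & 0xC0) != 0x80] once per
string: lookups become O(1), but you're storing an extra integer for
every character, which is precisely the memory overhead that using
UTF-8 was supposed to avoid.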
-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and
sure enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list