On 2017-01-22 01:44, Steve D'Aprano wrote:
> On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
>
> > but I'm hard-pressed to come up with any use case where direct
> > indexing into a (non-byte)string makes sense unless you've already
> > processed/searched up to that point and can use a recorded index
> > from that processing/search.
>
> Let's take a simple example: you do a find to get an offset, and
> then slice from that offset.
>
> py> text = "αβγдлфxx"
> py> offset = text.find("ф")
Right, so here you've done a (likely linear, but however you get
there) search, which makes it sensible to use this opaque "offset"
token for slicing purposes:

> py> stuff = text[offset:]
> py> assert stuff == "фxx"
>
> That works fine whether indexing refers to code points or bytes.
>
> py> "αβγдлфxx".find("ф")
> 5
> py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
> 10
>
> Either way, you get the expected result. However:
>
> py> stuff = text[offset + 1:]
> py> assert stuff == "xx"
>
> That requires indexes to point to the beginning of *code points*,
> not bytes: taking byte 11 of "αβγдлфxx".encode('utf-8') drops you
> into the middle of the ф representation:
>
> py> "αβγдлфxx".encode('utf-8')[11:]
> b'\x84xx'
>
> and it isn't a valid UTF-8 substring. Slicing would generate an
> exception unless you happened to slice right at the start of a code
> point.

Right. It gets even weirder (edge-case'ier) when dealing with
combining characters:

>>> s = "man\N{COMBINING TILDE}ana"
>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
...
0: m
1: a
2: n
3: ̃
4: a
5: n
6: a
>>> ''.join(reversed(s))
'anãnam'

Slicing at s[3:] produces a (sub)string that begins with a combining
character that has nothing preceding it to combine with.

> It's like seek() and tell() on text files: you cannot seek to
> arbitrary positions, but only to the opaque positions returned by
> tell. That's unacceptable for strings.

I'm still unclear on *why* this would be considered unacceptable for
strings. It makes sense when dealing with byte-strings, since they
contain binary data that may need to get sliced at arbitrary offsets.
But for strings, slicing only makes sense (for every use case I've
been able to come up with) in the context of known offsets, like you
describe with tell(). The cost of not using opaque tell()-like
offsets is, as you describe, slicing in the middle of characters.

> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.

It may be awful, but only because it hasn't been pythonified. If
calling .find() on a string returned a "StringOffset" object, then it
would make sense for its __add__/__radd__ methods to accept an
integer and do that translation for you (rough sketch at the end of
this message).

> You can avoid this by having the interpreter treat the Python-level
> indexes as opaque "code point offsets", and converting them to and
> from "byte offsets" as needed. That's not even very hard. But it
> either turns every indexing into O(N) (since you have to walk the
> string to count which byte represents the nth code point)

The O(N) cost has to be paid at some point, but I'd put forth that
other operations like .find() already pay that O(N) cost and can
return an opaque "offset token" that can subsequently be used for
O(1) indexing (multiple times if needed).

-tkc
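
P.S. Here's a rough, purely illustrative sketch of the sort of
"StringOffset" token I have in mind. The names (StringOffset,
find_offset) and all the details are mine, not anything Python
provides; it only demonstrates the interface. find_offset() below
re-encodes the prefix to get the byte position, but a real
implementation could record it for free while scanning:

class StringOffset:
    """Offset token pairing a code-point index with the matching
    byte index into the UTF-8 encoding of the same string."""

    def __init__(self, text, codepoints, byte_offset):
        self.text = text
        self.codepoints = codepoints   # index in code points
        self.bytes = byte_offset       # index into text.encode('utf-8')

    def __add__(self, n):
        # "offset + 1" means "advance one code point"; the byte index
        # moves by the encoded width of the code points skipped, so
        # you can never land in the middle of a character.
        skipped = self.text[self.codepoints:self.codepoints + n]
        return StringOffset(self.text,
                            self.codepoints + n,
                            self.bytes + len(skipped.encode('utf-8')))

    __radd__ = __add__


def find_offset(text, needle):
    """Like str.find(), but returns a StringOffset (or None).
    The O(N) walk is paid here, once; the token is then reusable."""
    i = text.find(needle)
    if i == -1:
        return None
    return StringOffset(text, i, len(text[:i].encode('utf-8')))


text = "αβγдлфxx"
raw = text.encode('utf-8')

offset = find_offset(text, "ф")
assert text[offset.codepoints:] == "фxx"
assert raw[offset.bytes:].decode('utf-8') == "фxx"

later = offset + 1          # one code point later, however many bytes
assert text[later.codepoints:] == "xx"
assert raw[later.bytes:].decode('utf-8') == "xx"

Whether slicing should accept such a token directly (text[offset:])
is a separate question; the point is just that the byte/code-point
bookkeeping can ride along with the offset instead of being the
caller's problem.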