Chris Angelico writes:

> On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote:
>> I was asserting that most useful operations on strings start from
>> index 0. The r* operations would not be slowed down that much as
>> UTF-8 has the useful property that attempting to interpret from a
>> byte that is not at the start of a sequence (in the sense of a code
>> point rather than Python) is invalid and so quick to move over while
>> working backwards from the end.
>
> Let's take one very common example: decoding JSON. A ton of web
> servers out there will call json.loads() on user-supplied data. The
> bulk of the work is in the scanner, which steps through the string and
> does the actual parsing. That function is implemented in Python, so
> it's a good example. (There is a C accelerator, but we can ignore that
> and look at the pure Python one.)
>
> So, how could you implement this function? The current implementation
> maintains an index - an integer position through the string. It
> repeatedly requests the next character as string[idx], and can also
> slice the string (to check for keywords like "true") or use a regex
> (to check for numbers). Everything's clean, but it's lots of indexing.
> Alternatively, it could remove and discard characters as they're
> consumed. It would maintain a string that consists of all the unparsed
> characters. All indexing would be at or near zero, but after every
> tiny piece of parsing, the string would get sliced.
>
> With immutable UTF-8 strings, both of these would be O(n^2). Either
> indexing is linear, so parsing the tail of the string means scanning
> repeatedly; or slicing is linear, so parsing the head of the string
> means slicing all the rest away.
>
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string parse
> operation would have to be reimplemented this way, lest it perform
> abysmally on large strings.
> It'd mean some sort of magic "thing" that probably has a reference to
> the original string, so you don't get the progressive RAM refunds that
> slicing gives, and you'd still have to deal with lots of the other
> consequences. It's probably doable, but it would be a lot of pain.

Julia does this. It has immutable UTF-8 strings, and there is a JSON
parser. The "opaque position marker" is just the byte index; an attempt
to use an invalid index throws an error. A substring type points into an
underlying string, and an iterator called graphemes even returns
substrings corresponding to what people would consider a single
character.

I offer Julia as evidence: my impression is that its UTF-8-based system
works and is not a pain. I once wrote a toy function to access the last
line of a large memory-mapped text file, so that is the extent of my
personal experience with it so far. Incidentally, can Python memory-map
a UTF-8 file as a string?

http://docs.julialang.org/en/stable/manual/strings/
https://github.com/JuliaIO/JSON.jl

--
https://mail.python.org/mailman/listinfo/python-list
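To make the quoted cost argument concrete, here is a toy Python sketch of the two scanner designs Chris describes, using a trivial whitespace tokenizer rather than real JSON (the function names are invented for this sketch, not from the json module). Note CPython's str has O(1) indexing, so on CPython only the slice-and-discard variant actually copies; on a hypothetical byte-indexed immutable string both would degrade to O(n^2).

```python
# Two scanner designs for the same job: an integer cursor vs.
# discarding consumed characters. Toy example, not the json module.

def tokens_by_index(s):
    """Keep an integer cursor; never copy the unparsed tail."""
    out, idx, n = [], 0, len(s)
    while idx < n:
        while idx < n and s[idx] == " ":   # skip separators
            idx += 1
        start = idx
        while idx < n and s[idx] != " ":   # consume one token
            idx += 1
        if start < idx:
            out.append(s[start:idx])
    return out

def tokens_by_slicing(s):
    """Indexing stays at or near zero, but every step re-slices
    the remaining tail, copying it: O(n^2) overall."""
    out = []
    while s:
        s = s.lstrip(" ")
        cut = s.find(" ")
        if cut == -1:
            if s:
                out.append(s)
            break
        out.append(s[:cut])
        s = s[cut:]
    return out

text = "true  false   null 42"
assert tokens_by_index(text) == tokens_by_slicing(text) \
    == ["true", "false", "null", "42"]
```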
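The UTF-8 property Pete mentions can also be shown directly: continuation bytes are exactly those of the form 0b10xxxxxx, so from any byte offset you can step backwards to the start of the enclosing code point. This is what lets a plain byte index serve as Julia's "opaque position marker". A minimal sketch over a Python bytes object (the helper name is mine):

```python
def codepoint_start(buf: bytes, i: int) -> int:
    """Back up from byte offset i to the first byte of its code point.
    UTF-8 continuation bytes match 0b10xxxxxx, i.e. (b & 0xC0) == 0x80."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "héllo".encode("utf-8")       # b'h\xc3\xa9llo' -- é is two bytes
assert codepoint_start(data, 2) == 1  # offset 2 is inside é; it starts at 1
assert codepoint_start(data, 1) == 1  # already at a code point start
assert codepoint_start(data, 0) == 0
```

This is why working backwards from the end of a UTF-8 string is cheap: invalid interior positions are recognizable from a single byte, with no need to rescan from index 0.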
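On the memory-mapping question: Python's mmap exposes the file as a bytes-like object rather than a str, but for UTF-8 you can search for byte landmarks (such as b"\n") without decoding anything, then decode only the slice you need. A sketch of the last-line exercise under that approach:

```python
import mmap
import os
import tempfile

# Write a small UTF-8 file to map (stand-in for a large one).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write("first line\nsecond line\nlast liné\n".encode("utf-8"))
    path = f.name

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # Ignore a trailing newline, find the previous one, decode the tail.
    end = len(mm) - 1 if mm[-1:] == b"\n" else len(mm)
    start = mm.rfind(b"\n", 0, end) + 1
    last_line = mm[start:end].decode("utf-8")

os.remove(path)
assert last_line == "last liné"
```

So the answer seems to be "not as a str directly", but byte offsets plus on-demand decoding get quite close to the Julia workflow described above.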