On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman <petef4+use...@gmail.com> wrote: > I was asserting that most useful operations on strings start from index > 0. The r* operations would not be slowed down that much as UTF-8 has the > useful property that attempting to interpret from a byte that is not at > the start of a sequence (in the sense of a code point rather than > Python) is invalid and so quick to move over while working backwards > from the end.
Let's take one very common example: decoding JSON. A ton of web servers out there will call json.loads() on user-supplied data. The bulk of the work is in the scanner, which steps through the string and does the actual parsing. That function is implemented in Python, so it's a good example. (There is a C accelerator, but we can ignore that and look at the pure Python one.) So, how could you implement this function? The current implementation maintains an index - an integer position through the string. It repeatedly requests the next character as string[idx], and can also slice the string (to check for keywords like "true") or use a regex (to check for numbers). Everything's clean, but it's lots of indexing. Alternatively, it could remove and discard characters as they're consumed. It would maintain a string that consists of all the unparsed characters. All indexing would be at or near zero, but after every tiny piece of parsing, the string would get sliced. With immutable UTF-8 strings, both of these would be O(n^2). Either indexing is linear, so parsing the tail of the string means scanning repeatedly; or slicing is linear, so parsing the head of the string means slicing all the rest away. The only way for it to be fast enough would be to have some sort of retainable string iterator, which means exposing an opaque "position marker" that serves no purpose other than parsing. Every string parse operation would have to be reimplemented this way, lest it perform abysmally on large strings. It'd mean some sort of magic "thing" that probably has a reference to the original string, so you don't get the progressive RAM refunds that slicing gives, and you'd still have to deal with lots of the other consequences. It's probably doable, but it would be a lot of pain. ChrisA -- https://mail.python.org/mailman/listinfo/python-list