On Sat, 19 Mar 2016 02:31 am, Random832 wrote:

> On Fri, Mar 18, 2016, at 11:17, Ian Kelly wrote:
>> > Just to play devil's advocate, here, why is it so bad for indexing to
>> > be O(n)? Some simple caching is all that's needed to prevent it from
>> > making iteration O(n^2), if that's what you're worried about.
>>
>> What kind of caching do you have in mind?
>
> The byte index of the last character index accessed.
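If I've understood the suggestion correctly, it amounts to something like
the rough sketch below. (UTF8String is a made-up toy class, purely for
illustration -- it is not how CPython strings actually work, and a real
implementation would presumably live in C.)

# Toy sketch only: a UTF-8 string type that caches the byte offset of
# the last code point indexed. Negative indices, slices and bounds
# checks are left out. (Python 3 syntax.)

class UTF8String:
    def __init__(self, text):
        self._buf = text.encode('utf-8')
        self._cached_index = 0  # character index of the last access
        self._cached_byte = 0   # byte offset where that character starts

    @staticmethod
    def _width(first_byte):
        # Number of bytes in the code point starting with first_byte.
        if first_byte < 0x80:
            return 1
        elif first_byte < 0xE0:
            return 2
        elif first_byte < 0xF0:
            return 3
        else:
            return 4

    def __getitem__(self, index):
        # Resume the scan from the cache when moving forwards,
        # otherwise start again from byte 0.
        if index >= self._cached_index:
            char_pos, byte_pos = self._cached_index, self._cached_byte
        else:
            char_pos, byte_pos = 0, 0
        while char_pos < index:
            byte_pos += self._width(self._buf[byte_pos])
            char_pos += 1
        self._cached_index, self._cached_byte = char_pos, byte_pos
        width = self._width(self._buf[byte_pos])
        return self._buf[byte_pos:byte_pos + width].decode('utf-8')

Note that there is a single cache shared by everything that touches the
string, which is where the trouble starts.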
Some people, when faced with a problem, think, I know, I'll use a cache.
Now they have 99 problems (but a bitch ain't one).

> When accessing a
> new index greater than that one, start from there (_maybe_ also support
> iterating backwards in this way, if accessing an index that is much
> closer to the cached one than to zero). And even that's only necessary
> if you actually _care_ about forward iteration by character indices
> (i.e. "for i in range(len(s))"), rather than writing it off as bad
> coding style.

Without locking, this kills thread safety for strings. With locking, it
probably kills performance. Worse, even in single-threaded code, functions
can mess up the cache, destroying any usefulness it may have:

x = mystring[9999]   # sets the cache to index 9999
spam(mystring)       # calls mystring[0], setting the cache back to 0
y = mystring[10000]

And I don't understand this meme that indexing strings is not important.
Have people never (say) taken a slice of a string, or a look-ahead, or
something similar?

i = mystring.find(":")
next_char = mystring[i+1]

# Strip the first and last chars from a string
mystring[1:-1]

Perhaps you might argue that for some applications, O(N) string operations
don't matter. After all, in Linux, terminals are usually set to UTF-8, and
people can copy and paste substrings out of the buffer, etc. So there may
be a case for saying that for applications with line-oriented buffers
typically containing 70 or 170 characters per line, like a terminal, UTF-8
is perfectly adequate. I have no argument against that. But for a
programming language which may be dealing with strings tens of megabytes
in size, I'm skeptical that UTF-8 wouldn't cause a performance hit for at
least some operations.

>> It's not the only drawback, either. If you want to know anything about
>> the characters in the string that you're looking at, you need to know
>> their codepoints.
>
> Nonsense. That depends on what you want to know about it. You can
> extract a single character from a string, as a string, without knowing
> anything about it except what range the first byte is in. You can use
> this string directly as an index to a hash table containing information
> such as unicode properties, names, etc.

I don't understand your comment. If I give you the index of the character,
how do you know where its first byte is? With UTF-8, character i can start
anywhere between byte i and byte 4*i. (And by character I actually mean
code point -- let's not get into arguments about normalisation.)

>> If the string is simple UCS-2, that's easy.

Hmmm, well, nobody uses UCS-2 any more, since that only covers the first
65536 code points. Rather, languages like Javascript and Java, and the
Windows OS, use UTF-16, which is a *variable width* extension to UCS-2.

I don't know about Windows, but Javascript implements this badly, so that
4-byte UTF-16 code points are treated as *two* surrogate code points
instead of the single code point they are meant to be. We can get the same
result in Python 2.7 narrow builds:

py> s = u'\U0010FFFF'  # definitely a single code point
py> len(s)  # WTF?
2
py> s[0]  # a surrogate
u'\udbff'
py> s[1]  # another surrogate
u'\udfff'

The only fixed-width encoding of the entire Unicode character set is
UTF-32.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list