On Fri, Mar 18, 2016, at 12:44, Steven D'Aprano wrote:
> And I don't understand this meme that indexing strings is not important.
> Have people never (say) taken a slice of a string, or a look-ahead, or
> something similar?
>
> i = mystring.find(":")

find is already O(N).

> next_char = mystring[i+1]
>
> # Strip the first and last chars from a string
> mystring[1:-1]

Slicing is already O(N) in the size of the slice... adding O(N) in your
indices (which are =1) isn't a significant addition.

> >> It's not the only drawback, either. If you want to know anything about
> >> the characters in the string that you're looking at, you need to know
> >> their codepoints.
> >
> > Nonsense. That depends on what you want to know about it. You can
> > extract a single character from a string, as a string, without knowing
> > anything about it except what range the first byte is in. You can use
> > this string directly as an index to a hash table containing information
> > such as unicode properties, names, etc.
>
> I don't understand your comment. If I give you the index of the
> character, how do you know where its first byte is?

Er, I thought we were talking about the assertion that you can't do
anything with the character you *already have* the byte index for
without decoding it to a code point.

My point is that you can determine the number of bytes in the character
without decoding it (fully), you only need to look at the first byte.
Especially if all strings are guaranteed to be valid UTF-8.

Look at first byte. First bit is 0, so it's only one byte.
Look at second byte. First three bits are 110, so it's two bytes.
Look at fourth byte. First four bits are 1110, so it's three bytes.
Look at seventh byte. First four bits are 1111, so it's four bytes.

For this, you've never looked at the last four bits of any of those
bytes, or any bits of any of the other bytes.

For iteration, you could simply count how many bytes you encounter whose
first two bits aren't 10, until you reach the desired number. Simpler
algorithm, and works forward and backward. You only need to do what I
mentioned above to extract a character.

My point is, neither process requires you to assemble all the bits into
a complete codepoint.

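To make that concrete, here is a rough sketch of both steps in Python, operating on a bytes object that is assumed to be valid UTF-8. The names char_len, byte_offset and char_at are made up for illustration; none of this is how CPython actually stores or indexes strings.

```python
def char_len(first_byte: int) -> int:
    """Byte length of the UTF-8 character that starts with first_byte.

    Assumes valid UTF-8, so only the leading bits are inspected.
    """
    if first_byte >> 7 == 0b0:       # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if first_byte >> 4 == 0b1111:    # 1111xxxx (always 11110xxx in valid UTF-8)
        return 4
    raise ValueError("continuation byte, not the start of a character")


def byte_offset(data: bytes, index: int) -> int:
    """Byte offset of character number `index` (0-based).

    Counts bytes whose top two bits are not 10, i.e. non-continuation bytes.
    """
    seen = 0
    for pos, b in enumerate(data):
        if b >> 6 != 0b10:
            if seen == index:
                return pos
            seen += 1
    raise IndexError("character index out of range")


def char_at(data: bytes, index: int) -> bytes:
    """Extract one character as bytes, without assembling a code point."""
    start = byte_offset(data, index)
    return data[start:start + char_len(data[start])]
```

With data = "a\u00e9\u20ac\U0001F600z".encode("utf-8"), the characters start at bytes one, two, four and seven (counting from one), the same layout as the walk-through above, and char_at(data, 2) returns the three bytes of the euro sign. The result can be used directly as a dictionary key.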
> With UTF-8, character i can be
> anywhere between byte i and 4*i.

-- 
https://mail.python.org/mailman/listinfo/python-list