On Fri, Mar 18, 2016 at 6:37 AM, Chris Angelico <ros...@gmail.com> wrote: > On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano <st...@pearwood.info> wrote: >> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For >> instance, your UTF-8 string might consist of an array of bytes containing >> the string, plus an array of indexes to the start of each code point. For >> example, the string: >> >> “abcπßЊ•𒀁” >> >> (including the quote marks) is 10 code points in length and 22 bytes as >> UTF-8. Grouping the (hex) bytes for each code point, we have: >> >> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d >> >> so we could get a O(1) UTF-8 string by recording the bytes (in hex) plus the >> indexes (in decimal) in which each code point starts: >> >> e2809c616263cf80c39fd08ae280a2f0928081e2809d >> >> 0 3 4 5 6 8 10 12 15 19 >> >> but (assuming each index needs 2 bytes, which supports strings up to 65535 >> characters in length), that's actually LESS memory efficient than UTF-32: >> 42 bytes versus 40. > > A lot of strings will have no more than 255 non-ASCII characters in > them. (For example, all strings which no more than 255 total > characters.) You could store, instead of the indexes themselves, a > series of one-byte offsets: > > e2809c616263cf80c39fd08ae280a2f0928081e2809d > 0 2 2 2 2 3 4 5 7 10 > > Locating a byte based on its character position is still O(1); you > look up that position in the offset table, add that to your original > character position, and you have the byte location. For strings with > too many non-ASCII codepoints, you'd need some other representation, > but at that point, it might be worth just switching to UTF-32.
So this uses approximately twice as much memory as the FSR and still requires switching on some form of character width in the implementation? Yeah, I don't think the RUE is going to go for that. 8-) -- https://mail.python.org/mailman/listinfo/python-list