On Fri, Mar 29, 2013 at 1:51 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 28/03/2013 12:11, Neil Hodgson wrote:
>>
>> Ian Foote:
>>
>>> Specifically, indexing a variable-length encoding like utf-8 is not
>>> as efficient as indexing a fixed-length encoding.
>>
>> Many common string operations do not require indexing by character,
>> which reduces the impact of this inefficiency. UTF-8 seems like a
>> reasonable choice for an internal representation to me. One benefit
>> of UTF-8 over Python's flexible representation is that it is, on
>> average, more compact over a wide set of samples.
>>
> Implementing the regex module (http://pypi.python.org/pypi/regex) would
> have been more difficult if the internal representation had been UTF-8,
> because of the need to decode, and the implementation would also have
> been slower for that reason.
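To make that indexing cost concrete (a minimal sketch of my own, assuming
valid UTF-8 input): finding the Nth character of a UTF-8 buffer means
walking it from the start, because each character occupies one to four
bytes, whereas a fixed-width encoding can jump straight to the right
offset.

def utf8_index(data, n):
    # Sketch only - assumes `data` is valid UTF-8. Each character is
    # 1-4 bytes wide, so locating character n is an O(n) scan.
    pos = 0
    for _ in range(n):
        lead = data[pos]
        if lead < 0x80:      # 1-byte sequence (ASCII)
            pos += 1
        elif lead < 0xE0:    # 2-byte sequence
            pos += 2
        elif lead < 0xF0:    # 3-byte sequence
            pos += 3
        else:                # 4-byte sequence
            pos += 4
    lead = data[pos]
    width = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[pos:pos + width].decode('utf-8')

print(utf8_index("héllo wörld".encode('utf-8'), 7))  # -> ö

With a fixed-width encoding such as UTF-32, the same lookup is a single
O(1) slice: data[4*n:4*n + 4].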
In fact, nearly ALL string parsing operations would need to be done
differently. The only method I can think of that wouldn't be impacted
is a linear state-machine parser - something that could be written
inside a "for character in string" loop. For example (sample input
added so the sketch runs):

string = "Hello, <b>world</b>!"  # sample input, my own invention

text = []

def initial(c):
    # Outside a tag: '<' switches to the tag state; anything else is text.
    global state
    if c == '<':
        state = tag
    else:
        text.append(c)

def tag(c):
    # Inside a tag: swallow characters until the closing '>'.
    global state
    if c == '>':
        state = initial

state = initial
for character in string:
    state(character)
print(''.join(text))

I'm pretty sure this will run in O(N) time, even with UTF-8 strings.
But it's an *extremely* simple parser.
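As a quick sanity check (sample inputs are mine, not measurements):
with string = "Hello, <b>world</b>!" the loop prints "Hello, world!",
and a non-ASCII input like "héllo <i>wörld</i>" comes out as
"héllo wörld" just the same, because the for loop hands each state one
character at a time regardless of how the string is stored internally.

ChrisA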