On Fri, Mar 29, 2013 at 1:51 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 28/03/2013 12:11, Neil Hodgson wrote:
>>
>> Ian Foote:
>>
>>> Specifically, indexing a variable-length encoding like utf-8 is not
>>> as efficient as indexing a fixed-length encoding.
>>
>> Many common string operations do not require indexing by character,
>> which reduces the impact of this inefficiency. UTF-8 seems like a
>> reasonable choice for an internal representation to me. One benefit
>> of UTF-8 over Python's flexible representation is that it is, on
>> average, more compact over a wide set of samples.
>>
> Implementing the regex module (http://pypi.python.org/pypi/regex) would
> have been more difficult if the internal representation had been UTF-8,
> because of the need to decode, and the implementation would also have
> been slower for that reason.
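To make that indexing cost concrete (a minimal sketch of my own, assuming
valid UTF-8 input): finding the Nth character of a UTF-8 buffer means
walking it from the start, because each character occupies one to four
bytes, whereas a fixed-width encoding can jump straight to the right
offset.

def utf8_index(data, n):
    # Sketch only - assumes `data` is valid UTF-8. Each character is
    # 1-4 bytes wide, so locating character n is an O(n) scan.
    pos = 0
    for _ in range(n):
        lead = data[pos]
        if lead < 0x80:      # 1-byte sequence (ASCII)
            pos += 1
        elif lead < 0xE0:    # 2-byte sequence
            pos += 2
        elif lead < 0xF0:    # 3-byte sequence
            pos += 3
        else:                # 4-byte sequence
            pos += 4
    lead = data[pos]
    width = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[pos:pos + width].decode('utf-8')

print(utf8_index("héllo wörld".encode('utf-8'), 7))  # -> ö

With a fixed-width encoding such as UTF-32, the same lookup is a single
O(1) slice: data[4*n:4*n + 4].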
In fact, nearly ALL string parsing operations would need to be done
differently. The only method I can think of that wouldn't be impacted
is a linear state-machine parser - something that could be written
inside a "for character in string" loop. For example (sample input
added so the sketch runs):

string = "Hello, <b>world</b>!"  # sample input, my own invention

text = []

def initial(c):
    # Outside a tag: '<' switches to the tag state; anything else is text.
    global state
    if c == '<':
        state = tag
    else:
        text.append(c)

def tag(c):
    # Inside a tag: swallow characters until the closing '>'.
    global state
    if c == '>':
        state = initial

state = initial
for character in string:
    state(character)
print(''.join(text))

I'm pretty sure this will run in O(N) time, even with UTF-8 strings.
But it's an *extremely* simple parser.
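As a quick sanity check (sample inputs are mine, not measurements):
with string = "Hello, <b>world</b>!" the loop prints "Hello, world!",
and a non-ASCII input like "héllo <i>wörld</i>" comes out as
"héllo wörld" just the same, because the for loop hands each state one
character at a time regardless of how the string is stored internally.

ChrisA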