Peter Otten wrote:

> I hope you'll let us know how much faster your
> final approach turns out to be

OK, here's a short report on the current state. Such code as there is can be found at <http://svn.thomas-lotze.de/PyASDF/pyasdf/_frankenstring.c>, with a Python mock-up in the same directory.

Thinking about it (Andreas, thank you for the reminder :o)), doing character-by-character scanning in Python is stupid, both in terms of speed and, given some more search capabilities than str currently has, elegance.

So what I have done so far (apart from working my way into writing extensions in C) is give the evolving FrankenString some search methods. They find the first occurrence in the string of any character out of a set of characters given as a string, or of any character not in such a set.

This has nothing to do yet with iterators and seeking/telling. Just letting C do the "while data[index] not in whitespace: index += 1" part speeds up my PDF tokenizer by a factor between 3 and 4. I have never compared that directly to using regular expressions, though...

As a bonus, even with this minor addition the Python code looks a little cleaner already:

    c = data[cursor]
    while c in whitespace:
        # Whitespace tokens.
        cursor += 1
        if c == '%':
            # We're just inside a comment, read beyond EOL.
            while data[cursor] not in "\r\n":
                cursor += 1
            cursor += 1
        c = data[cursor]

becomes

    cursor = data.skipany(whitespace, start)
    c = data[cursor]
    while c == '%':
        # Whitespace tokens: comments till EOL and whitespace.
        cursor = data.skipother("\r\n", cursor)
        cursor = data.skipany(whitespace, cursor)
        c = data[cursor]

(removing '%' from the whitespace string, in case you wonder).

The next thing to do is make FrankenString behave. Right now there's too much copying of string content going on every time a FrankenString is initialized; I'd like it to share string content with other FrankenStrings or strs much like cStringIO does. I hope it's just a matter of learning from cStringIO.
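
For anyone who wants to play with the two search methods without building the C extension, here is a minimal pure-Python sketch of what they do. This is not the mock-up from the repository. The semantics are inferred from the usage shown earlier, and both the return value at end of string (len of the string) and the helper skip_whitespace_and_comments are assumptions made for illustration.

```python
class FrankenString(str):
    """Pure-Python stand-in for the search methods (illustrative only)."""

    def skipany(self, chars, start=0):
        # Skip characters that ARE in `chars`: return the index of the first
        # character at or after `start` that is not in the set.
        # Returning len(self) when nothing is found is an assumption.
        index = start
        end = len(self)
        while index < end and self[index] in chars:
            index += 1
        return index

    def skipother(self, chars, start=0):
        # Skip characters that are NOT in `chars`: return the index of the
        # first character at or after `start` that is in the set.
        index = start
        end = len(self)
        while index < end and self[index] not in chars:
            index += 1
        return index


# PDF whitespace characters, with '%' deliberately left out.
WHITESPACE = "\0\t\n\f\r "


def skip_whitespace_and_comments(data, start=0):
    # Hypothetical helper mirroring the shortened tokenizer loop, with an
    # explicit end-of-data check added so it cannot run off the string.
    cursor = data.skipany(WHITESPACE, start)
    while cursor < len(data) and data[cursor] == "%":
        cursor = data.skipother("\r\n", cursor)
        cursor = data.skipany(WHITESPACE, cursor)
    return cursor
```

Calling skip_whitespace_and_comments(FrankenString("  % comment\r\n  /Name")) should leave the cursor on the "/" that starts the name token. The speed-up reported above comes from running these two loops in C; this version only shows the intended behaviour.
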
To justify the "franken" part of the name some more, I'm considering mixing in yet another ingredient: making the thing behave like a buffer, so that a FrankenString can be created from only part of a string without copying data.

After that, the seeking and telling iterators over characters or search results come in. I don't think they will make much difference in performance now that the stupid character searching is done in C, but they'll hopefully make for more elegant Python code.

-- 
Thomas