2009/7/5 Hendrik van Rooyen <m...@microcorp.co.za>:
> I cannot see how you could avoid a python function call - even if he
> bites the bullet and implements my laborious scheme, he would still
> have to fetch the next character to test against, inside the current state.
>
> So if it is the function calls that is slowing him down, I cannot
> imagine a solution using less than one per character, in which
> case he is screwed no matter what he does.

A simple solution may be to read the whole input HTML file into a
string. This potentially requires lots of memory, but I suspect that by
far the most common use case for this parser is building a DOM (or
DOM-like) tree of the whole document. Such a tree usually requires much
more memory than the HTML source itself.

So, if the code duplication is acceptable, I suggest keeping this
implementation for cases where the input is extremely big *AND* the
whole program, not just the parser itself, works on it in a streaming
fashion. Then write a simpler and faster parser for the more common
case, where the data is not huge *OR* the user will keep the whole
document in memory anyway (e.g. in a tree).

Also: profile, profile a lot. HTML pages are very strange beasts and
the bottlenecks may be in innocent-looking places!

-- 
Lino Mastrodomenico
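
P.S. A rough sketch of what I mean, not a drop-in replacement for the
parser under discussion. It assumes the whole document is already in one
string, so a compiled regular expression can scan it in C instead of the
Python code making one function call per character. TOKEN_RE, tokenize()
and profile_tokenize() are names I made up for illustration, and the
pattern is deliberately naive: it ignores comments, CDATA, scripts and
'>' inside attribute values.

```python
import cProfile
import io
import pstats
import re

# Hypothetical, deliberately naive pattern: either a tag or a run of text.
# Real-world HTML needs far more care than this.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def tokenize(html):
    """Yield ('tag' | 'text', value) pairs from a string held fully in memory."""
    for match in TOKEN_RE.finditer(html):
        value = match.group()
        kind = "tag" if value.startswith("<") else "text"
        yield kind, value


def profile_tokenize(html):
    """Run tokenize() under cProfile; return (tokens, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    tokens = list(tokenize(html))
    profiler.disable()
    out = io.StringIO()
    # Show the five most expensive entries by cumulative time.
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return tokens, out.getvalue()


if __name__ == "__main__":
    tokens, report = profile_tokenize("<p>Hello <b>world</b></p>")
    print(tokens)
    print(report)
```

With this shape the Python-level work is roughly one loop iteration per
token rather than per character. Whether that actually helps on real
pages is exactly what the profiler output should tell you.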