Marc 'BlackJack' Rintsch wrote:
> In <[EMAIL PROTECTED]>, Thomas Ploch wrote:
>
>> Alright, my prof said '... to process documents written in structural
>> markup languages using regular expressions is a no-no.' (Because of
>> nested Elements? Can't remember) So I think he wants us to use regexes
>> to learn them. He is pointing to HTMLParser though.
>
> Problem is that much of the HTML in the wild is written in a structured
> markup language but it's in many cases broken. If you just search some
> words or patterns that appear somewhere in the documents then regular
> expressions are good enough. If you want to actually *parse* HTML "from
> the wild" better use the BeautifulSoup_ parser.
>
> .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
Yes, I know about BeautifulSoup. But as I said, it should be done with
regexes. I want to extract tags and their attributes as a dictionary of
name/value pairs. I know that most of the HTML out there is *not*
validated and is bollocks.

This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        # A tag or attribute name: a letter or underscore, then name characters.
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        # An opening tag: its name, then everything up to the next '>'.
        self.tagPattern = re.compile(
            "<(?P<name>%s)(?P<attr>[^>]*)>" % self.namePattern)
        # One name="value" or name='value' pair inside a tag.
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

>> You are probably right. For me it boils down to these problems:
>> - Implementing a stack for large queues of documents which is faster
>> than list.pop(index) (Is there a lib for this?)
>
> If you need a queue then use one: take a look at `collections.deque` or
> the `Queue` module in the standard library.

Which of the two would you recommend for handling large queues with fast
response times?

Thomas
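
P.S. To make the intent concrete, here is a minimal sketch of how the two
patterns above could be combined to get each tag's attributes as a
dictionary. The module-level names NAME, TAG and ATTR, the helper
iter_tags() and the sample string are made up for illustration and are
not part of the Tags class. Like the patterns themselves, it only picks
up quoted attribute values and ignores closing tags.

```python
import re

# Same patterns as in the Tags class, written as module-level constants
# (these names are hypothetical, chosen only for this sketch).
NAME = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME)
ATTR = re.compile(
    r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')" % NAME)


def iter_tags(source):
    """Yield (tag name, attribute dict) for every opening tag found."""
    for tag in TAG.finditer(source):
        attrs = {}
        # Scan only the attribute part captured by the tag pattern.
        for attr in ATTR.finditer(tag.group("attr")):
            # Drop the surrounding quote characters from the value.
            attrs[attr.group("attrName")] = attr.group("value")[1:-1]
        yield tag.group("name"), attrs


sample = '<a href="http://example.org/" class=\'ext\'>link</a><br>'
tags = list(iter_tags(sample))
for name, attrs in tags:
    print(name, attrs)
```

Unquoted values (border=0) and a '>' inside a quoted value are not
handled, which is the kind of broken or unusual markup mentioned above.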