Marc 'BlackJack' Rintsch schrieb:
> In <[EMAIL PROTECTED]>, Thomas Ploch wrote:
>
>> This is how my regexes look like:
>>
>> import re
>>
>> class Tags:
>>     def __init__(self, sourceText):
>>         self.source = sourceText
>>         self.curPos = 0
>>         self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
>>         self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
>>                                      % self.namePattern)
>>         self.attrPattern = re.compile(
>>             r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
>>             % self.namePattern)
>
> Have you tested this with tags inside comments?
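One way to act on that point is to delete comments before the tag regex ever runs. A minimal sketch (the `commentPattern` and `strip_comments` names are illustrative, not part of the original class):

```python
import re

# Hypothetical helper: remove HTML comments up front, so commented-out
# tags never reach the tag/attribute regexes. DOTALL lets the comment
# span multiple lines; the non-greedy .*? stops at the first "-->".
commentPattern = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(source):
    return commentPattern.sub("", source)

html = 'before <!-- <a href="x">hidden</a> --> <b>kept</b>'
print(strip_comments(html))  # the commented-out <a> tag is gone
```

This is only a sketch; it does not handle comment-like text inside `<script>` blocks or malformed comments.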
No, but I already see your point that it will parse _all_ tags, even if
they are commented out. I am thinking about how to solve this. Probably
I will just take the chunks between comments and feed them to the
regular expressions.

>>>> You are probably right. For me it boils down to these problems:
>>>> - Implementing a stack for large queues of documents which is
>>>>   faster than list.pop(index) (Is there a lib for this?)
>>>
>>> If you need a queue then use one: take a look at `collections.deque`
>>> or the `Queue` module in the standard library.
>>
>> Which of the two would you recommend for handling large queues with
>> fast response times?
>
> `Queue.Queue` builds on `collections.deque` and is thread safe.
> Speed-wise I don't think this makes a difference, as most of the time
> is spent on IO and parsing. So if you make your spider multi-threaded
> to gain some speed, go with `Queue.Queue`.

I think I will go for collections.deque (since I have no intention of
making it multi-threaded) and have several queues, one for each server
in a list, so that one server is actually finished before being
directed to the next one (Is this a good approach?).

Thanks a lot,
Thomas
--
http://mail.python.org/mailman/listinfo/python-list
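The one-deque-per-server idea above could be sketched like this (server names and URLs are made up for illustration; the real spider would fill the deques while crawling):

```python
from collections import deque

# Hypothetical per-server queues: each server gets its own deque, and one
# server is drained completely before moving on to the next.
queues = {
    "server-a.example": deque(["/index.html", "/about.html"]),
    "server-b.example": deque(["/start.html"]),
}

visited = []
for server, urls in queues.items():
    while urls:
        url = urls.popleft()  # O(1) at both ends, unlike list.pop(0)
        visited.append((server, url))
```

`deque.popleft()` avoids the O(n) shift that `list.pop(0)` performs, which is the speed concern raised in the quoted exchange; new URLs discovered during parsing would be pushed with `append()` on the matching server's deque.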