On 8/27/2017 1:31 PM, Peter Otten wrote:
> Here's a simple example that extracts titles from generated HTML. It seems
> to work. Does it resemble what you do?
Your example is similar to my code when I'm using a list for the input
to the parser. You have soup_threads and write_threads, but no read_threads.
The particular website I'm scraping requires checking each page for the
sentinel value (i.e., "Sorry, no more comments") in order to determine
when to stop requesting pages. For my comment history, that means parsing
~750 pages to collect ~11,000 comments.
I have 20 read_threads requesting pages and putting them into an output
queue, which is also the input_queue for the parser. My soup_threads can get
items from that queue, but BeautifulSoup doesn't do anything after that.
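
For reference, the arrangement described above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
actual scraper: fetch_page(), PageCounter, LAST_PAGE and the thread counts
are hypothetical, the pages are generated locally instead of requested over
HTTP, and the standard library's HTMLParser stands in for BeautifulSoup so
the sketch has no third-party dependency.

```python
import queue
import threading
from html.parser import HTMLParser

SENTINEL_TEXT = "Sorry, no more comments"
LAST_PAGE = 7      # hypothetical: the fake site has 7 pages of comments
NUM_READERS = 4    # the post uses 20; kept small here
NUM_PARSERS = 2


def fetch_page(n):
    """Hypothetical stand-in for the real HTTP request."""
    if n > LAST_PAGE:
        return "<html><body><p>%s</p></body></html>" % SENTINEL_TEXT
    return "<html><head><title>page %d</title></head></html>" % n


class PageCounter:
    """Hands out page numbers so that no two readers request the same page."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1

    def claim(self):
        with self._lock:
            n = self._next
            self._next += 1
            return n


class TitleParser(HTMLParser):
    """Collects the text inside <title>; BeautifulSoup would go here."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def reader(counter, stop, pages):
    # Keep claiming pages until some reader has seen the sentinel text.
    while not stop.is_set():
        n = counter.claim()
        html = fetch_page(n)
        if SENTINEL_TEXT in html:
            stop.set()
            break
        pages.put((n, html))


def parser_worker(pages, titles):
    # Consume pages until a None marker says that no more are coming.
    while True:
        item = pages.get()
        if item is None:
            break
        n, html = item
        parser = TitleParser()
        parser.feed(html)
        parser.close()
        titles.put((n, parser.title))


def run():
    pages, titles = queue.Queue(), queue.Queue()
    stop, counter = threading.Event(), PageCounter()
    readers = [threading.Thread(target=reader, args=(counter, stop, pages))
               for _ in range(NUM_READERS)]
    parsers = [threading.Thread(target=parser_worker, args=(pages, titles))
               for _ in range(NUM_PARSERS)]
    for t in readers + parsers:
        t.start()
    for t in readers:
        t.join()
    # Every reader has finished, so one None per parser ends the second stage.
    for _ in parsers:
        pages.put(None)
    for t in parsers:
        t.join()
    results = []
    while not titles.empty():
        results.append(titles.get())
    return sorted(results)


if __name__ == "__main__":
    for n, title in run():
        print(n, title)
```

Two design points in this sketch: readers that are already past the last
page each see the sentinel and simply stop, so a few extra requests are
made but no real page is lost; and the parser threads are shut down with
None markers only after all readers have been joined, so they never exit
while pages are still being queued.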
Chris R.
--
https://mail.python.org/mailman/listinfo/python-list