On 8/27/2017 1:31 PM, Peter Otten wrote:

> Here's a simple example that extracts titles from generated html. It seems
> to work. Does it resemble what you do?

Your example is similar to my code when I'm using a list for the input to the parser. You have soup_threads and write_threads, but no read_threads.

The particular website I'm scraping requires checking each page for the sentinel value (i.e., "Sorry, no more comments") to determine when to stop requesting pages. For my comment history, that means requesting ~750 pages to parse ~11,000 comments.
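
A minimal sketch of that stop condition, for illustration only. It is not the code from this thread: `fetch_until_sentinel`, `fake_fetch`, and `max_pages` are hypothetical names, and `fake_fetch` stands in for the real HTTP request so the example runs offline.

```python
SENTINEL = "Sorry, no more comments"


def fetch_until_sentinel(fetch_page, sentinel=SENTINEL, max_pages=10_000):
    """Collect page bodies in order until a page contains the sentinel text."""
    pages = []
    # max_pages is a safety cap (an assumption) so a missing sentinel
    # cannot turn this into an endless loop.
    for page_number in range(1, max_pages + 1):
        html = fetch_page(page_number)
        if sentinel in html:
            break  # the site reports there is nothing left to request
        pages.append(html)
    return pages


def fake_fetch(page_number):
    # Hypothetical stand-in for the real request: three pages of
    # comments, then the sentinel page.
    if page_number <= 3:
        return f"<html><p>comment page {page_number}</p></html>"
    return f"<html><p>{SENTINEL}</p></html>"


pages = fetch_until_sentinel(fake_fetch)
```

The loop is sequential on purpose: the number of pages is unknown up front, so the sentinel check is the only thing that ends the requests.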

I have 20 read_threads requesting pages and putting them into the output queue, which is the input_queue for the parser. My soup_threads can get items from the queue, but BeautifulSoup doesn't do anything after that.
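
For comparison, here is a hedged sketch of one way to wire reader threads to parser threads through a single queue, with the sentinel page ending the run. It is not the code from this thread and does not claim to reproduce the problem described. `run_pipeline`, `fake_fetch`, `TitleParser`, and `LAST_PAGE` are hypothetical, and the standard library's `html.parser.HTMLParser` is swapped in for BeautifulSoup so the example runs without third-party packages.

```python
import queue
import threading
from html.parser import HTMLParser

SENTINEL = "Sorry, no more comments"
LAST_PAGE = 25  # made-up size of the fake site


def fake_fetch(page_number):
    # Hypothetical stand-in for the real HTTP request.
    if page_number > LAST_PAGE:
        return f"<html><body>{SENTINEL}</body></html>"
    return f"<html><head><title>page {page_number}</title></head></html>"


class TitleParser(HTMLParser):
    """Collects the text inside <title>; stands in for BeautifulSoup."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def run_pipeline(fetch_page, n_readers=4, n_parsers=2):
    page_queue = queue.Queue()  # readers' output, parsers' input
    titles = []
    titles_lock = threading.Lock()
    counter_lock = threading.Lock()
    stop = threading.Event()
    next_page = [1]  # shared page counter, guarded by counter_lock

    def reader():
        while not stop.is_set():
            with counter_lock:
                page_number = next_page[0]
                next_page[0] += 1
            html = fetch_page(page_number)
            if SENTINEL in html:
                stop.set()  # tell every reader to stop requesting pages
            else:
                page_queue.put((page_number, html))

    def parser():
        while True:
            item = page_queue.get()
            if item is None:  # shutdown marker, one per parser thread
                break
            page_number, html = item
            title_parser = TitleParser()
            title_parser.feed(html)
            with titles_lock:
                titles.append((page_number, title_parser.title))

    readers = [threading.Thread(target=reader) for _ in range(n_readers)]
    parsers = [threading.Thread(target=parser) for _ in range(n_parsers)]
    for thread in readers + parsers:
        thread.start()
    for thread in readers:
        thread.join()
    for _ in parsers:  # all pages are queued; now release the parsers
        page_queue.put(None)
    for thread in parsers:
        thread.join()
    return sorted(titles)


titles = run_pipeline(fake_fetch)
```

The `None` markers are queued only after every reader has finished, so a parser never shuts down while pages are still arriving. Readers may request a few pages past the last real one before they see the stop event; those responses contain the sentinel and are discarded.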

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
