Christopher Reimer <christopher_rei...@icloud.com> writes:

> I'm developing a web scraper script. It takes 25 minutes to process
> 590 pages and ~9,000 comments. I've been told that the script is
> taking too long.
>
> The way the script currently works is that the page requester is a
> generator function that requests a page, checks if the page contains
> the sentinel text (i.e., "Sorry, no more comments."), and either
> yields the page and requests the next page, or exits the function.
> Every yielded page is parsed by Beautiful Soup and saved to disk.
>
> Timing the page requester separately from the rest of the script, with
> the end value set to 590, each page request takes 1.5 seconds.
That's very slow to fetch a page.

> If I use a thread pool of 16 threads, each request takes 0.1
> seconds. (Higher thread numbers will result in the server forcibly
> closing the connection.)
>
> I'm trying to figure out how I would find the sentinel text by using a
> thread pool. Seems like I need to request an arbitrary number of pages
> (perhaps one page per thread), evaluate the contents of each page for
> the sentinel text, and either request another set of pages or exit the
> function.

If your 590 pages are linked together (so that you must fetch a page to
learn which page comes next) and page fetching is the limiting factor,
that limits how much can be parallelized. But if processing a fetched
page takes a significant amount of time compared with fetching it, you
could use a work queue as follows: a page is fetched and the following
page is determined; if there is a following page, processing of the
current page is put as a job into the work queue and fetching continues
with the next page. Idle worker tasks take jobs from the work queue and
process them.
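To make that concrete, here is a minimal sketch of such a work queue
using the standard library's queue and threading modules. The URL
scheme, fetch_page() and parse_and_save() are placeholders for your own
requester and Beautiful Soup code, not your actual script; the fetch
loop stays sequential (each page tells us whether there is a following
one), while parsing and saving to disk happen in worker threads.

import queue
import threading

import requests                 # placeholder; your fetch code may differ
from bs4 import BeautifulSoup

SENTINEL_TEXT = "Sorry, no more comments."
NUM_WORKERS = 4                 # processing threads; tune as needed
STOP = object()                 # marker telling the workers to exit

jobs = queue.Queue()

def fetch_page(page_number):
    """Fetch one page; the URL scheme here is made up."""
    response = requests.get(
        "https://example.com/comments", params={"page": page_number})
    response.raise_for_status()
    return response.text

def parse_and_save(html):
    """The processing step: parse with Beautiful Soup, save to disk."""
    soup = BeautifulSoup(html, "html.parser")
    # ... extract the comments from `soup` and write them out ...

def worker():
    """Take fetched pages off the queue and process them."""
    while True:
        html = jobs.get()
        if html is STOP:
            break
        parse_and_save(html)

def main():
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    # The fetcher never waits for parsing: it queues each page and
    # immediately moves on to request the next one.
    page_number = 1
    while True:
        html = fetch_page(page_number)
        if SENTINEL_TEXT in html:
            break
        jobs.put(html)
        page_number += 1

    for _ in threads:
        jobs.put(STOP)
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()

Note that this only helps if the parsing and disk writes account for a
noticeable fraction of the 1.5 seconds per page; if nearly all of that
time is the HTTP round trip, the gain will be small.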