On 14/06/18 19:32, Daniel Bosah wrote:
I am trying to modify code from a web crawler to scrape for keywords from
certain websites. However, I'm trying to run the web crawler before I
modify it, and I'm running into issues.
When I ran this code -
import threading
from Queue import Queue
from spider import Spider
from domain import get_domain_name
from general import file_to_set

PROJECT_NAME = "SPIDER"
HOME_PAGE = "https://www.cracked.com/"
DOMAIN_NAME = get_domain_name(HOME_PAGE)
QUEUE_FILE = '/home/me/research/queue.txt'
CRAWLED_FILE = '/home/me/research/crawled.txt'
NUMBER_OF_THREADS = 1
# Capitalize variables and make them class variables to make them const variables

threadqueue = Queue()

Spider(PROJECT_NAME, HOME_PAGE, DOMAIN_NAME)

def crawl():
    change = file_to_set(QUEUE_FILE)
    if len(change) > 0:
        print str(len(change)) + ' links in the queue'
        create_jobs()

def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        threadqueue.put(link)  # .put = put item into the queue
    threadqueue.join()
    crawl()

def create_spiders():
    for _ in range(NUMBER_OF_THREADS):  # _ basically if you don't want to act on the iterable
        vari = threading.Thread(target=work)
        vari.daemon = True  # makes sure that it dies when main exits
        vari.start()

#def regex():
#    for i in files_to_set(CRAWLED_FILE):
#        reg(i, LISTS)  # MAKE FUNCTION FOR REGEX; i is urls, LISTS is a list or set of keywords

def work():
    while True:
        url = threadqueue.get()  # pops item off queue
        Spider.crawl_pages(threading.current_thread().name, url)
        threadqueue.task_done()

create_spiders()
crawl()
That used this class:
from HTMLParser import HTMLParser
from urlparse import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()  # stores the links

    def error(self, message):
        pass

    def handle_starttag(self, tag, attrs):
        if tag == 'a':  # means a link
            for (attribute, value) in attrs:
                if attribute == 'href':  # href relative url, i.e. not having www
                    url = urlparse.urljoin(self.base_url, value)
                    self.links.add(url)

    def return_links(self):
        return self.links()
It's very unpythonic to define getters like return_links; just access
self.links directly. (As written it's also broken: return self.links()
tries to call the set, which raises TypeError.)
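For example, a minimal sketch of the direct-access style (it assumes the
__init__ problems in LinkFinder have been fixed first; html_text and
page_url are placeholder names, not from your code):

finder = LinkFinder(HOME_PAGE, page_url)
finder.feed(html_text)     # HTMLParser collects hrefs into finder.links
for url in finder.links:   # plain attribute access, no getter needed
    print url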
And this spider class:
from urllib import urlopen  # connects to webpages from python
from link_finder import LinkFinder
from general import directory, text_maker, file_to_set, conversion_to_set

class Spider():
    project_name = 'Reader'
    base_url = ''
    Queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.Queue_file = '/home/me/research/queue.txt'
        Spider.crawled_file = '/home/me/research/crawled.txt'
        self.boot()
        self.crawl_pages('Spider 1 ', base_url)
It strikes me as completely pointless to define this class when every
variable is at the class level and every method is defined as a static
method. Python isn't Java :)
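Here's a minimal sketch of the instance-attribute version (the constructor
signature is my assumption; the paths come from your code). Each spider
then owns its own state instead of mutating Spider.* globals:

class Spider(object):
    def __init__(self, project_name, base_url, domain_name,
                 queue_file='/home/me/research/queue.txt',
                 crawled_file='/home/me/research/crawled.txt'):
        self.project_name = project_name  # per-instance, not class-wide
        self.base_url = base_url
        self.domain_name = domain_name
        self.queue_file = queue_file
        self.crawled_file = crawled_file
        self.queue = set()    # each instance gets its own sets
        self.crawled = set()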
[code snipped]
and these functions:
from urlparse import urlparse

# get subdomain name (name.example.com)
def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return ''
It's very bad practice to use a bare except like this as it hides any
errors and prevents you from using CTRL-C to break out of your code.
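A sketch of the narrower version (ValueError is my guess at what urlparse
can raise on malformed input; catch whatever you actually observe):

def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except ValueError:  # only the failure we expect; CTRL-C still propagates
        return ''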
def get_domain_name(url):
    try:
        variable = subdomain_name.split(',')
        return variable[-2] + ',' + variable[-1]  # returns 2nd to last and last instances of variable
    except:
        return '''
The above line is a syntax error: the three quote characters open a
string literal that is never closed.
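For what it's worth, here is a sketch of what the function was probably
meant to do, assuming the commas were typos for dots and that
subdomain_name was meant to be called on url rather than used bare:

def get_domain_name(url):
    try:
        parts = subdomain_name(url).split('.')
        return parts[-2] + '.' + parts[-1]  # last two labels, e.g. cracked.com
    except IndexError:  # netloc had fewer than two labels
        return ''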
(there are more functions, but those are housekeeping functions)
The interpreter returned this error:
RuntimeError: maximum recursion depth exceeded while calling a Python object
It appeared after crawl() and create_jobs() had been called a bunch of
times. How can I resolve this?
Thanks
Just a quick glance, but crawl() calls create_jobs(), which calls crawl()
again, so the two functions recurse into each other until the recursion
limit is exceeded...
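One way out, sketched with the names from the code above, is to let a
single loop drive the crawl so neither function calls the other:

def crawl():
    while True:
        links = file_to_set(QUEUE_FILE)
        if not links:
            break  # queue file drained, we're done
        print str(len(links)) + ' links in the queue'
        for link in links:
            threadqueue.put(link)  # hand each link to the worker threads
        threadqueue.join()  # wait for the workers to finish this batch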
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence