On 14/06/18 19:32, Daniel Bosah wrote:
I am trying to modify code from a web crawler to scrape for keywords from
certain websites. However, I'm trying to run the web crawler before I
modify it, and I'm running into issues.
When I ran this code -
import threading
from Queue import Queue
from spider import Spider
from domain import get_domain_name
from general import file_to_set

PROJECT_NAME = "SPIDER"
HOME_PAGE = "https://www.cracked.com/"
DOMAIN_NAME = get_domain_name(HOME_PAGE)
QUEUE_FILE = '/home/me/research/queue.txt'
CRAWLED_FILE = '/home/me/research/crawled.txt'
NUMBER_OF_THREADS = 1
# Capitalize variables and make them class variables to make them const variables

threadqueue = Queue()

Spider(PROJECT_NAME, HOME_PAGE, DOMAIN_NAME)

def crawl():
    change = file_to_set(QUEUE_FILE)
    if len(change) > 0:
        print str(len(change)) + ' links in the queue'
        create_jobs()

def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        threadqueue.put(link)  # .put = put item into the queue
    threadqueue.join()
    crawl()

def create_spiders():
    for _ in range(NUMBER_OF_THREADS):  # _ basically if you don't want to act on the iterable
        vari = threading.Thread(target=work)
        vari.daemon = True  # makes sure that it dies when main exits
        vari.start()

#def regex():
#    for i in files_to_set(CRAWLED_FILE):
#        reg(i, LISTS)  # MAKE FUNCTION FOR REGEX; i is urls, LISTS is a list or set of keywords

def work():
    while True:
        url = threadqueue.get()  # pops item off queue
        Spider.crawl_pages(threading.current_thread().name, url)
        threadqueue.task_done()

create_spiders()
crawl()
That used this class:
from HTMLParser import HTMLParser
from urlparse import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()  # stores the links

    def error(self, message):
        pass

    def handle_starttag(self, tag, attrs):
        if tag == 'a':  # means a link
            for (attribute, value) in attrs:
                if attribute == 'href':  # href relative url, i.e. not having www
                    url = urlparse.urljoin(self.base_url, value)
                    self.links.add(url)

    def return_links(self):
        return self.links()
It's very unpythonic to define getters like return_links; just access
self.links directly. (As written it's also broken: return self.links()
tries to call the set, which raises TypeError.)
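For example, a minimal sketch of the direct-access style (it assumes the
__init__ problems in LinkFinder have been fixed first; html_text and
page_url are placeholder names, not from your code):

finder = LinkFinder(HOME_PAGE, page_url)
finder.feed(html_text)     # HTMLParser collects hrefs into finder.links
for url in finder.links:   # plain attribute access, no getter needed
    print url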
And this spider class:
from urllib import urlopen  # connects to webpages from python
from link_finder import LinkFinder
from general import directory, text_maker, file_to_set, conversion_to_set

class Spider():
    project_name = 'Reader'
    base_url = ''
    Queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.Queue_file = '/home/me/research/queue.txt'
        Spider.crawled_file = '/home/me/research/crawled.txt'
        self.boot()
        self.crawl_pages('Spider 1 ', base_url)
It strikes me as completely pointless to define this class when every
variable is at the class level and every method is defined as a static
method. Python isn't Java :)
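Here's a minimal sketch of the instance-attribute version (the constructor
signature is my assumption; the paths come from your code). Each spider
then owns its own state instead of mutating Spider.* globals:

class Spider(object):
    def __init__(self, project_name, base_url, domain_name,
                 queue_file='/home/me/research/queue.txt',
                 crawled_file='/home/me/research/crawled.txt'):
        self.project_name = project_name  # per-instance, not class-wide
        self.base_url = base_url
        self.domain_name = domain_name
        self.queue_file = queue_file
        self.crawled_file = crawled_file
        self.queue = set()    # each instance gets its own sets
        self.crawled = set()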
[code snipped]
and these functions:
from urlparse import urlparse

# get subdomain name (name.example.com)
def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return ''
It's very bad practice to use a bare except like this as it hides any
errors and prevents you from using CTRL-C to break out of your code.
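A sketch of the narrower version (ValueError is my guess at what urlparse
can raise on malformed input; catch whatever you actually observe):

def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except ValueError:  # only the failure we expect; CTRL-C still propagates
        return ''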
def get_domain_name(url):
    try:
        variable = subdomain_name.split(',')
        return variable[-2] + ',' + variable[-1]  # returns 2nd to last and last instances of variable
    except:
        return '''
The above line is a syntax error: the three quote characters open a
string literal that is never closed.
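For what it's worth, here is a sketch of what the function was probably
meant to do, assuming the commas were typos for dots and that
subdomain_name was meant to be called on url rather than used bare:

def get_domain_name(url):
    try:
        parts = subdomain_name(url).split('.')
        return parts[-2] + '.' + parts[-1]  # last two labels, e.g. cracked.com
    except IndexError:  # netloc had fewer than two labels
        return ''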
(there are more functions, but those are housekeeping functions)
The interpreter returned this error:
RuntimeError: maximum recursion depth exceeded while calling a Python object
It appeared after crawl() and create_jobs() had been called a bunch of
times. How can I resolve this?
Thanks
Just a quick glance, but crawl() calls create_jobs(), which calls crawl()
again, so the two functions recurse into each other until the recursion
limit is exceeded...
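One way out, sketched with the names from the code above, is to let a
single loop drive the crawl so neither function calls the other:

def crawl():
    while True:
        links = file_to_set(QUEUE_FILE)
        if not links:
            break  # queue file drained, we're done
        print str(len(links)) + ' links in the queue'
        for link in links:
            threadqueue.put(link)  # hand each link to the worker threads
        threadqueue.join()  # wait for the workers to finish this batch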
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence