Python parser using Beautiful Soup needs to be reviewed
Hello community, I am new to Python and to Beautiful Soup as well! It is said to be a great tool for parsing and extracting content, so here I am: I want to take the content of a <td>-tag of a table in an HTML document. For example, I have this table:

This is a sample text
This is the second sample text

How can I use Beautiful Soup to take the text "This is a sample text"? Should I make use of

soup.findAll('table', attrs={'class': 'bp_ergebnis_tab_info'})

to get the whole table? See the target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323

Well, what do we have to do first? The first thing is to find the table. I do this with find: using find rather than findAll returns the first item found (rather than returning a list of all finds, in which case we would have to add an extra [0] to take the first element of the list):

table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})

Then use find again to find the first td:

first_td = soup.find('td')

Then we have to use renderContents() to extract the textual contents:

text = first_td.renderContents()

... and the job is done (though we may also want to use strip() to remove leading and trailing spaces):

trimmed_text = text.strip()

This should give us:

print trimmed_text
This is a sample text

as desired. What do you think about the code? I would love to hear from you!

greetings
matze
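Put together, the walkthrough above becomes one self-contained snippet (a sketch, assuming BeautifulSoup 3; the sample table is inlined as a simple two-row table with the bp_ergebnis_tab_info class, which is an assumption about its exact markup, so that nothing has to be fetched):

from BeautifulSoup import BeautifulSoup

# The sample table from the post, inlined as a string (markup assumed).
html = '''
<table class="bp_ergebnis_tab_info">
  <tr><td>This is a sample text</td></tr>
  <tr><td>This is the second sample text</td></tr>
</table>
'''

soup = BeautifulSoup(html)

# findAll() would return a list of every matching table; find() returns
# only the first match, so no extra [0] is needed.
table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})

# With this tiny sample, soup.find('td') and table.find('td') give the same
# cell; on a full page, table.find('td') keeps the search inside the table.
first_td = table.find('td')

text = first_td.renderContents()   # raw contents of the cell
trimmed_text = text.strip()        # drop leading/trailing whitespace

print trimmed_text                 # -> This is a sample text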
Python parser using Beautiful Soup only spits out one line out of 10. What have I gotten wrong here?
Hello dear community,

I am trying to get a scraper up and running and keep running into problems. When I try what I have learned so far, I only get:

Schuldaten

Here is the code that I used:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323")
soup = BeautifulSoup(page)
table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})
first_td = soup.find('td')
text = first_td.renderContents()
trimmed_text = text.strip()
print trimmed_text

I run it in the template at http://scraperwiki.com/scrapers/new/python

See the target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323

What have I gotten wrong? Can anybody review the code? Many thanks in advance.

regards
matze
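One thing that stands out in the code: first_td is taken from soup, not from table, so it is the first <td> of the whole page (presumably the cell that contains "Schuldaten") rather than the first cell of the bp_ergebnis_tab_info table. A sketch of the same script with the search scoped to the table, untested against the live page:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})

# Look for the first <td> inside the table that was just found,
# not in the whole document.
first_td = table.find('td')
print first_td.renderContents().strip()

# To print every cell of the table instead of only the first one:
for td in table.findAll('td'):
    print td.renderContents().strip()

If the table really has ten rows, the findAll loop should print all of them.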
need some debugging info on a simple regex
hello dear list!

I am very new to programming and am teaching myself. I am having a problem with a little project: I am trying to perform a fetch process, but every time I try it, it runs into errors. I have read the Python documentation for more than ten hours now, and I have several books here, but they do not help at the moment.

This code runs like a charm:

import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()

for lk, capname, name in re.findall('(.*?)(.*?)', html):
    alk = urlparse.urljoin(url, lk)
    data = {'url': alk, 'name': name, 'cname': capname}
    phtml = urllib.urlopen(alk).read()
    memail = re.search('', phtml)
    if memail:
        data['email'] = memail.group(1)
    print data

Note that the above-mentioned code runs very, very well. All is nice. Now I want to apply it to a new target; I can learn a lot with this. Let us say this Swiss site, educa.ch.

What is the aim: I want to adapt the example to a new target, to learn more about regex and to do some homework (I work as a teacher and am collecting some data about colleagues). How should we fetch the pages? That is the problem. I want to learn while applying the code. What is necessary to apply the example to the target?

The target: http://www.educa.ch/dyn/79362.asp?action=search

But the code (see below) does not run. I have tried several things to debug it - can you help me? By the way, should I fetch the pages and load them into an array, or should I loop over the detail pages http://www.educa.ch/dyn/79376.asp?id=2635, http://www.educa.ch/dyn/79376.asp?id=3493 and so on?

See the code that does not work:

import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/"
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()

for capname, lk in re.findall(']+>([^<] +).*?', phtml)
    if memail:
        data['email'] = memail.group(1)
    print data

I look forward to getting some starting points.

thx
matze
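A rough skeleton for the educa.ch run could follow the same shape as the CPAN script: find the detail-page links on the result page, urljoin each one, fetch the detail page, and search it for a mail address. This is only a sketch; the two patterns below are placeholders and have to be adapted to the actual markup of 79362.asp and 79376.asp, which I have not checked:

import urllib
import urlparse
import re

base = "http://www.educa.ch/dyn/"
html = urllib.urlopen(base + "79362.asp?action=search").read()

# Placeholder pattern: capture the relative link to a detail page and the
# displayed name from the result list.  The real pattern has to be taken
# from the HTML source of 79362.asp.
link_pat = re.compile(r'<a href="(79376\.asp\?id=\d+)"[^>]*>([^<]+)</a>')

# Placeholder pattern: capture an e-mail address from a mailto link on the
# detail page.
mail_pat = re.compile(r'mailto:([^"\'>]+)')

for lk, name in link_pat.findall(html):
    detail_url = urlparse.urljoin(base, lk)
    data = {'url': detail_url, 'name': name.strip()}

    phtml = urllib.urlopen(detail_url).read()
    memail = mail_pat.search(phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

Looping directly over the links found on the result page (as above) avoids having to know the id numbers in advance; collecting them into a list first would only be needed if the detail pages were to be fetched in a separate pass.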