python-parser running Beautiful Soup needs to be reviewed

2010-12-11 Thread Martin Kaspar
Hello community,

I am new to Python and to Beautiful Soup as well!
It is said to be a great tool for parsing and extracting content. So here I
am:

I want to take the content of a <td> tag of a table in an HTML
document. For example, I have this table:

<table class="bp_ergebnis_tab_info">
  <tr>
    <td>
      This is a sample text
    </td>
    <td>
      This is the second sample text
    </td>
  </tr>
</table>
How can I use BeautifulSoup to take the text "This is a sample text"?

Should I make use of
soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'}) to get
the whole table?

See the target:
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323

Well, what do we have to do first?

The first thing is to find the table:

I do this with find. Using find rather than findAll returns the first
match directly (rather than returning a list of all matches, in which
case we would have to add an extra [0] to take the first element of
the list):


table = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})
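
For comparison, the findAll variant mentioned above would look roughly like this (just a sketch; it assumes at least one matching table exists, otherwise the [0] raises an IndexError):

tables = soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'})
table = tables[0]  # take the first hit from the result list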

Then use find again to find the first td:

first_td = soup.find('td')
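
As a side note: soup.find('td') searches the whole document, so on a bigger page the first td might not belong to our table. If the first cell of that particular table is wanted, searching on the table object found above should work as well (a small variant, assuming table is not None):

first_td = table.find('td')  # restrict the search to the matched table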

Then we have to use renderContents() to extract the textual contents:

text = first_td.renderContents()
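
renderContents() returns the raw contents of the tag, including any nested markup, as a byte string. If only the plain text is wanted, an alternative in BeautifulSoup 3 could be to grab the first text node, for example:

text_only = first_td.find(text=True)  # first text node inside the td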

... and the job is done (though we may also want to use strip() to
remove leading and trailing whitespace):

trimmed_text = text.strip()

This should give us:


print trimmed_text
This is a sample text

as desired.


What do you think about the code? I would love to hear from you!

greetings
matze


python-parser running Beautiful Soup only spits out one line of 10. What have I gotten wrong here?

2010-12-25 Thread Martin Kaspar
Hello dear Community,


I am trying to get a scraper up and running, and I keep running into
problems.

When I try what I have learned so far, I only get:
Schuldaten

Here is the code that I used:

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323")
soup = BeautifulSoup(page)
table = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})
first_td = soup.find('td')
text = first_td.renderContents()
trimmed_text = text.strip()
print trimmed_text


I run it in the template at http://scraperwiki.com/scrapers/new/python

See the target:
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323
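
For reference, a variant that looks for the first td inside the matched table instead of in the whole soup would be roughly this (only a sketch, not verified against the live page):

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323")
soup = BeautifulSoup(page)
table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})
if table is not None:
    first_td = table.find('td')  # first cell of this table, not of the whole page
    print first_td.renderContents().strip()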

What have I gotten wrong?

Can anybody review the code?

Many thanks in advance.

regards
matze


need some debug info on a simple regex

2010-11-12 Thread Martin Kaspar
hello dear list!

I'm very new to programming and am teaching myself. I'm having a
problem with a little project.

I'm trying to perform a fetch process, but every time I try it, it runs
into errors. I have read the Python documentation for more than ten
hours now! And I have several books here, but they do not help at the
moment. This code runs like a charm!!


import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W";
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('(.*?)(.*?)', html):
alk = urlparse.urljoin(url, lk)

data = { 'url':alk, 'name':name, 'cname':capname }

phtml = urllib.urlopen(alk).read()
memail = re.search('', phtml)
if memail:
data['email'] = memail.group(1)

print data

Note that the above-mentioned code runs very well. All is nice. Now I
want to apply it to a new target; I can learn a lot with this. Let us
say this Swiss site: educa.ch.

What is the aim: I want to adapt it to a new target to learn more about
regexes and to do some homework (I work as a teacher and am collecting
some data about colleagues). How should we fetch the sites? That is the
problem. I want to learn while applying the code. What is necessary to
apply the example to the target?

the target:  http://www.educa.ch/dyn/79362.asp?action=search

But the code (see below) does not run. I have tried several things to
debug it; can you help me?
BTW, should I fetch the pages and load them into an array, or should I
loop over the detail pages directly (see the sketch after the example
URLs), e.g.

http://www.educa.ch/dyn/79376.asp?id=2635
http://www.educa.ch/dyn/79376.asp?id=3493
and so on...
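
A minimal sketch of the loop-over-the-detail-pages idea (the two ids below are just the ones from the example URLs; in practice they would be collected from the search page first):

import urllib

base = "http://www.educa.ch/dyn/79376.asp?id=%s"
for page_id in ['2635', '3493']:
    detail_html = urllib.urlopen(base % page_id).read()
    print page_id, len(detail_html)  # just confirm each detail page was fetched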

See the code that does not work:

import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/";
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?
action=search").read()
for capname, lk in re.findall(']+>([^<]
+).*?', phtml)
if memail:
data['email'] = memail.group(1)

print data

Looking forward to getting some starting points...

thx  matze
