On 13/11/2010 01:21, Martin Kaspar wrote:
Hello dear list!

I'm very new to programming and am teaching myself. I'm having a
problem with a little project.

I'm trying to perform a fetch process, but every time I try it I run
into errors.
I have read the Python documentation for more than ten hours now! And I
have several books here,
but they do not help at the moment. This code runs like a charm!!


import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W";
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
     alk = urlparse.urljoin(url, lk)

     data = { 'url':alk, 'name':name, 'cname':capname }

     phtml = urllib.urlopen(alk).read()
     memail = re.search('<a href="mailto:(.*?)">', phtml)
     if memail:
         data['email'] = memail.group(1)

     print data

Note that the code above runs very well. All is nice. Now I
want to apply it to a new target. I can learn a lot from this. Let us
say this Swiss site, educa.ch.

The aim: I want to adapt it to a new target to learn more about
regexes and to do some homework (I work as a teacher and am
collecting some data about colleagues). How to fetch the pages
is the problem. I want to learn while applying the
code. What is necessary to apply the example to the target?

the target:  http://www.educa.ch/dyn/79362.asp?action=search

But the code (see below) does not run. I tried several things to
debug it. Can you help me?
BTW, should I fetch the pages and load them into an array, or should I
loop over the URLs:

http://www.educa.ch/dyn/79376.asp?id=2635
http://www.educa.ch/dyn/79376.asp?id=3493
and so on...

Here is the code that does not work:

import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/";
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
for capname, lk in re.findall('<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+.asp?id=\d+)\'', html):
     alk = urlparse.urljoin(url, lk)

     data = { 'url':alk, 'cname':capname }

     phtml = urllib.urlopen(alk).read()
     memail = re.search('<a href="mailto.*?)">', phtml)
     if memail:
         data['email'] = memail.group(1)

     print data

Looking forward to some starting points...

Don't just say "does not run" or "does not work". That's not very
helpful. It's like saying "My car doesn't work. How should I fix it?".
:-)

When writing regexes it's recommended that you use raw string literals.
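
A small sketch of why raw strings matter (the sample text is made up and
not from your script): in a normal string literal '\b' is turned into a
backspace character before the regex engine sees it, whereas in a raw
string it arrives intact and means "word boundary".

```python
import re

text = "id=42 and id=7"  # made-up sample text

# Normal string literal: Python converts '\b' into a single backspace
# character, so the pattern looks for backspace + 'id' + backspace.
plain_matches = re.findall('\bid\b', text)

# Raw string literal: the backslash is kept, so the regex engine sees
# '\b' and treats it as a word boundary.
raw_matches = re.findall(r'\bid\b', text)

print("lengths: %d %d" % (len('\b'), len(r'\b')))
print("plain: %r" % (plain_matches,))
print("raw:   %r" % (raw_matches,))
```

The first pattern finds nothing, the second finds both occurrences of
'id'.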

Your first regex contains 'asp?', which says that the 'p' is optional.
I think you meant 'asp\?'. Also, '.' will match any character except
'\n'. If you want to match an actual '.' then use '\.'.
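
A quick check of the difference, using one of the links from your post
(the last test string is made up to show an unintended match):

```python
import re

link = "79376.asp?id=2635"  # one of the links from the post

# Unescaped: '?' makes the 'p' optional and '.' matches any character,
# so the pattern wants 'id=' directly after 'as' or 'asp' and fails here.
loose = re.search(r'\d+.asp?id=\d+', link)

# Escaped: '\.' and '\?' match a literal dot and a literal question mark.
strict = re.search(r'\d+\.asp\?id=\d+', link)

# The unescaped pattern happily matches text that was never intended.
unintended = re.search(r'\d+.asp?id=\d+', "79376Xasid=2635")

print("loose:      %r" % (loose,))
print("strict:     %r" % (strict.group(0),))
print("unintended: %r" % (unintended.group(0),))
```

The unescaped pattern returns None for the real link, while the escaped
one matches it completely.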

Your second regex contains a closing parenthesis ')' but no opening
parenthesis '('.
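
You can see this for yourself by compiling the pattern on its own. The
sketch below also tries a corrected pattern; I am assuming you meant the
same 'mailto:(.*?)' group as in your first script, and the sample HTML
is made up.

```python
import re

# The pattern as posted: a ')' without a matching '('.
try:
    re.compile(r'<a href="mailto.*?)">')
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Assumed intent, copied from the first script: capture the address.
fixed = re.compile(r'<a href="mailto:(.*?)">')

sample = '<a href="mailto:someone@example.org">'  # made-up sample HTML
match = fixed.search(sample)
email = match.group(1) if match else None

print("compile error: %s" % error_message)
print("email: %s" % email)
```

The error text printed here is exactly the kind of detail worth
including when you ask for help.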
--
http://mail.python.org/mailman/listinfo/python-list
