Hi all
 
Since everyone here is a Perl expert and I'm a total newbie, I would be very
grateful if someone could help me out with my questions.
 
I am working on a project to develop a student-professor system, including
databases etc. To start off, I need lots of professor data from the websites of
various educational institutions (for populating my database). To extract this
data and get started, I decided to use Perl, since its text-extraction
capabilities are well known.
 
The problem is that all these sites have totally different HTML formats and
structures and differ in how the professors' info is listed, and I can't seem
to come up with generic Perl code to extract this data and put it in text
files on my local hard disk. I therefore think I'll need to use regular
expressions and pattern matching for the task, but I'm not sure how to go about
it. I wrote one script that works with www.ntu.edu.sg/sce/staffacad.asp, but it
is far too specific and doesn't work with any other staff site.
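
To show the kind of layout-independent step I have in mind, here is a rough
sketch (not my NTU script) that pulls e-mail addresses out of raw HTML with a
regex. The helper name extract_emails, the sample markup and the pattern are
all my own assumptions; the pattern is deliberately simple and will miss
obfuscated addresses such as "name at domain".

```perl
use strict;
use warnings;

# Hypothetical helper: return the unique e-mail addresses found in a chunk
# of HTML, lower-cased, in order of first appearance. The pattern is a
# simplification, not a full RFC 5322 matcher.
sub extract_emails {
    my ($html) = @_;
    my %seen;
    my @emails;
    while ($html =~ /([A-Za-z0-9._%+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)/g) {
        my $addr = lc $1;
        push @emails, $addr unless $seen{$addr}++;
    }
    return @emails;
}

# Made-up sample markup, only for illustration.
my $sample = <<'HTML';
<tr><td>Prof. A. Example</td>
<td><a href="mailto:aexample@example.edu">aexample@example.edu</a></td></tr>
<p>Contact: B.Sample@cs.example.edu</p>
HTML

print "$_\n" for extract_emails($sample);
```

Because it only looks at the text of the address, this part should not care
whether a site uses tables, lists or plain paragraphs. Names, degrees and
research interests are the fields where I am stuck.
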
I need to do the following:
 
1. Visit the base site of any institute and extract professor information,
which includes name, email, degree, research interests and publications.
2. For publications, the listing appears either via a link on the professor's
homepage or as a chunk of data under a heading such as "PUBLICATIONS". I think
I can get the data if it's via a link, but I don't know how to extract that
exact chunk from the middle of a page.
3. All this info should be extracted to external text files.
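
For point 2, this is roughly the behaviour I am after, written as a minimal
sketch. It assumes the page marks the section with a real heading tag (h1 to
h6) whose text is not wrapped in further tags, which many pages will not do.
The helper name extract_section and the sample page are made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between a heading (h1-h6) whose text
# contains $title and the next heading, with tags stripped. Returns undef
# in scalar context if no such heading is found.
sub extract_section {
    my ($html, $title) = @_;
    return unless $html =~ m{
        <h([1-6])\b[^>]*>          # opening heading tag
        [^<]*\Q$title\E[^<]*       # heading text containing the title
        </h\1\s*>                  # matching closing tag
        (.*?)                      # the chunk we want, as short as possible
        (?= <h[1-6]\b | \z )       # stop at the next heading or end of input
    }xis;
    my $chunk = $2;
    $chunk =~ s/<[^>]+>/ /g;       # crude tag stripping
    $chunk =~ s/\s+/ /g;           # collapse runs of whitespace
    $chunk =~ s/^\s+|\s+$//g;      # trim both ends
    return $chunk;
}

# Made-up sample page, only for illustration.
my $page = <<'HTML';
<h2>Research Interests</h2>
<p>Databases, information retrieval</p>
<h2>Publications</h2>
<ul><li>First made-up paper title, 2004.</li>
<li>Second made-up paper title, 2005.</li></ul>
<h2>Teaching</h2>
HTML

my $pubs = extract_section($page, 'PUBLICATIONS');
print defined $pubs ? "$pubs\n" : "no PUBLICATIONS section found\n";
```

For point 3, the returned string could then be written out with the usual
open my $out, '>', $file and print {$out} calls. What I cannot work out is
how to make the matching robust when the heading is just bold text or a
table cell.
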
 
I can manage if someone just helps me with snippets of code to get started with
the extraction, i.e. accurate extraction of information from any random site of
an institution which has professors listed.
For example, some sites are www.ntu.edu.sg/sce/staffacad.asp,
http://www.ntu.edu.sg/eee/people/, http://www.ie.cuhk.edu.hk/index.php?id=6 and
http://www.ntu.edu.sg/mpe/admin/staff.asp
 
I would greatly appreciate any help in any direction; I'm totally lost here.
Please feel free to ask if you have any questions about my problem!
 
shuchi
 
