Hi all,
Since everyone here is a Perl expert and I'm a total newbie, I would be very
grateful if someone could help me out with my questions.
I am doing a project to develop a student-professor system, including databases
etc. To start off, I need lots of professor data from the websites of various
educational institutions (for populating my database). To extract this data
and get started, I decided to use Perl, since its text extraction capabilities
are well known.
The problem is that all these sites have totally different HTML formats and
structures and differ in how the professors' info is listed, and I can't seem
to come up with generic Perl code to extract this data and put it in text
files on my local hard disk. Therefore I think I'll need to use regexes and
pattern matching to do the task, but I'm not sure how to go about it. I wrote one
script that works with www.ntu.edu.sg/sce/staffacad.asp, but it is way too
specific and doesn't work with any other staff site!
I need to do the following:
1. Visit the base site of any institute and extract professor information, which
includes name, email, degree, research interests and publications.
2. For publications, the listing appears either via a link on the prof's
homepage or as a chunk of data under a heading such as "PUBLICATIONS". I think I
can get the data if it's via a link, but I don't know how to extract that exact chunk
in the middle of a page.
3. All this info should be extracted to external text files.
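To make steps 1 to 3 concrete, here is a rough sketch of the kind of thing I mean for a single page. It assumes the page marks its sections with <h1>..<h6> headings, which is only an assumption and will not hold for many staff pages. The sample HTML, the helper names section_text and emails, and the output file name are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample page; real staff pages will use different markup.
my $html = <<'HTML';
<html><body>
<h1>Dr. Jane Doe</h1>
<p>Email: <a href="mailto:jane.doe@example.edu">jane.doe@example.edu</a></p>
<h2>Research Interests</h2>
<p>Databases, information retrieval</p>
<h2>Publications</h2>
<ul>
<li>J. Doe, "A Paper", 2004.</li>
<li>J. Doe, "Another Paper", 2005.</li>
</ul>
<h2>Teaching</h2>
<p>CS101</p>
</body></html>
HTML

# Hypothetical helper: return the non-empty text lines found between a
# heading whose text contains the given title and the next heading
# (or the end of the page). Returns an empty list if no heading matches.
sub section_text {
    my ($page, $title) = @_;
    my ($chunk) = $page =~ m{
        <h[1-6][^>]*>[^<]*\Q$title\E[^<]*</h[1-6]>   # the section heading
        (.*?)                                        # section body, non-greedy
        (?= <h[1-6][^>]*> | \z )                     # stop at next heading or end
    }xsi;
    return () unless defined $chunk;

    $chunk =~ s/<[^>]+>//g;    # crude tag stripping, fine for a sketch
    my @lines;
    for my $line (split /\n/, $chunk) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @lines, $line if length $line;
    }
    return @lines;
}

# Hypothetical helper: return each distinct email-like string on the page.
sub emails {
    my ($page) = @_;
    my %seen;
    return grep { !$seen{lc $_}++ }
           $page =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
}

# Write the extracted fields to a text file (file name is made up).
my $out = 'jane_doe.txt';
open my $fh, '>', $out or die "Cannot write $out: $!";
print {$fh} "EMAIL: $_\n"       for emails($html);
print {$fh} "PUBLICATION: $_\n" for section_text($html, 'PUBLICATIONS');
close $fh or die "Cannot close $out: $!";
```

The heading match is case-insensitive, so 'PUBLICATIONS' also finds a heading written as "Publications". For pages that do not use heading tags at all, a real HTML parser would probably be more reliable than regexes.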
I can manage if someone just helps me with snippets of code to get started with
the extraction, i.e. accurate extraction of information from any random site of an
institution which has profs listed.
For example, some sites are www.ntu.edu.sg/sce/staffacad.asp,
http://www.ntu.edu.sg/eee/people/, http://www.ie.cuhk.edu.hk/index.php?id=6,
http://www.ntu.edu.sg/mpe/admin/staff.asp
I'd greatly appreciate any help in any direction. I'm totally lost here. Please feel
free to ask if you have any doubts regarding my question!
shuchi