Hi,

As a basic reply: coming up with a generic HTML parser for the kind of thing
you're doing will be difficult. You may find it quicker (in terms of
development time) to write a custom hack for each website you'll be looking
at. Other Perl sites and lists would be able to help you better; a good place
to start would be www.perlmonks.net. That said, there might already be an
online database for the kind of information you are looking for (I'd be
surprised if there isn't).
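
To give a rough idea of what a per-site hack might look like, here is a
minimal sketch using only core Perl. Everything site-specific in it is an
assumption made up for illustration: the host www.example.edu, the sample
HTML, the "one professor per table row" layout and the sub names are all
hypothetical, and the real pages will need their own patterns. Fetching the
page is left out; LWP::Simple's get($url) is one common way to do it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One handler per site. The host name and handler are hypothetical examples.
my %handler_for = (
    'www.example.edu' => \&parse_example_staff,
);

# Crude tag stripper: fine for a quick hack, not for arbitrary HTML.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;     # replace tags with a space
    $html =~ s/&nbsp;/ /g;      # the most common entity
    $html =~ s/\s+/ /g;         # collapse whitespace
    $html =~ s/^\s+|\s+$//g;    # trim both ends
    return $html;
}

# Assumed layout: one professor per table row, name in the first cell,
# and a mailto: link somewhere in the same row.
sub parse_example_staff {
    my ($html) = @_;
    my @staff;
    for my $row (split /<tr\b/i, $html) {
        next unless $row =~ m{<td[^>]*>(.*?)</td>}is;
        my $name = strip_tags($1);
        next unless $row =~ m{mailto:([^"'>\s]+)}i;
        push @staff, { name => $name, email => $1 };
    }
    return @staff;
}

# Text between a heading that contains $title and the next heading
# (or the end of the page). Returns nothing if no such heading exists.
sub section_after_heading {
    my ($html, $title) = @_;
    return unless $html =~
        m{<h[1-6][^>]*>[^<]*\Q$title\E[^<]*</h[1-6]>(.*?)(?=<h[1-6]|\z)}is;
    return strip_tags($1);
}

# Made-up sample page standing in for a fetched document.
my $sample = <<'HTML';
<table>
<tr><td><b>Dr. A. Example</b></td>
    <td><a href="mailto:a@example.edu">mail</a></td></tr>
</table>
<h2>Publications</h2>
<ul><li>Paper one</li><li>Paper two</li></ul>
<h2>Teaching</h2>
HTML

my $parse = $handler_for{'www.example.edu'};
for my $prof ($parse->($sample)) {
    print "$prof->{name} <$prof->{email}>\n";
}
my $pubs = section_after_heading($sample, 'Publications');
print "Publications: $pubs\n" if defined $pubs;
```

The hash of handlers is the point: the generic part (stripping tags, grabbing
the chunk under a heading) is shared, and each new site only costs you one
more small sub. Writing the results to text files is then just a matter of
printing to a filehandle opened for writing instead of to STDOUT.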

For what it's worth, you've probably chosen the correct language, but there
will be a learning curve; it won't be easy :)

cheers,

jez.


----- Original Message ----- 
From: "#SHUCHI MITTAL#" <[EMAIL PROTECTED]>
To: <perl-win32-gui-users@lists.sourceforge.net>
Sent: Thursday, January 08, 2004 5:04 PM
Subject: [perl-win32-gui-users] General Perl Text Extraction doubt


Hi all

Since everyone here is a Perl expert and I'm a total newbie, I would be very,
very grateful if someone could help me out with my doubts.

I am doing a project to develop a student-professor system, including
databases etc. To start off I need lots of professor data from the websites
of various educational institutions (for populating my database). To
extract this data and get started I decided to use Perl, since its text
extraction capabilities are well known.

The problem is that these sites all have totally different HTML formats and
structures and differ in how the info for the profs is listed, and I can't
seem to come up with generic Perl code to extract this data and put it in
text files on my local hard disk. Therefore I think I'll need to use regexes
and pattern matching to do the task, but I'm not sure how to go about it. I
wrote one script that works with www.ntu.edu.sg/sce/staffacad.asp but it is
way too specific and doesn't work with any other staff sites!
I need to do the following:

1. Visit the base site of any institute and extract professor information,
which includes NAME, EMAIL, DEGREE, RESEARCH INTERESTS AND PUBLICATIONS
RELEASED.
2. For publications, the listing either appears via a link on the profs'
homepages or as a chunk of data under the heading "PUBLICATIONS" etc. I
think I can get the data if it's via a link, but I don't know how to extract
that exact chunk in the middle of a page.
3. All this info should be extracted to external text files.

I can manage if someone just helps me with snippets of code to get started
with the extraction, i.e. accurate extraction of information from any random
site of an institution which has profs listed etc.
For example, some sites are www.ntu.edu.sg/sce/staffacad.asp,
http://www.ntu.edu.sg/eee/people/, http://www.ie.cuhk.edu.hk/index.php?id=6,
http://www.ntu.edu.sg/mpe/admin/staff.asp

I would greatly appreciate any help in any direction; I'm totally lost here.
Please feel free to ask if you have any doubts regarding my question!

shuchi



_______________________________________________
Perl-Win32-GUI-Users mailing list
Perl-Win32-GUI-Users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perl-win32-gui-users

