Hi,

As a basic reply: coming up with a generic HTML parser for the kind of thing you're doing will be difficult. You may find it quicker (in terms of development time) to write a custom hack for each website you'll be looking at. Other Perl sites/lists would be able to help you better; a good place to start would be www.perlmonks.net. That said, there may already be an online database for the kind of information you're after (I'd be surprised if there isn't).
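To give a flavour of the per-site approach, here's a minimal sketch for the first page you mention (www.ntu.edu.sg/sce/staffacad.asp). It assumes the staff names appear as the link text of mailto: links on that page, which may well not hold in practice, so treat it as a template to adapt rather than a working scraper. It uses LWP::Simple and HTML::TokeParser, both standard fare for this kind of job.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;        # get()
use HTML::TokeParser;   # walks the HTML token by token

# Sketch only -- assumes names are the link text of mailto: links,
# which you will need to verify for each site you target.
my $url  = 'http://www.ntu.edu.sg/sce/staffacad.asp';
my $html = get($url) or die "Couldn't fetch $url\n";

my $p = HTML::TokeParser->new(\$html);

open my $out, '>', 'staff.txt' or die "Can't write staff.txt: $!\n";
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href} || '';
    next unless $href =~ /^mailto:(.+)/i;
    my $email = $1;
    my $name  = $p->get_trimmed_text('/a');   # text between <a> and </a>
    print $out "$name\t$email\n";
}
close $out;

Writing a short loop like that for each site and normalising the output into one text format is probably your quickest route to a populated database.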
For what it's worth, you've probably chosen the right language, but there will be a learning curve; it won't be easy. :)

cheers,
jez.

----- Original Message -----
From: "#SHUCHI MITTAL#" <[EMAIL PROTECTED]>
To: <perl-win32-gui-users@lists.sourceforge.net>
Sent: Thursday, January 08, 2004 5:04 PM
Subject: [perl-win32-gui-users] General Perl Text Extraction doubt

Hi all,

Since everyone here is a Perl expert and I'm a total newbie, I would be very grateful if someone could help me with my doubts. I am working on a project to build a student-professor system, including databases. To start off, I need a lot of professor data from the websites of various educational institutions (to populate my database). To extract this data I decided to use Perl, since its text-extraction capabilities are well known. The problem is that these sites all have completely different HTML formats and structures, and they differ in how the professors' information is listed, so I can't come up with a generic Perl script to extract the data and write it to text files on my local hard disk. I therefore think I'll need to use regexes and pattern matching, but I'm not sure how to go about it. I wrote one script that works with www.ntu.edu.sg/sce/staffacad.asp, but it is far too specific and doesn't work with any other staff pages.

I need to do the following:

1. Visit the base site of any institution and extract professor information, which includes NAME, EMAIL, DEGREE, RESEARCH INTERESTS and PUBLICATIONS RELEASED.

2. The publications either appear via a link on the professor's homepage or as a chunk of text under a heading such as "PUBLICATIONS". I think I can get the data if it's behind a link, but I don't know how to extract that exact chunk from the middle of a page.

3. All of this information should be written to external text files.

I could manage if someone just helped me with snippets of code to get started with accurate extraction of information from any random institution site that lists its professors. Some example sites are www.ntu.edu.sg/sce/staffacad.asp, http://www.ntu.edu.sg/eee/people/, http://www.ie.cuhk.edu.hk/index.php?id=6 and http://www.ntu.edu.sg/mpe/admin/staff.asp.

I would greatly appreciate any help in any direction; I'm totally lost here. Please feel free to ask if you have any questions about my question!

shuchi
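On point 2 above (grabbing the chunk under a "PUBLICATIONS" heading): here is a rough regex-based sketch. It assumes the list sits between an <hN> heading whose text is "Publications" and the next heading on the page, and the URL is only a placeholder; real pages will need per-site tweaks.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical professor homepage -- substitute a real URL.
my $url  = 'http://www.example.edu/~someprof/';
my $html = get($url) or die "Couldn't fetch $url\n";

# Grab whatever sits between a heading containing "Publications"
# and the next heading (or the end of the page).
my ($chunk) = $html =~ m{
    <h\d[^>]*> \s* publications? \s* </h\d>   # the heading itself
    (.*?)                                     # the block we want
    (?= <h\d | \z )                           # stop at the next heading
}six;

if (defined $chunk) {
    $chunk =~ s/<[^>]+>/ /g;    # crude tag strip
    $chunk =~ s/\s+/ /g;        # collapse runs of whitespace
    print "$chunk\n";
}
else {
    print "No PUBLICATIONS section found on $url\n";
}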