On Jan 31, 2013, at 11:49 AM, Jeswin wrote:

> Hi again,
> I tried to use the treebuilder modules to get emails from a webpage
> html but I don't know enough. It just gave me more headaches.

You should post a short program here that demonstrates the problem you are 
having. Can you give us the URL of the page from which you are trying to 
extract addresses? 

> My current method to get the emails is to go to the site, put the source
> code in MS Word, and run a regex to get all the emails in that html
> page.
> 
> I think I can get the list of sites in a file and probably download
> the html source codes and parse offline. Can't I just use regex to
> parse the emails? What can go wrong?

HTML is difficult to parse in general, because the source allows so many 
variations. Extracting one specific kind of information from an HTML file, 
however, can usually be done with regular expressions, as long as the page 
doesn't contain any unusual HTML constructs. There is a good chance that the 
email addresses all follow the simple convention of 'mailto:n...@host.tld'. In 
that case, extracting those addresses could be very simple. Your program may 
miss some unusual variations, but you should be able to get most of them.
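
As a rough sketch of that approach, something like the following might do, 
assuming the addresses really do appear in plain 'mailto:' links. The 
extract_mailto() subroutine and the sample HTML are made up for illustration, 
and the pattern is deliberately simple, not a full address validator:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns the unique addresses
# found in mailto: links, lowercased, in order of first appearance.
sub extract_mailto {
    my ($html) = @_;
    my (%seen, @found);
    # Assumes a simple local-part@host.tld shape; anything after the
    # address (such as "?subject=...") is left out of the capture.
    while ($html =~ /mailto:([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/gi) {
        my $addr = lc $1;
        push @found, $addr unless $seen{$addr}++;
    }
    return @found;
}

# Made-up sample input standing in for a downloaded page.
my $html = <<'HTML';
<p>Contact <a href="mailto:alice@example.com">Alice</a> or
<a href="MAILTO:Bob.Smith@mail.example.org?subject=Hi">Bob</a>.
<a href="mailto:alice@example.com">Alice again</a></p>
HTML

print "$_\n" for extract_mailto($html);
```

This will miss addresses that are not inside a mailto: link, that are 
obfuscated (for example "name at host dot tld"), or that are written with 
HTML entities. To run it over pages you have saved offline, read each file 
into a string and pass it to the subroutine in place of the sample.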

