Re: other ways to parse emails from html?

Octavian Rasnita Thu, 31 Jan 2013 13:08:24 -0800

From: "Jeswin" <phillyj...@gmail.com>

Hi again,
I tried to use the treebuilder modules to get emails from a webpage
html but I don't know enough. It just gave me more headaches.

My current method to get the emails is to go to the site, put the source
code in MS Word, and run a regex to get all the emails in that html
page.

I think I can get the list of sites in a file and probably download
the html source codes and parse offline. Can't I just use regex to
parse the emails? What can go wrong?

I'm a noob at perl and not a programmer.

Thanks for the input




It depends what kind of emails you want to get from those web pages.

If you only want the emails that appear in links, the easiest way is to use something like:

# It will get the emails from links like: <a href="mailto:n...@host.com">E-mail</a>


use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

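# get() returns undef on failure, so check before parsing.
my $content = get( 'http://www.site.org/' );
die "Could not download the page\n" unless defined $content;

my $tree = HTML::TreeBuilder->new_from_content( $content );

# Find every <a> element whose href starts with "mailto:".
my @links = $tree->look_down( _tag => 'a', sub {
   $_[0]->attr('href') && $_[0]->attr('href') =~ /^\s*mailto:/i;
} );

my @emails;

for my $link ( @links ) {
   my $url = $link->attr('href');
   $url =~ s/^\s*mailto:\s*//i;   # strip the scheme
   $url =~ s/\?.*//s;             # drop any ?subject=... part
   push( @emails, $url );         # store the address, not the element
}

$tree->delete;   # free the parse tree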

(Untested.) You should now have the emails in @emails.

But if you want to get any e-mail from a page, whether or not it appears in a link, the easiest way is probably to use regular expressions. Or search CPAN for a module that does this more easily.
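
A minimal sketch of the regular-expression approach, for illustration only. The helper name extract_emails and the sample string are made up here, and the pattern is deliberately simple: it is not a full RFC 5322 matcher, and it will miss addresses that are obfuscated ("name at host dot com") or written with HTML entities such as &#64;.

```perl
use strict;
use warnings;

# Hypothetical helper: return the unique e-mail-like strings found in $html.
sub extract_emails {
    my ($html) = @_;
    return () unless defined $html;

    # Simple pattern (an assumption, not a complete address grammar):
    # local part, "@", then a domain with at least one dot.
    my $email_re = qr/[A-Za-z0-9._%+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/;

    # Keep the first occurrence of each address, comparing case-insensitively.
    my %seen;
    return grep { !$seen{ lc $_ }++ } $html =~ /($email_re)/g;
}

# Made-up sample input; in practice $sample would be the downloaded page.
my $sample = '<p>Contact: alice@example.org or '
           . '<a href="mailto:bob@example.com">Bob</a>, alice@example.org</p>';

my @emails = extract_emails($sample);
print "$_\n" for @emails;
```

Because this runs over the raw HTML, it picks up addresses both inside and outside links, but it can also match things that only look like addresses (for example inside scripts or comments), so the results are worth checking by eye.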


Octavian


