Re: other ways to parse emails from html?

Charles DeRykus Thu, 31 Jan 2013 20:49:42 -0800

On Thu, Jan 31, 2013 at 1:07 PM, Octavian Rasnita <orasn...@gmail.com> wrote:
> From: "Jeswin" <phillyj...@gmail.com>
>
>
>> ...
>
>
>
>
> It depends what kind of emails you want to get from those web pages.
>
> If you only want the emails that appear in links, the easiest way is to
> use something like:
>
> # It will get the emails from links like:
> # <a href="mailto:n...@host.com">E-mail</a>
>
> use strict;
> use warnings;
> use LWP::Simple;
> use HTML::TreeBuilder;
>
> my $content = get( 'http://www.site.org/' );
> my $tree = HTML::TreeBuilder->new_from_content( $content );
>
> my @links = $tree->look_down( _tag => 'a', sub {
>    $_[0]->attr('href') && $_[0]->attr('href') =~ /^\s*mailto:/;
> } );
>
> my @emails;
>
> for my $link ( @links ) {
>    my $url = $link->attr('href');
>    $url =~ s/^\s*mailto:\s*//;
>    push( @emails, $url );   # push the stripped address, not the element
> }
>
> (Untested)  You should have the emails in @emails.


Variant of above:

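use strict;
use warnings;
use HTML::TreeBuilder 5.03;
use URI;

my $some_url = 'http://www.site.org/';   # page to scan
my $t = HTML::TreeBuilder->new_from_url( $some_url );

my @emails;
# look_down accepts a regex as an attribute criterion, so no callback is needed
foreach my $link ( $t->look_down( _tag => 'a', href => qr/^\s*mailto:/i ) ) {
    my $url = URI->new( $link->attr('href') );
    # for a mailto: URI, path is the part after the scheme
    push( @emails, $url->path );
}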
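
A mailto: href can also carry several comma-separated addresses, %XX escapes, and a ?subject=... part, so the string after the scheme is not always a single clean address. Below is a rough sketch of one way to split such an href using core Perl only. The helper name mailto_addresses and the example.com addresses are made up for illustration, and addresses given in a ?to= header are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn a mailto: href into a
# list of plain addresses without any non-core modules.
sub mailto_addresses {
    my ($href) = @_;
    return () unless defined $href;

    # Trim surrounding whitespace, then remove the scheme (case-insensitive).
    ( my $rest = $href ) =~ s/^\s+|\s+$//g;
    return () unless $rest =~ s/^mailto://i;

    # Drop any ?subject=...&body=... header part.
    $rest =~ s/\?.*\z//s;

    my @addresses;
    for my $addr ( split /,/, $rest ) {
        # Decode %XX escapes, e.g. %40 for "@".
        $addr =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
        $addr =~ s/^\s+|\s+$//g;
        push @addresses, $addr if length $addr;
    }
    return @addresses;
}

print "$_\n"
    for mailto_addresses('mailto:alice@example.com,bob%40example.org?subject=Hi');
```

Something like push( @emails, mailto_addresses( $link->attr('href') ) ) could then replace the push in the loop above.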

>
>
> But if you want to get any e-mail from a page, whether or not it appears in
> a link, the easiest way is probably to use regular expressions.
> Or search CPAN for a module that makes it easier.
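
For the regular-expression route, a minimal sketch might look like the following. The pattern is deliberately loose and is my own assumption, not a full address grammar; the name find_emails and the sample HTML are made up for illustration. It will miss obfuscated addresses and can pick up strings that merely look like addresses.

```perl
use strict;
use warnings;

# Loose pattern: local part, "@", dotted domain ending in at least two letters.
my $EMAIL_RE = qr/[A-Za-z0-9._%+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}/;

# Hypothetical helper: return each address-like string found in $text once,
# in order of first appearance (compared case-insensitively).
sub find_emails {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{ lc $_ }++ } $text =~ /($EMAIL_RE)/g;
}

my $html = '<p>Contact jane.doe@example.com or '
         . '<a href="mailto:info@example.org">us</a>; '
         . 'again jane.doe@example.com.</p>';

print "$_\n" for find_emails($html);
```

Running it over the raw page source catches addresses in mailto: links and in plain text alike, since both are just text to the regex.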

-- 
Charles DeRykus

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
