Re: Parsing HTTP links

Curtis Poe Thu, 18 Oct 2001 15:11:38 -0700

--- [EMAIL PROTECTED] wrote:
> Hey guys, 
> 
> thanks for all the help with this.  I actually did mean HTML Links as I am 
> looking to parse out specific links from an HTML file.  I'm not only 
> concerned with "HTTP" link (<a href>) but also other HTML flags.  Right 
> now I'm using HTML::SimpleLinkExtor but I'm not sure that gives me exactly 
> what I want.
> 
> Essentially what I'm trying to do is parse out all info from a web page 
> that is in bold (<b>text</b>).  I'm going to revisit LinkExtor but if 
> there is a better solution, I'm all ears.
> 
> Greg


Greg,

I was playing around with a similar problem and subclassed HTML::TokeParser as
HTML::TokeParser::Easy.  To do what you're looking for, you could use that module and 
do this:

############################################
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Easy;

my $file;
{
    local $/; 
    $file = <DATA>;
}

# Note: If you pass it a file name instead of the file contents,
# pass the name directly and *not* as a reference!!!
# see perldoc HTML::TokeParser for more info.

my $p = HTML::TokeParser::Easy->new( \$file );

while ( my $token = $p->get_token ) {
    # Only act on an opening <b> tag.
    if ( $p->is_start_tag( $token ) and $p->return_tag( $token ) eq 'b' ) {
        my $bold_text = '';
        # Collect everything up to the matching </b>.  Stop at end of input
        # so that an unclosed <b> cannot loop forever.
        while ( defined( $token = $p->get_token ) ) {
            last if $p->is_end_tag( $token ) and $p->return_tag( $token ) eq 'b';
            $bold_text .= $p->return_text( $token );
        }
        print "$bold_text\n";
    }
}
__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
    <title>Untitled</title>
</head>

<body>
  <h1>test</h1>
  <b>This is the first <i>bold</i> text.</b>
  <i>This should not appear.</i>
  <b>This is the second bold text.</b>
</body>
</html>
############################################

The output from the above is:

This is the first <i>bold</i> text.
This is the second bold text.

To use it, you would have to install HTML::TokeParser and my HTML::TokeParser::Easy
module (which I just uploaded at
http://www.easystreet.com/~ovid/cgi_course/downloads/Easy.pm).  I haven't bothered
to create a complete install package for it, so go into one of your Perl lib
directories and, in the HTML folder (something like /usr/bin/perl/site/lib/html/),
create a TokeParser directory and place Easy.pm in that directory.  Full POD is
included so, after you install it, you can type 'perldoc HTML::TokeParser::Easy' to
see how to use it.
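
If you would rather not install Easy.pm by hand, the same loop can be written
against plain HTML::TokeParser.  What follows is only a rough sketch, not how
Easy.pm itself is implemented: extract_bold is a helper name made up for this
example, and it relies solely on the token layouts documented for get_token
(["S", $tag, ...], ["E", $tag, $text], ["T", $text, $is_data]).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of any module): returns the raw contents of
# every <b>...</b> element, with nested markup left in place.
sub extract_bold {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Cannot set up parser: $!";
    my @found;
    while ( my $token = $p->get_token ) {
        # Start-tag tokens look like ["S", $tag, $attr, $attrseq, $text].
        next unless $token->[0] eq 'S' and $token->[1] eq 'b';
        my $bold_text = '';
        # Read until the closing </b>, or until the input runs out.
        while ( defined( $token = $p->get_token ) ) {
            my $type = $token->[0];
            last if $type eq 'E' and $token->[1] eq 'b';
            # Text tokens keep their text in slot 1; every other token type
            # keeps its original source text in the last slot.
            $bold_text .= $type eq 'T' ? $token->[1] : $token->[-1];
        }
        push @found, $bold_text;
    }
    return @found;
}

print "$_\n" for extract_bold(
    '<h1>test</h1><b>first <i>bold</i> text.</b><i>skip</i><b>second</b>'
);
```

Like the script above, this keeps nested tags such as <i> in the result.  If
you want the text only, $p->get_tag("b") followed by $p->get_text("/b") is
probably the shorter route.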

Frankly, I think the module is a bit of a hack, but if it works...

Cheers,
Curtis "Ovid" Poe

=====
Senior Programmer
Onsite! Technology (http://www.onsitetech.com/)
"Ovid" on http://www.perlmonks.org/
