On May 3, 2013, at 4:59 AM, Edward and Erica Heim wrote:

> Hi all,
> 
> I'm using  LWP::UserAgent to access a website. One of the methods returns 
> HTML data e.g.
> 
> my $data = $response->content;
> 
> I.e. $data contains the HTML content. I want to be able to parse it line by 
> line e.g.
> 
> foreach (split /pattern/, $data) {
>    my $line = $_;
> ......
> 
> If I print $data, I can see the individual lines of the HTML data but I'm not 
> clear on the "pattern" that I should use in split or if there is a better way 
> to do this.

If the lines are separated by new lines "\n", then the pattern is /\n/:

for my $line ( split(/\n/,$data) ) {
  …

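The lines could also end with carriage return + line feed, in which case the 
pattern is /\r\n/ (carriage return first, then line feed).

The pattern /[\r\n]+/ will handle both, but it will also gobble up blank lines, 
because it treats any run of successive line-ending characters as a single 
separator. The pattern /\r?\n/ handles both styles and keeps blank lines.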
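
Here is a minimal sketch of the line-by-line split, in case it helps. The 
sample string is made up and just stands in for $response->content; it mixes 
both kinds of line endings and contains one blank line.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for $response->content.
my $data = "<html>\r\n<body>\n\n<p>Hello</p>\r\n</body>\n</html>\n";

# /\r?\n/ matches one line ending at a time (LF or CRLF), so a blank
# line shows up as an empty string instead of being swallowed.
my @lines = split /\r?\n/, $data;

for my $line (@lines) {
    print "[$line]\n";
}
```

Note that split drops trailing empty fields by default, so the final newline 
does not produce an extra empty line at the end of the list.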

> 
> I understand that there are packages to parse HTML code but this is also a 
> learning exercise for me.
> 

I am currently using HTML::TokeParser to parse HTML files. It is pretty easy to 
use:

use HTML::TokeParser;

…

# assuming $data contains the HTML text to be parsed
my $parser = HTML::TokeParser->new(\$data);
while( my $token = $parser->get_token() ) {
  my $type = $token->[0];    # first element is the token type
  if( $type eq 'S' ) {
    my $tag = $token->[1];
    print "Start of tag $tag\n";
  }elsif( $type eq 'E' ) {
    print "End of tag $token->[1]\n";
  }elsif( $type eq 'T' ) {
    my $text = $token->[1];
    print "Text: $text\n";
  }elsif( $type eq 'C' ) {
    print "Comment: $token->[1]\n";
  }elsif( $type eq 'D' ) {
    print "Declaration: $token->[1]\n";
  }else{
    print "Unknown type $type!!!\n";
  }
}

See 'perldoc HTML::TokeParser' for details.

There are lots of other parsers out there. Some have special uses, like 
HTML::LinkExtor for extracting links, and HTML::TableExtract for extracting 
information from HTML tables. Some modules, like HTML::TreeBuilder, build an 
in-memory model of the HTML page that you can traverse or search for 
information.
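
As a rough illustration of one of these special-purpose parsers, here is a 
small sketch using HTML::LinkExtor. The HTML snippet is made up for the 
example, and the code assumes the module is installed (it ships with the 
HTML-Parser distribution from CPAN).

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical HTML snippet standing in for $response->content.
my $data = '<a href="http://learn.perl.org/">Learn</a> <img src="logo.png">';

# With no callback given, the parser collects the links internally.
my $extor = HTML::LinkExtor->new;
$extor->parse($data);
$extor->eof;

# Each link is an array ref: [ $tag, $attr => $url, ... ]
my @found;
for my $link ( $extor->links ) {
    my ($tag, %attrs) = @$link;
    push @found, "$tag $_=$attrs{$_}" for sort keys %attrs;
}
print "$_\n" for @found;
```

Passing a base URL as the second argument to new() makes the module return 
absolute URLs, which can be handy when the page uses relative links.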

Good luck.

