Re: parsing HTML

Andrew Gaffney Wed, 21 Jul 2004 21:07:07 -0700

Randy W. Sims wrote:

On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
Randy W. Sims wrote:
On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
I am trying to build an HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:

my $html = [ { tag => 'table', id => 'maintable', width => 300,
               content => [
                 { tag => 'tr', content => [
                     { tag => 'td', width => 200, content => "some content" },
                     { tag => 'td', width => 100, content => "more content" }
                   ]
                 }
               ]
             } ]; # Not tested, but you get the idea
[snip]
I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?
Parsers like HTML::Parser scan a document and upon encountering certain tokens fire off events. In the case of HTML::Parser, events are fired when encountering a start tag, the text between tags, and at the end tag. If you have an arbitrarily deep document structure like HTML, you can store the structure using a stack:
#!/usr/bin/perl
package SampleParser;
use strict;
use HTML::Parser;
use base qw(HTML::Parser);
sub start {
    my($self, $tagname, $attr, $attrseq, $origtext) = @_;
    my $stack = $self->{_stack};
    my $depth = $stack ? @$stack : 0;
    print ' ' x $depth, "<$tagname>\n";
    push @{$self->{_stack}}, ' ';
}
sub end {
    my($self, $tagname, $origtext) = @_;
    pop @{$self->{_stack}};
    my $stack = $self->{_stack};
    my $depth = $stack ? @$stack : 0;
    print ' ' x $depth, "</$tagname>\n";
}
1;
package main;
use strict;
use warnings;
my $p = SampleParser->new();
$p->parse_file(\*DATA);
__DATA__
<html>
<head>
<title>Title</title>
</head>
<body>
The body.
</body>
</html>

Thanks. In the time it took you to put that together, I came up with the following to figure out how HTML::Parser works. I'll use your code to expand upon it.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser ();

sub start {
  print "start ";
  foreach my $arg (@_) {
    if(ref($arg) eq 'HASH') {
      foreach my $key(keys %{$arg}) {
        print "  $key - $arg->{$key}\n";
      }
    } else {
      print "$arg\n";
    }
  }
}

sub end {
  print "end ";
  foreach(@_) {
    print "$_\n";
  }
}

sub text {
  my $text = shift;

  chomp $text;
  print "  text - '$text'\n" if($text ne '');
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           end_h   => [\&end,   "tagname"],
                           text_h  => [\&text,  "dtext"],
                           marked_sections => 1 ); # Not sure what this does

$p->parse_file("test.html") or die "Can't open test.html: $!";

The above gives me the expected output for the sample HTML I provided before.
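
For reference, a minimal sketch of how the stack idea and the api_version 3 handlers above might be combined to build the nested structure asked about at the start of the thread. The helper name html_to_tree, the synthetic '#root' node, and the %void list are assumptions made for illustration and come from neither post. Unlike the hand-written example, 'content' is always an array reference here so that mixed text and child elements keep their order. Not battle-tested; treat it as a starting point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();
use Data::Dumper;

# Elements that never get an end tag (illustrative list, not exhaustive).
my %void = map { $_ => 1 } qw(br hr img input meta link);

# Hypothetical helper: parse an HTML string into nested hashes of the form
# { tag => 'td', width => 200, content => [ ... ] }.
sub html_to_tree {
    my ($html) = @_;

    my $root  = { tag => '#root', content => [] };  # synthetic top-level node
    my @stack = ($root);                            # innermost open element is last

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            # 'tag' and 'content' come last so that attributes with the
            # same names cannot overwrite them.
            my $node = { %$attr, tag => $tagname, content => [] };
            push @{ $stack[-1]{content} }, $node;
            push @stack, $node unless $void{$tagname};
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tagname) = @_;
            # Pop back to the matching open element; ignore stray end tags.
            for my $i (reverse 1 .. $#stack) {
                if ($stack[$i]{tag} eq $tagname) {
                    splice @stack, $i;
                    last;
                }
            }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            # Skip whitespace-only text between tags.
            push @{ $stack[-1]{content} }, $text if $text =~ /\S/;
        }, 'dtext' ],
    );
    $p->unbroken_text(1);   # deliver each text run as a single event

    $p->parse($html);
    $p->eof;

    return $root->{content};
}

my $tree = html_to_tree(
    '<table id="maintable" width="300"><tr>'
  . '<td width="200">some content</td>'
  . '<td width="100">more content</td>'
  . '</tr></table>'
);
print Dumper($tree);
```

Because the end handler pops back to the nearest matching open element, a missing end tag does not corrupt the rest of the tree; the unclosed element simply keeps collecting children until one of its ancestors is closed.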

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

