Re: parsing HTML

Andrew Gaffney Wed, 21 Jul 2004 22:03:59 -0700

Andrew Gaffney wrote:

Randy W. Sims wrote:
On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
Randy W. Sims wrote:
On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
I am trying to build an HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:

my $html = [
  { tag => 'table', id => 'maintable', width => 300, content => [
    { tag => 'tr', content => [
      { tag => 'td', width => 200, content => "some content" },
      { tag => 'td', width => 100, content => "more content" },
    ] },
  ] },
]; # Not tested, but you get the idea
[snip]
I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?
Parsers like HTML::Parser scan a document and upon encountering certain tokens fire off events. In the case of HTML::Parser, events are fired when encountering a start tag, the text between tags, and at the end tag. If you have an arbitrarily deep document structure like HTML, you can store the structure using a stack:


<SNIP>
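
As an illustration of the event model described above (this is not the snipped code, just a minimal sketch), the following tracer logs each event HTML::Parser fires for a small fragment. The @events array and the sample markup are made up for the example; the handler argspecs ("tagname", "dtext") are the same ones used elsewhere in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

my @events;    # hypothetical log, one entry per event fired

my $p = HTML::Parser->new(
    api_version => 3,
    # start tag: record the (lowercased) tag name
    start_h => [ sub { push @events, "start <$_[0]>" }, "tagname" ],
    # end tag: tagname arrives without the leading slash
    end_h   => [ sub { push @events, "end </$_[0]>" }, "tagname" ],
    # text between tags: ignore whitespace-only chunks
    text_h  => [ sub { push @events, "text '$_[0]'" if $_[0] =~ /\S/ }, "dtext" ],
);
$p->unbroken_text(1);    # deliver each text run as a single event

$p->parse('<tr><td width="200">some content</td></tr>');
$p->eof;

print "$_\n" for @events;
```

The events arrive in document order, so a handler that pushes on every start tag and pops on every end tag always has the currently open element on top of its stack.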

Thanks. In the time it took you to put that together, I came up with the following to figure out how HTML::Parser works. I'll use your code to expand upon it.


<SNIP>

Here is my current working code. Please take a look at it and see if there are any obvious (or not so obvious) problems. I thought this would end up being far more difficult.

parsehtml.pl
============
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser ();

my $htmltree = [ { tag => 'document', content => [] } ];
my $node = $htmltree->[0]->{content};
my @prevnodes = ($htmltree);

sub start {
  my $tagname = shift;
  my $attr = shift;
  my $newnode = {};

  $newnode->{tag} = $tagname;
  foreach my $key(keys %{$attr}) {
    $newnode->{$key} = $attr->{$key};
  }
  $newnode->{content} = [];
  push @prevnodes, $node;
  push @{$node}, $newnode;
  $node = $newnode->{content};
}

sub end {
  my $tagname = shift;

  $node = pop @prevnodes;
}

sub text {
  my $text = shift;

  chomp $text;
  if($text ne '') {
    push @{$node}, $text;
  }
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           end_h   => [\&end,   "tagname"],
                           text_h  => [\&text,  "dtext"] );

$p->parse_file("test.html");

use Data::Dumper;
print Dumper $htmltree;

test.html
=========
<table id="maintable" width="300">
<tr>
<td width="200">some content</td>
<td width="100">more content</td>
</tr>
</table>
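
For comparison, here is one possible variant of the script above, offered as a sketch rather than a drop-in replacement. It keeps a stack of open nodes instead of a stack of content arrays, skips whitespace-only text with a regex (chomp only strips a trailing newline, so indentation between tags would otherwise be stored), and refuses to pop the root on a stray end tag. The build_tree helper and its variable names are hypothetical; the HTML is parsed from a string so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

# Hypothetical helper: parse an HTML string, return the root 'document' node.
sub build_tree {
    my ($html) = @_;
    my $root  = { tag => 'document', content => [] };
    my @stack = ($root);    # $stack[-1] is the currently open element

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            # Attributes are copied into the node; an attribute literally
            # named "tag" or "content" would be overwritten here.
            my $node = { %$attr, tag => $tagname, content => [] };
            push @{ $stack[-1]{content} }, $node;
            push @stack, $node;
        }, "tagname, attr" ],
        end_h => [ sub {
            pop @stack if @stack > 1;    # never pop the root node
        }, "tagname" ],
        text_h => [ sub {
            my ($text) = @_;
            return unless $text =~ /\S/;    # skip whitespace-only text
            push @{ $stack[-1]{content} }, $text;
        }, "dtext" ],
    );
    $p->unbroken_text(1);    # deliver each text run as a single event
    $p->parse($html);
    $p->eof;
    return $root;
}

my $tree = build_tree(<<'HTML');
<table id="maintable" width="300">
<tr>
<td width="200">some content</td>
<td width="100">more content</td>
</tr>
</table>
HTML

my $table = $tree->{content}[0];
my $td    = $table->{content}[0]{content}[0];
print "$table->{tag} id=$table->{id}: $td->{content}[0]\n";
```

Neither version handles elements that have no end tag (for example <br>): HTML::Parser fires only a start event for them, so they would stay open on the stack until some later end tag pops them.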

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

