On Friday, April 5, 2002, at 10:43, Paul Tremblay wrote:
[..]
> The problem is that the filter deletes all of my text and outputs this:
>
> [TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT
> SHOWN][TABLE NOT SHOWN]

Right! That is the big clue I should have seen: there is no
'plain html stuff', it's all stuffed in tables.

I just ran the code against a web page that is all one big
form, with some table foo on the inside, and got the
equivalent of your response.

>
> I have tried it on five different files. All of these files were
> from the same website. It appears that this module is broken.
> That is, it can't handle certain html (which is valid when looked
> at in a browser).

I just smelled the coffee: all of the 'information' that you
are looking for is being presented in tables, and in essence
these classes of web pages are little more than

<HTML><HEAD><TITLE>SomeBuzzHere</TITLE></HEAD>

<BODY BGCOLOR=#ffffff>
Table, table table.....
.....

maybe not even closed with

</BODY></HTML>

so what you want to do is spin up some code along the lines of:

     my $page   = '';
     my @tables = $tree->look_down( "_tag", "table" );

     foreach my $tab (@tables) {
         # header cells: one per output line
         my @Th_list = $tab->look_down( "_tag", "th" );

         foreach my $t (@Th_list) {
             next unless $t;
             foreach my $item_r ( $t->content_refs_list ) {
                 next if ref $$item_r;    # skip child elements, keep plain text
                 $page .= "$$item_r \n";
             }
         }

         # data cells: one output line per table row
         my @Tr_list = $tab->look_down( "_tag", "tr" );

         foreach my $tr (@Tr_list) {
             my @td_list = $tr->look_down( "_tag", "td" );

             foreach my $t (@td_list) {
                 foreach my $item_r ( $t->content_refs_list ) {
                     next if ref $$item_r;
                     $page .= "$$item_r ";
                 }
             }
             $page .= "\n" if @td_list;
         }
         $page .= "#---------\n";    # separator between tables
     }

     print $page;

so that you wind up sucking out the details from the table elements
themselves .....
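
A possibly simpler variant of the same idea, offered as a sketch rather
than a drop-in fix: HTML::Element's as_text method returns all of the text
under a node, so it also picks up text wrapped in inline tags such as <b>
or <a> inside a cell, which the 'next if ref' test skips. The sample HTML
and the variable names here are made up for illustration, and like the
loop it is modelled on, it makes no special effort with nested tables.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page: all of the content sits inside one table.
my $html = <<'END_HTML';
<html><head><title>SomeBuzzHere</title></head>
<body bgcolor="#ffffff">
<table>
<tr><th>Name</th><th>Value</th></tr>
<tr><td><b>alpha</b></td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # tell the parser there is no more input

my $page = '';
foreach my $table ( $tree->look_down( "_tag", "table" ) ) {
    foreach my $row ( $table->look_down( "_tag", "tr" ) ) {
        # match header and data cells alike
        my @cells = $row->look_down( "_tag", qr/^t[hd]$/ );
        next unless @cells;

        # as_text flattens everything under the cell into one string
        $page .= join( " ", map { $_->as_text } @cells ) . "\n";
    }
    $page .= "#---------\n";    # separator between tables
}
$tree->delete;    # free the tree once we are done with it

print $page;
```

Each table row becomes one line of output, with the cells separated by
a single space.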

the problem is not really with:

use strict;
use warnings;
use HTML::Parser;
use HTML::FormatText;
use HTML::TreeBuilder;

my $filename = shift or die "usage: $0 file.html\n";
open( my $fh, '<', $filename ) or die "unable to open file $filename: $!\n";
my $html_text = do { local $/; <$fh> };    # slurp the whole file
close($fh);

###my $plain_text = HTML::FormatText->new->format(parse_html($html_text));
my $tree = HTML::TreeBuilder->new;
$tree->parse($html_text);
$tree->eof;    # tell the parser there is no more input
my $plain_text = HTML::FormatText->new->format($tree);

print "$plain_text\n";

#----

save that it can only do what it does: HTML::FormatText renders each
table as [TABLE NOT SHOWN] instead of formatting what is inside it.

ciao
drieux
