Shawn Milochik wrote:
> 
> I have written a script which is very useful for me day-to-day.  It checks
> table structure in HTML files.  The script is working, but I would
> appreciate any comments, especially as to how this can be better written.

Ok, you asked for it.  :-)  If you want to check HTML files you might want to
use a program like weblint[0] or a module like HTML::Lint[1] or use an HTML
parser[2] to extract the tables instead of trying to do it with regular expressions.

[0] http://packages.debian.org/unstable/web/weblint.html
[1] http://search.cpan.org/author/PETDANCE/HTML-Lint-1.21/
[2] http://search.cpan.org/search?query=HTML+parse&mode=module


> Code follows:
> :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> #!/usr/bin/perl

Although your code compiles without warnings you should enable warnings when
developing your program.  You should also use strict and lexical variables.

use warnings;
use strict;


> #This script looks for bad table
> #structure in HTML pages.  When using
> #server-side scripting, do not run this
> #on the source page -- load the page in
> #a browser, view source, and save the source
> #as a file.
> 
> #This script does not take into account client-side
> #scripting on the page.  For example, if a Javascript
> #function writes out table elements in a loop, this
> #script will report an error.  This is just a helpful
> #tool for the web programmer who already knows what
> #they are doing.
> 
> #Created by Shawn Milochik, Oct 31 2002.  Released
> #free of any restrictions, and without any warranty.
> 
> #execution:  checktables.pl filename.html
> 
> open (IN, "<" . $ARGV[0]) || die "Could not open input file.\n";

You should include the filename and the $! variable in your error message.

open IN, "< $ARGV[0]" or die "Could not open $ARGV[0]: $!";


> $line = 0;
> $last = undef;
> 
> #Open each line of the file.
> while (<IN>){

You should have the chomp() here instead.

      chomp;


>     #Keep track of the line number, so we
>     #can tell the user which line of the HTML has
>     #the problem.
>     $line++;

Perl provides the $. variable which contains the current line number.


>     #Split the line on the > character.
>     @data = split />/, $_;
>     while (@data){
>         #Take the portion of the line containing an HTML tag
>         chomp($test = lc(shift(@data)));

You would normally use foreach to loop over the values in an array.  Also, the chomp
isn't required

      foreach ( @data ) {
          my $test = lc;

Or just:

      foreach ( split />/ ) {
          my $test = lc;


>         if (($test =~ /<t/) || ($test =~ /<\/t/)){

You can simplifiy this a bit.

          if ( $test =~ m{</?t} ) {


>             #print "Tag found in $test.\n";
> 
>             $curr = undef;
>             if ($test =~ "<table"){ $curr = "<table";}
>             if ($test =~ "<\/table"){ $curr = "<\/table";}
>             if ($test =~ "<tr"){ $curr = "<tr";}
>             if ($test =~ "<\/tr"){ $curr = "<\/tr";}
>             if ($test =~ "<td"){ $curr = "<td";}
>             if ($test =~ "<\/td"){ $curr = "<\/td";}

Slashes don't have to be escaped in strings or regular expressions that don't use
slashes as delimiters.

if ($test =~ /<\/table/){ $curr = q/<\/table/ } # must escape slashes!
if ($test =~ m^</table^){ $curr = q|</table| }  # OK


You can simplify those seven statements as:

            my $curr = $1 if $test =~ m!(</?t(?:able|d|r))\b!;


>             #if we found a valid table tag
>             if ($curr){
>                 #If this is not the first
>                 #iteration of this block
>                 if ($last){
>                     if ($last eq "<table"){
>                         if ($curr ne "<tr"){ print "Line $line: found $curr
> instead of <tr after <table from line $lastline.\n";}
>                     }
> 
>                     if ($last eq "<\/table"){
>                         if (($curr ne "<table") && ($curr ne "<\/td")){
> print "Line $line: found $curr instead of <\/td or <table after <\/table
> from line $lastline.\n";}
>                     }
> 
>                     if ($last eq "<tr"){
>                         if ($curr ne "<td"){ print "Line $line: found $curr
> instead of <td after $last from line $lastline.\n";}
>                     }
> 
>                     if ($last eq "<\/tr"){
>                         if (($curr ne "<\/table") && ($curr ne "<tr")){
> print "Line $line: found $curr instead of <tr or <\/table after $last from
> line $lastline.\n";}
>                     }
> 
>                     if ($last eq "<td"){
>                         if (($curr ne "<table") && ($curr ne "<\/td")){
> print "Line $line: found $curr instead of <table or <\/td after $last from
> line $lastline.\n";}
>                     }
> 
>                     if ($last eq "<\/td"){
>                         if (($curr ne "<td") && ($curr ne "<\/tr")){ print
> "Line $line: found $curr instead of <td or <\/tr after $last from line
> $lastline.\n";}
>                     }
> 
>                     $last = $curr;
>                     $lastline = $line;
>                 }else{
>                     #First iteration, initialize
>                     #$last
>                     $last = $curr;
>                 }#close curly brace for if ($last) block
>             }#close curly brace for if ($curr) block
> 
>         }#close curly brace for if (($test =~ /<t/) || ($test =~ /<\/t/))
> block
> 
>     }#close curly brace for while(@data) block
> }#close curly brace for while(<IN>) block
> 
> print "Check complete.\n";
> 
> :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
> End of Code


You can remove a lot of the indentation by using next and combining some of
those if statements.  In summary:

#!/usr/bin/perl
use warnings;
use strict;

#This script looks for bad table
#structure in HTML pages.  When using
#server-side scripting, do not run this
#on the source page -- load the page in
#a browser, view source, and save the source
#as a file.

#This script does not take into account client-side
#scripting on the page.  For example, if a Javascript
#function writes out table elements in a loop, this
#script will report an error.  This is just a helpful
#tool for the web programmer who already knows what
#they are doing.

#Created by Shawn Milochik, Oct 31 2002.  Released
#free of any restrictions, and without any warranty.

#execution:  checktables.pl filename.html

open IN, '<', $ARGV[0] or die "Could not open $ARGV[0]: $!";

my $lastline = 0;
my $last;

#Open each line of the file.
while ( <IN> ) {
    chomp;
    #Split the line on the > character.
    foreach ( map lc, split />/ ) {
        #Take the portion of the line containing an HTML tag

        next unless m!(</?t(?:able|d|r))\b!;
        #print "Tag found in $_.\n";

        my $curr = $1;

        #if we found a valid table tag
        #If this is not the first
        #iteration of this block
        unless ( $curr and $last ) {
            $last = $curr;
            next;
            }

        if ( $last eq '<table' and $curr ne '<tr' ) {
            print "Line $.: found $curr instead of <tr after <table from line 
$lastline.\n";
            }
        if ( $last eq '</table' and $curr ne '<table' and $curr ne '</td' ) {
            print "Line $.: found $curr instead of </td or <table after </table from 
line $lastline.\n";
            }
        if ( $last eq '<tr' and $curr ne '<td' ) {
            print "Line $.: found $curr instead of <td after $last from line 
$lastline.\n";
            }
        if ( $last eq '</tr' and $curr ne '</table' and $curr ne '<tr' ) {
            print "Line $.: found $curr instead of <tr or </table after $last from 
line $lastline.\n";
            }
        if ( $last eq '<td' and $curr ne '<table' and $curr ne '</td' ) {
            print "Line $.: found $curr instead of <table or </td after $last from 
line $lastline.\n";
            }
        if ( $last eq '</td' and $curr ne '<td' and $curr ne '</tr' ) {
            print "Line $.: found $curr instead of <td or </tr after $last from line 
$lastline.\n";
            }
        $last = $curr;
        $lastline = $.;
        }
    }

print "Check complete.\n";

__END__



John
-- 
use Perl;
program
fulfillment

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to