Shawn Milochik wrote: > > I have written a script which is very useful for me day-to-day. It checks > table structure in HTML files. The script is working, but I would > appreciate any comments, especially as to how this can be better written.
Ok, you asked for it. :-) If you want to check HTML files you might want to use a program like weblint[0] or a module like HTML::Lint[1] or use an HTML parser[2] to extract the tables instead of trying to do it with regular expressions. [0] http://packages.debian.org/unstable/web/weblint.html [1] http://search.cpan.org/author/PETDANCE/HTML-Lint-1.21/ [2] http://search.cpan.org/search?query=HTML+parse&mode=module > Code follows: > ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: > #!/usr/bin/perl Although your code compiles without warnings you should enable warnings when developing your program. You should also use strict and lexical variables. use warnings; use strict; > #This script looks for bad table > #structure in HTML pages. When using > #server-side scripting, do not run this > #on the source page -- load the page in > #a browser, view source, and save the source > #as a file. > > #This script does not take into account client-side > #scripting on the page. For example, if a Javascript > #function writes out table elements in a loop, this > #script will report an error. This is just a helpful > #tool for the web programmer who already knows what > #they are doing. > > #Created by Shawn Milochik, Oct 31 2002. Released > #free of any restrictions, and without any warranty. > > #execution: checktables.pl filename.html > > open (IN, "<" . $ARGV[0]) || die "Could not open input file.\n"; You should include the filename and the $! variable in your error message. open IN, "< $ARGV[0]" or die "Could not open $ARGV[0]: $!"; > $line = 0; > $last = undef; > > #Open each line of the file. > while (<IN>){ You should have the chomp() here instead. chomp; > #Keep track of the line number, so we > #can tell the user which line of the HTML has > #the problem. > $line++; Perl provides the $. variable which contains the current line number. > #Split the line on the > character. > @data = split />/, $_; > while (@data){ > #Take the portion of the line containing an HTML tag > chomp($test = lc(shift(@data))); You would normally use foreach to loop over the values in an array. Also, the chomp isn't required foreach ( @data ) { my $test = lc; Or just: foreach ( split />/ ) { my $test = lc; > if (($test =~ /<t/) || ($test =~ /<\/t/)){ You can simplifiy this a bit. if ( $test =~ m{</?t} ) { > #print "Tag found in $test.\n"; > > $curr = undef; > if ($test =~ "<table"){ $curr = "<table";} > if ($test =~ "<\/table"){ $curr = "<\/table";} > if ($test =~ "<tr"){ $curr = "<tr";} > if ($test =~ "<\/tr"){ $curr = "<\/tr";} > if ($test =~ "<td"){ $curr = "<td";} > if ($test =~ "<\/td"){ $curr = "<\/td";} Slashes don't have to be escaped in strings or regular expressions that don't use slashes as delimiters. if ($test =~ /<\/table/){ $curr = q/<\/table/ } # must escape slashes! if ($test =~ m^</table^){ $curr = q|</table| } # OK You can simplify those seven statements as: my $curr = $1 if $test =~ m!(</?t(?:able|d|r))\b!; > #if we found a valid table tag > if ($curr){ > #If this is not the first > #iteration of this block > if ($last){ > if ($last eq "<table"){ > if ($curr ne "<tr"){ print "Line $line: found $curr > instead of <tr after <table from line $lastline.\n";} > } > > if ($last eq "<\/table"){ > if (($curr ne "<table") && ($curr ne "<\/td")){ > print "Line $line: found $curr instead of <\/td or <table after <\/table > from line $lastline.\n";} > } > > if ($last eq "<tr"){ > if ($curr ne "<td"){ print "Line $line: found $curr > instead of <td after $last from line $lastline.\n";} > } > > if ($last eq "<\/tr"){ > if (($curr ne "<\/table") && ($curr ne "<tr")){ > print "Line $line: found $curr instead of <tr or <\/table after $last from > line $lastline.\n";} > } > > if ($last eq "<td"){ > if (($curr ne "<table") && ($curr ne "<\/td")){ > print "Line $line: found $curr instead of <table or <\/td after $last from > line $lastline.\n";} > } > > if ($last eq "<\/td"){ > if (($curr ne "<td") && ($curr ne "<\/tr")){ print > "Line $line: found $curr instead of <td or <\/tr after $last from line > $lastline.\n";} > } > > $last = $curr; > $lastline = $line; > }else{ > #First iteration, initialize > #$last > $last = $curr; > }#close curly brace for if ($last) block > }#close curly brace for if ($curr) block > > }#close curly brace for if (($test =~ /<t/) || ($test =~ /<\/t/)) > block > > }#close curly brace for while(@data) block > }#close curly brace for while(<IN>) block > > print "Check complete.\n"; > > ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: > End of Code You can remove a lot of the indentation by using next and combining some of those if statements. In summary: #!/usr/bin/perl use warnings; use strict; #This script looks for bad table #structure in HTML pages. When using #server-side scripting, do not run this #on the source page -- load the page in #a browser, view source, and save the source #as a file. #This script does not take into account client-side #scripting on the page. For example, if a Javascript #function writes out table elements in a loop, this #script will report an error. This is just a helpful #tool for the web programmer who already knows what #they are doing. #Created by Shawn Milochik, Oct 31 2002. Released #free of any restrictions, and without any warranty. #execution: checktables.pl filename.html open IN, '<', $ARGV[0] or die "Could not open $ARGV[0]: $!"; my $lastline = 0; my $last; #Open each line of the file. while ( <IN> ) { chomp; #Split the line on the > character. foreach ( map lc, split />/ ) { #Take the portion of the line containing an HTML tag next unless m!(</?t(?:able|d|r))\b!; #print "Tag found in $_.\n"; my $curr = $1; #if we found a valid table tag #If this is not the first #iteration of this block unless ( $curr and $last ) { $last = $curr; next; } if ( $last eq '<table' and $curr ne '<tr' ) { print "Line $.: found $curr instead of <tr after <table from line $lastline.\n"; } if ( $last eq '</table' and $curr ne '<table' and $curr ne '</td' ) { print "Line $.: found $curr instead of </td or <table after </table from line $lastline.\n"; } if ( $last eq '<tr' and $curr ne '<td' ) { print "Line $.: found $curr instead of <td after $last from line $lastline.\n"; } if ( $last eq '</tr' and $curr ne '</table' and $curr ne '<tr' ) { print "Line $.: found $curr instead of <tr or </table after $last from line $lastline.\n"; } if ( $last eq '<td' and $curr ne '<table' and $curr ne '</td' ) { print "Line $.: found $curr instead of <table or </td after $last from line $lastline.\n"; } if ( $last eq '</td' and $curr ne '<td' and $curr ne '</tr' ) { print "Line $.: found $curr instead of <td or </tr after $last from line $lastline.\n"; } $last = $curr; $lastline = $.; } } print "Check complete.\n"; __END__ John -- use Perl; program fulfillment -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]