Thanks Lars for the tool.

I wrote exactly the same thing in Perl (at your request!) some time ago. I
have attached it to this mail.

I don't know which version is better. It looks like Lars' implementation
hard-codes a lot of HTML tags for processing. Mine is based on Perl's
HTML::Parser class and is thus independent of any specific HTML tags.


Thanks,

Chris

--
Christian Schwarz
[EMAIL PROTECTED], [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED]
PGP-fp: 8F 61 EB 6D CF 23 CA D7  34 05 14 5C C8 DC 22 BA

Debian is looking for a logo! Have a look at our drafts at
http://fatman.mathematik.tu-muenchen.de/~schwarz/debian-logo/

#!/usr/bin/perl
#
# fixhtmlgz 0.2
# Copyright (c) 1997 by Christian Schwarz <[EMAIL PROTECTED]>
# May be distributed under GPL 2.
#

# Specification:
#
# Currently, we have a problem with compressed HTML: we can access
# compressed HTML fine, but links don't work very well. The problem
# is that the link says "foo.html", and the actual file is
# "foo.html.gz", and the browsers and servers aren't intelligent
# enough to handle this invisibly. This means that we can't install
# compressed HTML if it contains links.
# 
# We need a program that can be run on uncompressed HTML, which converts
# local links to the compressed versions of the files. Usage would
# be something like:
# 
#         fixhtmlgz file.html ...
# 
#         - read file.html
#         - for each link <a href="foo.html">, if foo.html exists,
#           convert the link to foo.html.gz instead
#         - otherwise, do not modify the link
#         - output is either to file.html.fixed or file.html (replace
#           original with modified version)
#
# Changes:
#      v0.2:
#         - now handles gzipped files
#         - parse .html and .htm files
#         - changed replacement rule: change href to refer to the
#           file as it actually exists. Example:
#               <a href="foo.html"> is converted to foo.html.gz
#               only if foo.html.gz exists; the existence of
#               foo.html alone no longer triggers the conversion.
# 

package Parser; #-------------------------------
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub declaration {
  my ($self, $decl) = @_;
  print ::OUT "<!$decl>";
}

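sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;

  if ($tag eq 'a' and defined(my $href = $$attr{'href'})) {
    # Split off a leading scheme ("file:", "http:", ...) and a trailing
    # "#anchor". Copy $1 only when the substitution succeeded: after a
    # failed match $1 still holds the value of an earlier match.
    my $type   = ($href =~ s/^([A-Za-z][A-Za-z0-9+.-]*:)//) ? $1 : '';
    my $anchor = ($href =~ s/(\#.*)$//) ? $1 : '';
    #print "href: ($type,$href,$anchor)\n";

    # Only local links (no scheme, or file:) are candidates.
    if ($type eq '' or $type =~ /^file:$/i) {
      # v0.2 rule: rewrite .html/.htm links only if the compressed
      # file actually exists.
      if (($href =~ /\.html?$/) and -f "$href.gz") {
        # append `.gz'
        $$attr{'href'} = "$type$href.gz$anchor";
        # rebuild origtext, keeping the original attribute order.
        $origtext = "<a";
        for my $name (@$attrseq) {
          if (defined $$attr{$name}) {
            $origtext .= " $name=\"$$attr{$name}\"";
          } else {
            $origtext .= " $name";
          }
        }
        $origtext .= ">";
      }
    }
  }

  # Every other tag (and every untouched link) passes through unchanged.
  print ::OUT $origtext;
}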

sub end {
  my ($self, $tag) = @_;
  print ::OUT "</$tag>";
}

sub text {
  my ($self, $text) = @_;
  print ::OUT "$text";
}

sub comment {
  my ($self, $comment) = @_;
  print ::OUT "<!--$comment-->";
}

#########################################################################

package main;

if ($#ARGV == -1) {
  print "usage: fixhtmlgz <html file> ...\n";
  exit 1;
}

$p = Parser->new;

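while (defined($filename = shift)) {
  if ( ! -f $filename ) {
    print STDERR "error: file $filename not found, skipping.\n";
    next;
  }

  # Write the converted HTML to a temporary file first.
  $output = "$filename.fixed";
  open(OUT, '>', $output) or die "cannot open output file $output: $!";

  $p->parse_file($filename);

  close(OUT) or die "cannot close output file $output: $!";

  # Keep the original as a backup, then move the fixed version into place.
  rename($filename,"$filename.bak") or die "cannot rename $filename: $!";
  rename($output,$filename) or die "cannot rename $output: $!";
}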

exit 0;
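
For anyone who wants to try the v0.2 replacement rule without running the
whole parser, here is a minimal standalone sketch. It is not part of
fixhtmlgz: the name rewrite_href and the optional $exists callback are
made up for illustration. It also assumes one reading of the rule, namely
that a link is left alone when the uncompressed file is still present,
even if a .gz copy exists next to it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of fixhtmlgz): returns the href to emit.
# $exists is an optional code ref that tests for a file; by default it
# is a plain -f check relative to the current directory.
sub rewrite_href {
    my ($href, $exists) = @_;
    $exists ||= sub { -f $_[0] };

    my ($type, $anchor) = ('', '');

    # Split off a leading scheme such as "file:" or "http:".
    $type = $1 if $href =~ s/^([A-Za-z][A-Za-z0-9+.-]*:)//;

    # Split off a trailing "#fragment".
    $anchor = $1 if $href =~ s/(#.*)$//;

    # Only local links (no scheme, or file:) are candidates.
    my $local = $type eq '' || lc($type) eq 'file:';

    # Assumption: rewrite only when the .gz file exists and the
    # uncompressed file does not.
    if ($local
        && $href =~ /\.html?$/
        && !$exists->($href)
        && $exists->("$href.gz")) {
        $href .= '.gz';
    }

    return "$type$href$anchor";
}

# Usage with a fake file list instead of the real filesystem.
my %files  = map { $_ => 1 } ('foo.html.gz', 'bar.html');
my $exists = sub { $files{ $_[0] } };

print rewrite_href('foo.html#top', $exists), "\n";                # foo.html.gz#top
print rewrite_href('bar.html', $exists), "\n";                    # bar.html
print rewrite_href('http://example.org/foo.html', $exists), "\n"; # unchanged
```

Passing the existence check in as a callback is only a convenience for
trying the rule out; the script itself tests the filesystem directly.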
