Thanks, Lars, for the tool. I wrote exactly the same thing in Perl (at your request!) some time ago. I have attached it to this mail.
I don't know which version is better. Lars' implementation appears to hard-code a lot of HTML tags for processing. Mine is based on Perl's HTML::Parser class and is thus independent of any specific HTML tags.

Thanks,

Chris

-- 
Christian Schwarz
[EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
PGP-fp: 8F 61 EB 6D CF 23 CA D7 34 05 14 5C C8 DC 22 BA

Debian is looking for a logo! Have a look at our drafts at
http://fatman.mathematik.tu-muenchen.de/~schwarz/debian-logo/

#!/usr/bin/perl
#
# fixhtmlgz 0.2
# Copyright (c) 1997 by Christian Schwarz <[EMAIL PROTECTED]>
# May be distributed under GPL 2.
#
# Specification:
#
#   Currently, we have a problem with compressed HTML: we can access
#   compressed HTML fine, but links don't work very well. The problem
#   is that the link says "foo.html", and the actual file is
#   "foo.html.gz", and the browsers and servers aren't intelligent
#   enough to handle this invisibly. This means that we can't install
#   compressed HTML, if it contains links.
#
#   We need a program that can be run on uncompressed HTML, which converts
#   local links to the compressed versions of the files. Usage would
#   be something like:
#
#       fixhtmlgz file.html ...
#
#   - read file.html
#   - for each link <a href="foo.html">, if foo.html exists,
#     convert the link to foo.html.gz instead
#   - otherwise, do not modify the link
#   - output is either to file.html.fixed or file.html (replace
#     original with modified version)
#
# Changes:
#   v0.2:
#     - now handles gzipped files
#     - parse .html and .htm files
#     - changed replacing rule: change href to refer to the
#       file, as it actually exists. Example:
#         <a href="foo.html"> will only be converted to
#         foo.html.gz, if this file exists, and not if
#         foo.html exists.
#

use strict;
use warnings;

package Parser;
#-------------------------------

require HTML::Parser;
our @ISA = qw(HTML::Parser);

# Each callback writes its piece of the document to the OUT filehandle
# opened in package main, so everything except rewritten links is
# passed through.

sub declaration {
    my ($self, $decl) = @_;
    print ::OUT "<!$decl>";
}

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;

    if ($tag eq 'a') {
        if (my $href = $attr->{'href'}) {
            # Split off an optional scheme prefix such as "file:".
            # $1 is read only after a successful match, so a stale
            # capture from an earlier match can never leak in.
            my $type = ($href =~ s/^(\S+:)//) ? $1 : '';

            # Only local links are rewritten: no scheme, or "file:".
            if ($type eq '' or $type =~ /file:/i) {
                # Split off an optional "#anchor" suffix.
                my $anchor = ($href =~ s/(\#.*)$//) ? $1 : '';
                #print "href: ($type,$href,$anchor)\n";

                if (($href =~ /\.html$/) and -f $href) {
                    # append `.gz'
                    $attr->{'href'} = "$type$href.gz$anchor";

                    # rebuild origtext.
                    $origtext = "<a";
                    for my $name (@$attrseq) {
                        if (defined $attr->{$name}) {
                            $origtext .= " $name=\"$attr->{$name}\"";
                        } else {
                            $origtext .= " $name";
                        }
                    }
                    $origtext .= ">";
                }
            }
        }
    }

    print ::OUT "$origtext";
}

sub end {
    my ($self, $tag) = @_;
    print ::OUT "</$tag>";
}

sub text {
    my ($self, $text) = @_;
    print ::OUT "$text";
}

sub comment {
    my ($self, $comment) = @_;
    print ::OUT "<!--$comment-->";
}

#########################################################################

package main;

if (@ARGV == 0) {
    print "usage: fixhtmlgz <html file> ...\n";
    exit 1;
}

my $p = Parser->new;

while (defined(my $filename = shift @ARGV)) {
    if (! -f $filename) {
        print "error: file $filename not found, skipping.\n";
        next;
    }

    # Write the fixed copy first, then swap it in and keep a backup.
    my $output = "$filename.fixed";
    open(OUT, ">$output") or die "cannot open output file $output: $!";
    $p->parse_file($filename);
    close(OUT);

    rename($filename, "$filename.bak") or die "cannot rename $filename: $!";
    rename($output, $filename) or die "cannot rename $output: $!";
}

exit 0;
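
The v0.2 replacing rule described in the header comments (point the href at foo.html.gz only if that file exists and foo.html does not) can also be looked at in isolation. The following is only a minimal sketch of that rule, not code taken from fixhtmlgz: the helper name fix_href and the way it splits the href into scheme, path and anchor are assumptions made for illustration, and paths are checked relative to the current directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fix_href is a hypothetical helper, not part of fixhtmlgz itself.
# It returns the href rewritten to "<path>.gz" when only the compressed
# file exists on disk, and the href unchanged in every other case.
sub fix_href {
    my ($href) = @_;

    # Split into optional scheme, path, and optional "#anchor".
    my ($scheme, $path, $anchor) =
        $href =~ /^([A-Za-z][A-Za-z0-9+.-]*:)?([^#]*)(#.*)?$/s;
    return $href unless defined $path;
    $scheme = '' unless defined $scheme;
    $anchor = '' unless defined $anchor;

    # Leave remote links (http:, mailto:, ...) untouched.
    return $href unless $scheme eq '' or lc($scheme) eq 'file:';

    # Only .html and .htm targets are candidates.
    return $href unless $path =~ /\.html?$/;

    # Rewrite only if the plain file is missing and the .gz one exists.
    if ((! -f $path) and (-f "$path.gz")) {
        return "$scheme$path.gz$anchor";
    }
    return $href;
}

# Print the (possibly rewritten) form of each href given on the command line.
print fix_href($_), "\n" for @ARGV;
```

Run from the directory that holds the HTML files, with one or more hrefs as arguments, it prints one result per line. Keeping the rule in a function of its own would make it easy to try out separately from the parser callbacks.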