> > > Seems like the same results would be achieved by not opening it at
> > > all the first time through.
> >
> > You're right -- I commented out the extra slurpage, moved close(OUT)
> > and close(IN), and it still worked.
>
> So basically you have a very elaborate and expensive no-op.

Story of my life. ;-)

> If you really want the whole document to appear to be slurped (which it
> isn't currently), then rather than pushing to the array @cleandoc,
> concatenate to a scalar with '.=', along the lines of:
>
>     my $doc;                    # declared outside the loop so it accumulates
>     while (<>) { $doc .= $_; }
>
> Whether that's worth doing really depends on your purposes (i.e., do you
> ever need the whole file in memory at once?) and on the size of your
> files (it's difficult to slurp a 2 GB file into 1 GB of RAM)...

It's always an HTML file under 200K, so there should never be that
problem....

> You might also be better off using an HTML parser; in general it will do
> a better (and often faster) job than a whole list of regexes, which are
> almost guaranteed to mess up on oddly structured HTML. There are several
> available on CPAN.

I looked at HTML::Parser, and I'd have to "teach" it about the funky
WordPerfect-generated tags. This file is a newsletter that is published in
exactly the same format, exactly the same way, with no alterations in the
structure. The only thing that changes is the content of the articles, so
the homemade regexes seem to work fine on any of the newsletters they get
fed. In general, point well taken, though.

> http://danconia.org

> > #!/usr/bin/perl -w
> > use strict;
> >
> > $/ = '';    # Let's slurp in paragraph mode to begin with
> >
> > my $inputfile  = '/path/to/DU051503USBZ.htm';
> > my $outputfile = '/path/to/out.htm';
> >
> > my @cleandoc = ();  # Rather than printing the whole document to the
> >                     # new file a line at a time, we'll build an array
> >                     # and simply print the array to the document.
> >                     # Whether this saves any CPU or not is beyond me...
> >
> > # Clean the crud out of the uploaded file and print to a new, clean file
> >
> > open(IN, "<$inputfile") or die qq(Couldn't open $inputfile: $!\n);
> > open(OUT, ">$outputfile") or die qq(Couldn't open $outputfile: $!\n);
> > while (<IN>) {
> >     chomp;
> >     $_ =~ s/<\/*html>//ig;    # We'll be printing our own headers
> >     $_ =~ s/<\/*head>//ig;
> >     $_ =~ s/<title>(.*?)<\/title>//ig;
> >     $_ =~ s/<\/*body(.*?)>//ig;
> >     $_ =~ s/(<p align="center">)(U.S. BUSINESS JOURNAL)//ig;
> >     $_ =~ s/(<p align="center">)*(\(The Nation's Oldest Daily Business E-Newspaper\))//ig;
> >     $_ =~ s/strong>/b>/ig;
> >     $_ =~ s/<\/*u>//ig;       # Icky, Netscape doesn't like <u>
> >     $_ =~ s/<br wp=(.*?)>{1,}//ig;   # Get rid of bizarre WordPerfect sludge!
> >     $_ =~ s/\r{2,}//g;
> >     $_ =~ s/\n{2,}//g;        # Probably not necessary with chomp
> >     $_ =~ s/\^M//ig;          # Windows can't be trusted
> >     push(@cleandoc, $_);
> > }
> > print OUT @cleandoc;
> > close(OUT);
> > close(IN);
> >
> > # Now let's slurp in whole document mode and print each of the sections,
> > # split on <p align="center">, to the outfile so each section is on a
> > # line of its own. We'll use this again later to open the cleaned-up
> > # file in line mode and print the results to our database.
> >
> > undef $/;
> > open(OUT, ">$outputfile") or die qq(Couldn't open $outputfile: $!\n);
> > foreach my $section (@cleandoc) {
> >     chomp($section);
> >     my $chunk = join "\n", split /<p align="center">/, $section;
> >     print OUT "$chunk";
> > }
> >
> > close(OUT);
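One thing I noticed while retyping the script: in s/\^M//ig the backslash
makes the pattern match a literal caret followed by an "M", not a
carriage-return character. If the intent is to catch stray CRs that the
\r{2,} line misses, a plain substitution covers both cases:

    s/\r//g;    # strips every carriage return, single or doubled,
                # which also makes the \r{2,} line above redundant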
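For the archive, here is roughly what the script collapses to with the
extra slurp removed: each cleaned paragraph is split on <p align="center">
and written out immediately, so the output file is opened exactly once.
Untested sketch; the paths are the same, and the substitution list is
abbreviated to the first few (the rest carry over unchanged):

    #!/usr/bin/perl -w
    use strict;

    $/ = '';    # paragraph mode, as before

    my $inputfile  = '/path/to/DU051503USBZ.htm';
    my $outputfile = '/path/to/out.htm';

    open(IN,  "<$inputfile")  or die qq(Couldn't open $inputfile: $!\n);
    open(OUT, ">$outputfile") or die qq(Couldn't open $outputfile: $!\n);

    while (<IN>) {
        chomp;
        s/<\/*html>//ig;              # same cleanup as the original loop
        s/<\/*head>//ig;
        s/<title>(.*?)<\/title>//ig;
        s/<br wp=(.*?)>{1,}//ig;      # ...and so on for the rest
        # split on the section marker and write immediately --
        # no @cleandoc, no second pass over the data
        print OUT join "\n", split /<p align="center">/;
    }

    close(OUT);
    close(IN);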
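And if I ever do need a genuine slurp, I gather the idiomatic way is to
localize $/ rather than concatenating line by line -- a minimal sketch,
reusing $inputfile from above:

    my $doc = do {
        local $/;    # undef $/ makes the next read grab the whole file
        open my $fh, '<', $inputfile
            or die qq(Couldn't open $inputfile: $!\n);
        <$fh>;
    };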
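Finally, should the parser route ever become necessary, I gather
HTML::TokeParser (a friendlier front end to HTML::Parser) would make the
"teaching" fairly painless. A rough, untested sketch that drops <u> tags
and any <br> carrying a wp attribute, and passes everything else through
to the OUT handle from above -- assuming the WordPerfect markers really
are ordinary attributes, which I haven't verified:

    use HTML::TokeParser;

    my $p = HTML::TokeParser->new($inputfile)
        or die qq(Couldn't open $inputfile: $!\n);

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S') {        # start tag: [$type, $tag, $attr, $attrseq, $text]
            my ($tag, $attr, $text) = @{$t}[1, 2, 4];
            next if $tag eq 'u';                         # Netscape-unfriendly <u>
            next if $tag eq 'br' && exists $attr->{wp};  # WordPerfect sludge
            print OUT $text;
        }
        elsif ($type eq 'E') {     # end tag: [$type, $tag, $text]
            next if $t->[1] eq 'u';
            print OUT $t->[2];
        }
        elsif ($type eq 'T') {     # text: [$type, $text, $is_data]
            print OUT $t->[1];
        }
        # comments, declarations, and processing instructions fall
        # through and are dropped
    }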