This accomplishes what I want it to, but it's ugly. Any ideas on how I can achieve the same result without opening and closing the OUT file twice? I'll have to open it yet again to populate a DB column with the stripped HTML, and it seems like WAY too much file activity when the same work could be done inside the script without touching the file in between.
#!/usr/bin/perl -w

use strict;

$/ = ''; # Let's slurp in paragraph mode to begin with

my $inputfile  = '/path/to/DU051503USBZ.htm';
my $outputfile = '/path/to/out.htm';
my @cleandoc   = (); # Rather than printing the whole document to the new file
                     # a line at a time, we'll build an array and simply print
                     # the array to the document. Whether this saves any CPU or
                     # not is beyond me...

# Clean the crud out of the uploaded file and print to a new, clean file
open(IN, "<$inputfile") or die qq(Couldn\'t open $inputfile: $! \n);
open(OUT, ">$outputfile") or die qq(Couldn\'t open $outputfile: $! \n);

while(<IN>) {
    chomp;
    $_ =~ s/<\/*html>//ig; # We'll be printing our own headers
    $_ =~ s/<\/*head>//ig;
    $_ =~ s/<title>(.*?)<\/title>//ig;
    $_ =~ s/<\/*body(.*?)>//ig;
    $_ =~ s/(<p align=\"center\">)(U.S. BUSINESS JOURNAL)//ig;
    $_ =~ s/(<p align=\"center\">)*(\(The Nation\'s Oldest Daily Business E-Newspaper\))//ig;
    $_ =~ s/strong>/b>/ig;
    $_ =~ s/<\/*u>//ig; # Icky, Netscape doesn't like <u>
    $_ =~ s/<br wp=(.*?)>{1,}//ig; # Get rid of bizarre WordPerfect sludge!
    $_ =~ s/\r{2,}//g;
    $_ =~ s/\n{2,}//g; # Probably not necessary with chomp
    $_ =~ s/\^M//ig; # Windows can't be trusted
    push(@cleandoc, $_);
}

print OUT @cleandoc;
close(OUT);
close(IN);

# Now let's slurp in whole document mode and print each of the sections,
# split on <p align="center">, to the outfile so each section is on a
# line of its own. We'll use this again later to open the cleaned-up file
# in line mode and print the results to our database.
undef $/;

open(OUT, ">$outputfile") or die qq(Couldn\'t open $outputfile: $! \n);

foreach my $section(@cleandoc) {
    chomp($section);
    my $chunk = join "\n", split /<p align=\"center\">/, $section;
    print OUT "$chunk";
}

close(OUT);
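For comparison, here is a rough single-pass sketch of the same idea: slurp the file once, run the substitutions on the whole string, split on <p align="center">, and write the outfile exactly once. It is only a sketch under some assumptions, not a drop-in replacement. The helper name clean_sections is made up for illustration, the paths are taken from the command line rather than hard-coded, and it assumes each section really should end up on a single line, so it collapses whitespace inside a section and drops empty sections. That means its output will not be byte-identical to what the script above writes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): clean the whole
# document in memory and return one string per section.
sub clean_sections {
    my ($html) = @_;

    for ($html) {
        s/\r//g;                               # drop carriage returns (Windows line endings)
        s{</?(?:html|head|u)>}{}ig;            # we print our own headers; no <u> either
        s{<title>.*?</title>}{}igs;
        s{</?body[^>]*>}{}ig;
        s{<p align="center">U\.S\. BUSINESS JOURNAL}{}ig;
        s{(?:<p align="center">)*\(The Nation's Oldest Daily Business E-Newspaper\)}{}ig;
        s{strong>}{b>}ig;
        s{<br wp=[^>]*>+}{}ig;                 # WordPerfect leftovers
    }

    # Assumption: one section per <p align="center">, each flattened onto
    # a single line, with empty sections discarded.
    my @sections;
    for my $chunk (split /<p align="center">/i, $html) {
        $chunk =~ s/^\s+|\s+$//g;              # trim leading and trailing whitespace
        $chunk =~ s/\s+/ /g;                   # collapse internal whitespace runs
        push @sections, $chunk if length $chunk;
    }
    return @sections;
}

# Only touch the filesystem when both paths are given on the command line.
if (@ARGV == 2) {
    my ($inputfile, $outputfile) = @ARGV;

    open(my $in, '<', $inputfile) or die "Couldn't open $inputfile: $!\n";
    my $html = do { local $/; <$in> };         # slurp the whole file in one read
    close($in);

    my @sections = clean_sections($html);

    open(my $out, '>', $outputfile) or die "Couldn't open $outputfile: $!\n";
    print {$out} "$_\n" for @sections;         # the outfile is written exactly once
    close($out) or die "Couldn't close $outputfile: $!\n";

    # @sections is still in memory here, so a later database insert could
    # loop over it directly instead of reopening $outputfile.
}
```

Saved under some name such as clean.pl (also made up), it would be run as: perl clean.pl /path/to/DU051503USBZ.htm /path/to/out.htm. Using local $/ inside the do block keeps the change to the input record separator confined to that one read, so nothing else in the script is affected by it.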