> > > Seems like the same results would be achieved by not opening it at
> > > all the first time through.
> >
> > You're right -- I commented out the extra slurpage, moved close(OUT)
> > and close(IN), and it still worked.
>
> So basically you have a very elaborate and expensive no-op.

Story of my life. ;-)

> If you really want the whole document to appear to be slurped (which it
> isn't currently), then rather than pushing to the array @cleandoc,
> concatenate to a scalar with '.=', along the lines of:
>
>     my $doc;                    # declared outside the loop so it accumulates
>     while (<>) { $doc .= $_; }
>
> Whether that's worth doing really depends on your purposes (i.e., do you
> ever need the whole file in memory at once?) and on the size of your
> files (it's difficult to slurp a 2 GB file into 1 GB of RAM)...

It's always an HTML file under 200K, so there should never be that
problem....

> You might also be better off using an HTML parser; in general it will do
> a better (and often faster) job than a whole list of regexes, which are
> almost guaranteed to mess up on oddly structured HTML. There are several
> available on CPAN.

I looked at HTML::Parser, and I'd have to "teach" it about the funky
WordPerfect-generated tags. This file is a newsletter that is published in
exactly the same format, exactly the same way, with no alterations in the
structure. The only thing that changes is the content of the articles, so
the homemade regexes seem to work fine on any of the newsletters they get
fed. In general, point well taken, though.

> http://danconia.org

> > #!/usr/bin/perl -w
> > use strict;
> >
> > $/ = '';    # Let's slurp in paragraph mode to begin with
> >
> > my $inputfile  = '/path/to/DU051503USBZ.htm';
> > my $outputfile = '/path/to/out.htm';
> >
> > my @cleandoc = ();  # Rather than printing the whole document to the
> >                     # new file a line at a time, we'll build an array
> >                     # and simply print the array to the document.
> >                     # Whether this saves any CPU or not is beyond me...
> >
> > # Clean the crud out of the uploaded file and print to a new, clean file
> >
> > open(IN, "<$inputfile") or die qq(Couldn't open $inputfile: $!\n);
> > open(OUT, ">$outputfile") or die qq(Couldn't open $outputfile: $!\n);
> > while (<IN>) {
> >     chomp;
> >     $_ =~ s/<\/*html>//ig;    # We'll be printing our own headers
> >     $_ =~ s/<\/*head>//ig;
> >     $_ =~ s/<title>(.*?)<\/title>//ig;
> >     $_ =~ s/<\/*body(.*?)>//ig;
> >     $_ =~ s/(<p align="center">)(U.S. BUSINESS JOURNAL)//ig;
> >     $_ =~ s/(<p align="center">)*(\(The Nation's Oldest Daily Business E-Newspaper\))//ig;
> >     $_ =~ s/strong>/b>/ig;
> >     $_ =~ s/<\/*u>//ig;       # Icky, Netscape doesn't like <u>
> >     $_ =~ s/<br wp=(.*?)>{1,}//ig;   # Get rid of bizarre WordPerfect sludge!
> >     $_ =~ s/\r{2,}//g;
> >     $_ =~ s/\n{2,}//g;        # Probably not necessary with chomp
> >     $_ =~ s/\^M//ig;          # Windows can't be trusted
> >     push(@cleandoc, $_);
> > }
> > print OUT @cleandoc;
> > close(OUT);
> > close(IN);
> >
> > # Now let's slurp in whole document mode and print each of the sections,
> > # split on <p align="center">, to the outfile so each section is on a
> > # line of its own. We'll use this again later to open the cleaned-up
> > # file in line mode and print the results to our database.
> >
> > undef $/;
> > open(OUT, ">$outputfile") or die qq(Couldn't open $outputfile: $!\n);
> > foreach my $section (@cleandoc) {
> >     chomp($section);
> >     my $chunk = join "\n", split /<p align="center">/, $section;
> >     print OUT "$chunk";
> > }
> >
> > close(OUT);
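One thing I noticed while retyping the script: in s/\^M//ig the backslash
makes the pattern match a literal caret followed by an "M", not a
carriage-return character. If the intent is to catch stray CRs that the
\r{2,} line misses, a plain substitution covers both cases:

    s/\r//g;    # strips every carriage return, single or doubled,
                # which also makes the \r{2,} line above redundant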
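For the archive, here is roughly what the script collapses to with the
extra slurp removed: each cleaned paragraph is split on <p align="center">
and written out immediately, so the output file is opened exactly once.
Untested sketch; the paths are the same, and the substitution list is
abbreviated to the first few (the rest carry over unchanged):

    #!/usr/bin/perl -w
    use strict;

    $/ = '';    # paragraph mode, as before

    my $inputfile  = '/path/to/DU051503USBZ.htm';
    my $outputfile = '/path/to/out.htm';

    open(IN,  "<$inputfile")  or die qq(Couldn't open $inputfile: $!\n);
    open(OUT, ">$outputfile") or die qq(Couldn't open $outputfile: $!\n);

    while (<IN>) {
        chomp;
        s/<\/*html>//ig;              # same cleanup as the original loop
        s/<\/*head>//ig;
        s/<title>(.*?)<\/title>//ig;
        s/<br wp=(.*?)>{1,}//ig;      # ...and so on for the rest
        # split on the section marker and write immediately --
        # no @cleandoc, no second pass over the data
        print OUT join "\n", split /<p align="center">/;
    }

    close(OUT);
    close(IN);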
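And if I ever do need a genuine slurp, I gather the idiomatic way is to
localize $/ rather than concatenating line by line -- a minimal sketch,
reusing $inputfile from above:

    my $doc = do {
        local $/;    # undef $/ makes the next read grab the whole file
        open my $fh, '<', $inputfile
            or die qq(Couldn't open $inputfile: $!\n);
        <$fh>;
    };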
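Finally, should the parser route ever become necessary, I gather
HTML::TokeParser (a friendlier front end to HTML::Parser) would make the
"teaching" fairly painless. A rough, untested sketch that drops <u> tags
and any <br> carrying a wp attribute, and passes everything else through
to the OUT handle from above -- assuming the WordPerfect markers really
are ordinary attributes, which I haven't verified:

    use HTML::TokeParser;

    my $p = HTML::TokeParser->new($inputfile)
        or die qq(Couldn't open $inputfile: $!\n);

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S') {        # start tag: [$type, $tag, $attr, $attrseq, $text]
            my ($tag, $attr, $text) = @{$t}[1, 2, 4];
            next if $tag eq 'u';                         # Netscape-unfriendly <u>
            next if $tag eq 'br' && exists $attr->{wp};  # WordPerfect sludge
            print OUT $text;
        }
        elsif ($type eq 'E') {     # end tag: [$type, $tag, $text]
            next if $t->[1] eq 'u';
            print OUT $t->[2];
        }
        elsif ($type eq 'T') {     # text: [$type, $text, $is_data]
            print OUT $t->[1];
        }
        # comments, declarations, and processing instructions fall
        # through and are dropped
    }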