RE: Another regex question

Scot Robnett Fri, 30 May 2003 03:35:20 -0700

I read perlfaq6. Several times, in fact, and specifically the section you're
pointing out. I'm one of those people who needs a good beating to understand
something, but once I "get it" I don't forget...just not there yet.


I basically want to extract the first several lines of the file, maybe to
discard or maybe to use later, so I guess that section needs to be matched
and stored into $1, $2, something like that. For now I want to skip
everything up to and including the line containing "Today's Headlines:".
After that line, all of the articles should be split on <p>.

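To make the goal concrete, here is a minimal sketch of the behaviour described above, not a tested solution. It assumes the whole cleaned-up file is already in one string; the sample text and the names $html, $preamble and @stories are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample standing in for the cleaned-up file contents.
my $html = join '',
    '<p align="center">PUBLICATION TITLE ',
    '<p align="center"><b>Today\'s Headlines:</b> ',
    '<p><b>Story One</b> first story text. ',
    '<p><b>Story Two</b> second story text. ';

my ($preamble, @stories);

# $1 captures everything through the headlines marker (up to the next
# bare <p>); $2 captures the articles. /s lets "." match newlines too.
if ($html =~ /\A(.*?Today's Headlines:.*?)(<p>.*)\z/s) {
    $preamble = $1;                             # kept for possible later use
    @stories  = grep { /\S/ } split /<p>/i, $2; # drop the empty leading field
}

print "$_\n" for @stories;
```

The split leaves an empty first field because the text starts with <p>, which is why the grep is there.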
HTML file immediately below, beginning of regex stuff underneath that.

####### HTML file #######
<html>
<head>
<title></title>
</head>
<body text="#000000" link="#0000ff" vlink="#551a8b" alink="#ff0000"
bgcolor="#ffffff">

<p align="center">PUBLICATION TITLE

<p align="center">(Publication Subtitle)

<br wp="br1"><br wp="br2">
<p align="center">
<strong>May 20, 2003</strong>

<br wp="br1"><br wp="br2">
<p align="center">(copyright notice)

<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<p align="center">SOME ADVERTISEMENT HEADING

<br wp="br1"><br wp="br2">
<p align="center">

<p>blah blah blah advertisement blah blah blah advertisement blah blah blah
advertisement blah blah blah advertisement blah blah blah advertisement.

<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<p align="center"><strong>Tip of the Day</strong>

<br wp="br1"><br wp="br2">
<p align="center"><strong>Tip of the Day Subject</strong>

<br wp="br1"><br wp="br2">
<p>blah blah blah tip of the day blah blah blah tip of the day blah blah
blah tip of the day blah blah blah tip of the day blah blah blah tip of the
day.

<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<p align="center"><strong>Today's Headlines:</strong>

<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<p><strong>Story Tag Line</strong> blah blah blah story blah blah blah story
blah blah blah story blah blah blah story blah blah blah story blah blah
blah story  blah blah blah story blah blah blah story blah blah blah story.

<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<p><strong>Story Tag Line</strong> blah blah blah story blah blah blah story
blah blah blah story blah blah blah story blah blah blah story blah blah
blah story  blah blah blah story blah blah blah story blah blah blah story.

<br wp="br1"><br wp="br2">
<br wp="br1"><br wp="br2">
<p><strong>Story Tag Line</strong> blah blah blah story blah blah blah story
blah blah blah story blah blah blah story blah blah blah story blah blah
blah story  blah blah blah story blah blah blah story blah blah blah story.

####### /HTML file #######



####### Regex kludge ######
#!/usr/bin/perl -w

use strict;
use CGI qw(:all);
use CGI::Carp qw(fatalsToBrowser);
$/ = "";

my $q = CGI->new;
my $action = $q->param('action') || '';   # avoid "uninitialized" warnings
my $infile = q(/path/to/file.html);
my $outfile = q(/path/to/file2.html);
my @articles = ();
if ($action eq "Display Articles") {
 cleanup();        # parens needed: subs are defined further down
 show_articles();
}

sub cleanup {
 open(my $in, '<', $infile) or die "Could not open $infile: $!\n";
 @articles = <$in>;                        # paragraph mode: one chunk per element
 close($in);
 open(my $out, '>', $outfile) or die "Could not open $outfile: $!\n";
 foreach my $element (@articles) {
  chomp($element);                         # strip the trailing blank lines
  $element =~ s/[\r\n]+/ /g;               # join wrapped lines without gluing words
  $element =~ s/<\/?html>//ig;             # will add own header/footer
  $element =~ s/<head>.*?<\/head>//is;     # will add own head content
  $element =~ s/<title>.*?<\/title>//is;   # will create own title
  $element =~ s/(?:<br wp="br\d">)+//ig;   # get rid of funky WP stuff
  $element =~ s/<body[^>]*>//ig;           # will add own body tag
  $element =~ s/<\/body>//ig;              # this will be in custom footer
  $element =~ s/strong>/b>/ig;             # <strong> to <b>
  $element =~ s/<\/?u>//ig;                # Netscape doesn't like <u>
  print $out "$element \n";                # print cleaned content to file
 }
 close($out);
}

sub show_articles {
 print_custom_header();
 open(my $newin, '<', $outfile) or die "Could not open $outfile: $!\n";
 while (<$newin>) {                        # read the cleaned file, not STDIN
  # Here is where I want to do the multiline regex
  # Match everything up to "Today's Headlines:", don't print
  # Split the articles on <p> and print them
 }
 close($newin);
 print_custom_footer();
}

sub print_custom_header { } # stuff
sub print_custom_footer { } # stuff

####### /Regex kludge ######
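
Since the kludge sets $/ = "" globally, the script reads in paragraph mode, which is part of why a single match spanning the whole file is awkward. The sketch below only illustrates the difference between paragraph mode and slurp mode. The sample text and the read_records helper are made up for illustration, and it reads from an in-memory filehandle (Perl 5.8 or later) so it runs without any files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: three paragraphs separated by blank lines.
my $text = "first para\nline two\n\n\nsecond para\n\nthird para\n";

# Read $text with the given input record separator and return the records.
sub read_records {
    my ($separator) = @_;
    local $/ = $separator;     # restored automatically when the sub returns
    open(my $fh, '<', \$text) or die "Could not open in-memory handle: $!\n";
    my @records = <$fh>;
    close($fh);
    return @records;
}

my @paragraphs = read_records("");     # paragraph mode: split on blank lines
my @whole      = read_records(undef);  # slurp mode: entire input in one record

printf "%d paragraph record(s), %d slurped record(s)\n",
    scalar @paragraphs, scalar @whole;
```

Using local inside the sub keeps the change to $/ from leaking into the rest of the script, unlike the global assignment in the kludge.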

