Re: Matching Over Linefeed and Space

R. Joseph Newton Sun, 02 Nov 2003 16:29:25 -0800

Clint wrote:

> I'm trying to scrape a section of html and don't see why my regexp
> stopped working this week. The relevant two-line sample section from
>
> http://www.srh.noaa.gov/data/forecasts/SCZ020.php
>
> is:
>
>   <td><b>Barometer</b>:</td>
>   <td align="right" nowrap>30.22&quot; (1023.1 mb)</td>
>
> (notice the space before <td align="right" nowrap>)
>
> My Perl segment (that was working until this week) is:
>
> sub barometer {
>      local $_ = shift;


Don't do that.  The only thing using the default variable here is the bare
m{} below, and it is a bad habit to be routinely tweaking system variables.
A named lexical bound with =~ would be clearer.  What benefit did you expect
from this construct?
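
As a rough sketch of the alternative (not the original poster's code; the
name $html and the sample string are made up for illustration), the same sub
can take its argument into a named lexical and bind the match to it
explicitly, leaving $_ alone:

```perl
use strict;
use warnings;

# Hypothetical rewrite: the page text goes into a named lexical,
# and the match is bound to it explicitly with =~.
sub barometer {
    my ($html) = @_;
    $html =~ m{<td><b>Barometer</b>:</td>\n\s<td align="right" nowrap>(.*?)&quot;}
        or die "No barometer data";
    return $1;
}

# Sample built from the two lines quoted in the original message.
my $sample = qq{<td><b>Barometer</b>:</td>\n <td align="right" nowrap>30.22&quot; (1023.1 mb)</td>\n};
print barometer($sample), "\n";    # prints 30.22
```

The pattern itself is unchanged here, so it is still as strict about
whitespace as the original.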

>
>      m{<td><b>Barometer</b>:</td>\n\s<td align="right"
> nowrap>(.*?)&quot;} || die "No barometer data";
>      return $1;
> }

You are being too rigid with your space matching.  Are you sure there will
never be trailing space on the end of the first line?  Or that the space at
the beginning of the second will always be single?  It would probably work
just as well to seek the first line, then take the value from the next line:

Greetings! E:\d_drive\perlStuff>perl -w
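open IN, 'SCZ020.html' or die "could not open: $!";
my $line = <IN>;
# skip ahead to the Barometer label, stopping at end of file
while (defined $line and $line !~ /Barometer/) {$line = <IN>}
die "no Barometer line found" unless defined $line;
$line = <IN>;    # the value is on the following line
die "no line after Barometer" unless defined $line;
if ($line =~ />\s*([\d.]+)\s*&quot/) {
  print $1;
} else {
  print "$line did not contain match\n";
}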
^Z
30.15
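
If you would rather keep a single pattern over the whole page, one possible
sketch is to let \s* stand in for whatever whitespace turns up between the
two cells.  The variable names and the sample strings below are invented for
illustration; only the first sample mirrors what you posted.  It still
assumes the tag attributes stay exactly as they are today.

```perl
use strict;
use warnings;

# Tolerant pattern: \s* absorbs trailing blanks, CR/LF line endings,
# and any amount of indentation between the two cells.
my $barometer_re = qr{<td><b>Barometer</b>:</td>\s*<td align="right" nowrap>\s*([\d.]+)\s*&quot;};

# Made-up variations on the sample from the original message.
my @samples = (
    qq{<td><b>Barometer</b>:</td>\n <td align="right" nowrap>30.22&quot; (1023.1 mb)</td>},
    qq{<td><b>Barometer</b>:</td>  \r\n    <td align="right" nowrap>30.22&quot; (1023.1 mb)</td>},
    qq{<td><b>Barometer</b>:</td><td align="right" nowrap>30.22&quot; (1023.1 mb)</td>},
);

for my $html (@samples) {
    if ($html =~ $barometer_re) {
        print "matched: $1\n";
    } else {
        print "no match\n";
    }
}
```

The strict \n\s in the original pattern would fail on the second and third
samples, which is the kind of small change that could explain a regexp that
worked last week and not this one.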

> Is there something obvious in the html structure that I've missed here?

Probably.  You're assuming that the presentation format will remain constant.

>
> I appreciate any advice you might have.

Web scraping in general is pretty undependable.  You have no real guarantee
that a given page will retain its existing format, and almost any extraction
technique will rely on a fairly constant presentation format.  It might be
better to explore more deeply and see if the agency makes information
available in more direct data transmission formats.

Joseph

