Clint wrote:

> I'm trying to scrape a section of html and don't see why my regexp
> stopped working this week. The relevant two-line sample section from
>
> http://www.srh.noaa.gov/data/forecasts/SCZ020.php
>
> is:
>
> <td><b>Barometer</b>:</td>
>  <td align="right" nowrap>30.22" (1023.1 mb)</td>
>
> (notice the space before <td align="right" nowrap>)
>
> My Perl segment (that was working until this week) is:
>
> sub barometer {
>     local $_ = shift;

Don't do that. The only thing using the default variable here is the
bare m{} match, and a lexical bound with =~ does the same job. It is a
bad habit to be routinely tweaking special variables. What benefit did
you expect from this construct?

>     m{<td><b>Barometer</b>:</td>\n\s<td align="right" nowrap>(.*?)"} || die "No barometer data";
>     return $1;
> }

You are being too rigid with your space matching. Are you sure there
will never be trailing space on the end of the first line? Or that the
space at the beginning of the second will always be single? It would
probably work just as well to seek the first line, then take the value
from the next line:

Greetings! E:\d_drive\perlStuff>perl -w
open IN, 'SCZ020.html' or die "could not open $!";
my $line = <IN>;
# stop at end of file instead of looping forever if the label is missing
while (defined $line and $line !~ /Barometer/) { $line = <IN> }
$line = <IN>;   # the value sits on the line after the label
if (defined $line and $line =~ />\s*([\d.]+)\s*"/) {
   print $1;
} else {
   print "no match on the line after Barometer\n";
}
^Z
30.15

> Is there something obvious in the html structure that I've missed here?

Probably. You're assuming that the presentation format will remain
constant.

> I appreciate any advice you might have.

Web scraping in general is pretty undependable. You have no real
guarantee that a given page will retain its existing format, and almost
any extraction technique will rely on a fairly constant presentation
format. It might be better to explore more deeply and see if the agency
makes the information available in a more direct data transmission
format.

Joseph
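
For what it is worth, the looser whitespace matching suggested above can also be done in a single pattern. This is only a rough sketch, not something tested against the live page: the sub taking the page text as a string, the undef return on failure, and the sample string are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical rewrite of barometer(): takes the page text as a string
# and returns the reading in inches, or undef when nothing matches.
sub barometer {
    my ($html) = @_;    # lexical copy, so $_ is left alone
    return $html =~ m{
        <b>Barometer</b>:\s*</td>   # label cell, optional space before the close tag
        \s*                         # any mix of spaces and newlines between the cells
        <td[^>]*>                   # value cell, whatever attributes it carries
        \s*([\d.]+)\s*"             # the number in front of the inch mark
    }xi ? $1 : undef;
}

# Sample with trailing space on the first line and extra indent on the second.
my $sample = qq{<td><b>Barometer</b>:</td>  \n   <td align="right" nowrap>30.22" (1023.1 mb)</td>\n};
my $reading = barometer($sample);
print defined $reading ? "$reading\n" : "no barometer data\n";
```

Returning undef rather than calling die leaves it to the caller to decide what a missing reading means. It still depends on the label and the value staying in adjacent cells, so it will break the next time the layout changes.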