In article <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] (Douglass Franklin) writes:
>I'm trying to transform this html table to a colon-delimited flat-file
>database.  This is what I have so far:
>
>HTML:
><tr><td class='bodyblack' width='50%'><a
>href='http://jsearch.usajobs.opm.gov/summary.asp?OPMControl=IC9516'
>class='jobrlist'><font size='2'>ACCOUNTANT
></font></a></td><td class='bodyblack' width='40%'>$24,701.00
> - $51,971.00
></td><td class='bodyblack'>INDEFINITE</td></tr>
><tr><td class='bodyblack'>CONTINENTAL U.S., US</td>
></tr><td class='bodyblack' colspan='3'>&nbsp </td></tr>
>
>Database Record (wanted):
>Accountant:$24,701.00 - $51,971.00:INDEFINITE:CONTINENTAL U.S., US
>
>Regex I have:
>$jobrecord =~ ^(<tr>)(<td class='bodyblack' width='50%'>)(.+)(&nbsp
></td></tr>)$
>
>However, this doesn't seem to be working.  Please help.

What you pasted isn't Perl code; there are no regex delimiters.
I assume you had them and then went on to use $1 etc.
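
For reference, here is a minimal sketch of what the delimited form
might look like.  The shortened $jobrecord string and the $blob
variable are made up for illustration; I'm assuming the whole record
sits in one multi-line string.  m{...} saves escaping the / in </td>,
and /s lets . match newlines:

```perl
use strict;
use warnings;

# Hypothetical, shortened record for illustration only.
my $jobrecord = "<tr><td class='bodyblack' width='50%'>ACCOUNTANT\n"
              . "</td></tr><td class='bodyblack' colspan='3'>&nbsp </td></tr>";

my $blob;
# m{...} delimiters avoid escaping the / in </td>; /s lets . match newlines.
if ($jobrecord =~ m{^(<tr>)(<td class='bodyblack' width='50%'>)(.+)(&nbsp </td></tr>)$}s) {
    $blob = $3;    # everything between the opening cell and the &nbsp cell
    print "captured: $blob\n";
}
```

Even when this matches, $3 is one undivided chunk of markup rather
than separate fields, so it still needs further splitting.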

Let's take a look at the input.  What you want is the non-white space
content between > and <, ignoring the final element.  So:

my @fields = $jobrecord =~ />\s*([^<]*[^<\s])/g;
pop @fields;
$fields[0] = ucfirst lc $fields[0];
print join(":", map { s/\s+/ /g; $_ } @fields), "\n";

We ignore leading white space by skipping past \s*.
Then we get zero or more characters that aren't <, followed
by a character that's not < or white space, thereby
getting at least one character and avoiding trailing
white space.  Then we pop off the &nbsp, adjust the
case of the first element, and print everything out
separated by :, collapsing each run of white space (newlines
included) into a single space along the way.
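
To try the approach end to end, here is a self-contained sketch that
runs it on the sample record from the question.  The heredoc and the
$record variable are my additions for illustration.  Note that the
sketch collapses each run of white space with s/\s+/ /g, since the
continuation line in the sample begins with a space and replacing only
the newline would leave two spaces in the salary field:

```perl
use strict;
use warnings;

# Sample record copied from the question; the single-quoted heredoc
# keeps the $ signs in the salary from being interpolated.
my $jobrecord = <<'END_HTML';
<tr><td class='bodyblack' width='50%'><a
href='http://jsearch.usajobs.opm.gov/summary.asp?OPMControl=IC9516'
class='jobrlist'><font size='2'>ACCOUNTANT
</font></a></td><td class='bodyblack' width='40%'>$24,701.00
 - $51,971.00
</td><td class='bodyblack'>INDEFINITE</td></tr>
<tr><td class='bodyblack'>CONTINENTAL U.S., US</td>
</tr><td class='bodyblack' colspan='3'>&nbsp </td></tr>
END_HTML

# Capture the non-blank text that follows each > up to the next <.
my @fields = $jobrecord =~ />\s*([^<]*[^<\s])/g;
pop @fields;                          # drop the trailing &nbsp
$fields[0] = ucfirst lc $fields[0];   # ACCOUNTANT -> Accountant
s/\s+/ /g for @fields;                # newline plus indent -> one space

my $record = join ":", @fields;
print "$record\n";
```

This relies on the markup staying as regular as the sample; for
anything less predictable a real HTML parser would be the safer choice.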

-- 
Peter Scott
http://www.perldebugged.com
