On Thursday, Mar 20, 2003, at 11:26 US/Pacific, Kipp, James wrote:
I'm saying it could be bgcolor="COLOR" or bgcolor=COLOR
Yes I realize. I believe drieux's solution, or an adaptation of it, is what you need
note: I do subs because it is easier for me to 'loop on them' and if they are worth it, they get stuffed in a perl module somewhere...
[..]
#------------------------
#
sub un_colour {
    my ($line) = @_;
    $line =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi ;
    $line;
} # end of un_colour
the usage would be
my $new_html_text = un_colour($html_text);
Or you could just use the line itself.
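As a rough sketch of what 'using the line itself' might look like, here the substitution is applied in place to each element of a list. The @html array and its two sample lines are made up for illustration, they are not from the original question.

```perl
use strict;
use warnings;

# hypothetical sample lines, only here to have something to chew on
my @html = (
    '<td bgcolor="#ffffff">cell</td>',
    '<p>no colour here</p>',
);

# s/// works on $_ by default, and 'for' aliases $_ to each
# element, so the array is modified in place
s/\s*bgcolor=("?)([^">\s]+)("?)//gi for @html;

print "$_\n" for @html;
```

Lines without a bgcolor attribute pass through untouched, so there is no need to test for a match first.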
If it helps to break out the sequence
s/\s*          # zero or more white space before
  bgcolor=     # the specific text
  ("?)         # first conditional group - "
  ([^">\s]+)   # middle group - anything that is not a delimiter
  ("?)         # third conditional group
//gix
since the middle element needs to guard against
a. "
b. >
c. white space
Note that we are looking for one or more characters of the 'class' [^">\s] - or in english
not "           :: let the 3rd group grab this
not >           :: the end of tag token
not white space :: the end of attribute delimiter
since we are looking for the set of characters that are 'not delimiters' - perchance the bass-end-akward way of making a set....
since <COLOR> in this context is both:
a. a sequence of alpha characters
b. a # preceded hex digit sequence
I figured it would be easier to NOT go with the more complex regex that would need to note that 'if preceded by a #, then must be hex digits...' Yeech, way too much work on that side of the trail.
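For the curious, the stricter route might look something like the sketch below. This is NOT the solution proposed in this message: the name un_colour_strict, the choice of 3 or 6 hex digits, and the lookahead for the end of the attribute are all assumptions made for illustration.

```perl
use strict;
use warnings;

# hypothetical stricter variant: the value must be either a run
# of letters, or a '#' followed by 6 or 3 hex digits
sub un_colour_strict {
    my ($line) = @_;
    # \1 insists the closing quote (or lack of one) matches the
    # opening one; (?=[\s>]) insists the value really ends there
    $line =~ s/\s*bgcolor=("?)(?:[a-z]+|#(?:[0-9a-f]{6}|[0-9a-f]{3}))\1(?=[\s>])//gi;
    return $line;
}

print un_colour_strict('<body bgcolor="#CCCCFF" other="fred">'), "\n";
print un_colour_strict('<body bgcolor=#12>'), "\n";
```

The trade-off is that anything that does not look like a legal colour is left in place rather than stripped, which may or may not be what you want.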
The test case code had to include BOTH the ">" and the white space components so that it would correctly parse not merely the specific cases we are concerned about - but those cases in their 'natural environment', e.g.
<body bgcolor=red other="fred">
<body bgcolor=red>
<body bgcolor="red" other="fred">
<body bgcolor="#CCCCFF" other="fred">
<body bgcolor="#ccccff">
....
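A minimal sketch of a test loop over those sample tags, reusing un_colour as given earlier in this message. The @cases array and the printf layout are just one way to arrange it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# un_colour as given earlier in this message
sub un_colour {
    my ($line) = @_;
    $line =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi;
    return $line;
}

# the sample tags, quoted and unquoted, with and without
# a following attribute
my @cases = (
    '<body bgcolor=red other="fred">',
    '<body bgcolor=red>',
    '<body bgcolor="red" other="fred">',
    '<body bgcolor="#CCCCFF" other="fred">',
    '<body bgcolor="#ccccff">',
);

# show each tag next to what is left after the substitution
for my $tag (@cases) {
    printf "%-40s => %s\n", $tag, un_colour($tag);
}
```

Each case should come back as either <body> or <body other="fred">, with the bgcolor attribute and its leading white space gone.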
remember that bgcolor is an attribute in a tag.
Or allow me to argue the defect in the initial idea
$line =~ s/ *bgcolor=("?)(.*)("?)//gi ;
the problem is that middle group - the 'match zero or more of anything'. A very GREEDY GRAB - since it would take say

<body bgcolor="red" other="fred">

and make that

<body

since the sequence - with the round braces delimiting the groups - matches:

/ bgcolor=(")(red" other="fred">)()/

which is the most greedy grab possible: the .* runs all the way to the end of the line, and the optional third group is left matching nothing. Which may have been what you were noticing in the output.
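If you want to see the difference for yourself, a small side-by-side sketch along these lines should do. The variable names $greedy and $guarded are only labels picked for this example.

```perl
use strict;
use warnings;

my $tag = '<body bgcolor="red" other="fred">';

# the initial idea: .* as the middle group
(my $greedy = $tag) =~ s/ *bgcolor=("?)(.*)("?)//gi;

# the middle group restricted to 'not a delimiter' characters
(my $guarded = $tag) =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi;

print "original: $tag\n";
print "greedy  : $greedy\n";
print "guarded : $guarded\n";
```

Running both against the same tag shows how much of the line each pattern removes.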
So the simplest solution appeared to be to work out the list of things that were 'delimiters' and then allow anything in the middle group that was not a delimiter...
HTH...
ciao drieux