On Thursday, Mar 20, 2003, at 11:26 US/Pacific, Kipp, James wrote:


I'm saying it could be bgcolor="COLOR" or bgcolor=COLOR

Yes I realize. I believe drieux's solution, or an adaptation of it, is what you need

note: I do subs because it is easier for me to 'loop on them' and if they are worth it, they get stuffed in a perl module somewhere...

[..]
        #------------------------
        #
        sub un_colour {
                my ($line) = @_;

                $line =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi ;

                $line;
        } # end of un_colour

the usage would be


my $new_html_text = un_colour($html_text);

Or you could just use the line itself.

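If it helps to break out the sequence (note the added 'x' flag, which is what lets the regex carry white space and comments)

        s/\s*           # zero or more white space before
          bgcolor=      # the specific text
          ("?)          # first conditional group - an optional "
          ([^">\s]+)    # middle group - the colour value
          ("?)          # third conditional group - an optional "
        //gix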

since the middle element needs to guard against

        a. "
        b. >
        c. white space

Note that we are looking for one or more characters of the 'class' [^">\s] - or in English

        not "           ::   let the 3rd group grab this
        not >           ::   the end of tag token
        not white space ::   the end of attribute delimiter

since we are looking for the set of characters
that are 'not delimiters' - perchance the bass-end-akward
way of making a set....

since <COLOR> in this context can be either:

        a. a sequence of alpha characters
        b. a sequence of hex digits preceded by a #

I figured it would be easier to NOT go with
the more complex regex that would need to note
that 'if preceded by a #, then must be hex digits...'
Yeech, way too much work on that side of the trail.
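
For the curious, here is roughly what that more complex regex might look like. This is only a sketch under my own assumptions: the name un_colour_strict is made up for illustration, the backreference \1 insists that the closing quote match the opening one, and the lookahead insists that the value ends at white space or at the > of the tag. It is not the version recommended in this message.

```perl
use strict;
use warnings;

# Hypothetical stricter variant (not the one recommended above):
# the value must be either '#' followed by hex digits, or a run of letters.
sub un_colour_strict {
    my ($line) = @_;

    # ("?) ... \1   : closing quote must match the opening one (or both absent)
    # (?=[\s>])     : the value must be followed by white space or '>'
    $line =~ s/\s*bgcolor=("?)(?:\#[0-9a-f]+|[a-z]+)\1(?=[\s>])//gi;

    return $line;
}

print un_colour_strict('<body bgcolor="#CCCCFF" other="fred">'), "\n";
print un_colour_strict('<body bgcolor=#12zz>'), "\n";   # not hex, left alone
```

The trade-off is that anything that does not look like a colour is left in place, which may or may not be what you want when the goal is simply to strip the attribute.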

The test case code had to include BOTH the ">"
and the white space components so that it would
correctly parse not merely the specific cases
we are concerned about - but those cases in
their 'natural environment', e.g.

        <body bgcolor=red other="fred">
        <body bgcolor=red>
        <body bgcolor="red" other="fred">
        <body bgcolor="#CCCCFF" other="fred">
        <body bgcolor="#ccccff">
        ....
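
A small test script along these lines is one way to check the sub against those sample tags. It is a minimal sketch: the array names @cases and @results are mine, and the list only covers the five tags shown above, not whatever the "...." stands for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# un_colour as given earlier in this message
sub un_colour {
    my ($line) = @_;
    $line =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi;
    return $line;
}

# the sample tags listed above
my @cases = (
    '<body bgcolor=red other="fred">',
    '<body bgcolor=red>',
    '<body bgcolor="red" other="fred">',
    '<body bgcolor="#CCCCFF" other="fred">',
    '<body bgcolor="#ccccff">',
);

# strip the attribute from each tag
my @results = map { un_colour($_) } @cases;

# show input and output side by side
printf "%-40s => %s\n", $cases[$_], $results[$_] for 0 .. $#cases;
```

Every line should come back as either <body other="fred"> or <body>, with no stray quotes or white space left behind.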

remember that bgcolor is an attribute in a tag.

Or allow me to argue the defect in the initial idea

$line =~ s/ *bgcolor=("?)(.*)("?)//gi ;

the problem is that middle group - the "match zero
or more of anything"... A very GREEDY GRAB - since
it would take say

<body bgcolor="red" other="fred">

and make that

<body

since the sequence - with the round braces
delimiting the group matches:

/ bgcolor=(")(red" other="fred">)()/

is the most greedy grab possible: the .* runs to the
end of the line and leaves the optional third group
nothing to match. Which may have
been what you were noticing in the output.
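
If you want to see exactly what each pattern consumes, a throwaway script like this prints the middle group and the result of both substitutions. It is just a sketch for poking at the two regexes; the variable names ($tag, $middle, $greedy, $guarded) are mine.

```perl
use strict;
use warnings;

my $tag = '<body bgcolor="red" other="fred">';

# capture the three groups of the greedy pattern without modifying $tag
my ($q1, $middle, $q2) = $tag =~ / *bgcolor=("?)(.*)("?)/i;

# apply each substitution to its own copy of the tag
(my $greedy  = $tag) =~ s/ *bgcolor=("?)(.*)("?)//gi;
(my $guarded = $tag) =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi;

print "middle group  : [$middle]\n";
print "greedy result : [$greedy]\n";
print "guarded result: [$guarded]\n";
```

The brackets in the output make it easy to spot how far the .* ran compared with the [^">\s]+ class.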

So the simplest solution appeared to be to
work out the list of things that were 'delimiters'
and then allow anything in the middle group
that was not a delimiter...

HTH...


ciao drieux
