I have used this little routine to strip HTML.  It might be inefficient, I
don't know.

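Assuming the HTML has been loaded into the variable $html:

$html =~ s/\n//g;              # join everything onto one line
$html =~ s/>/>\n/g;            # break after every '>', one tag per piece
@html = split(/\n/, $html);
foreach (@html)
{
    s/<.*>//;                  # drop the single tag ending this piece
    $newhtml .= $_;
}
print $newhtml;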
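
The split is only there to stop <.*> from running from the first < to the last >. A negated character class should do the same job in a single substitution. This is a minimal sketch, not a full HTML parser: strip_tags is a made-up name, and it assumes no attribute value contains a literal '>'.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the routine above.
# [^>]* can never run past the first '>', so each tag is matched on its
# own instead of everything from the first '<' to the last '>'.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

my $line = 'this is a <font size="2">large word</font>in size 2';
print strip_tags($line), "\n";   # prints: this is a large wordin size 2
```

Because [^>] also matches a newline, a tag that wraps across lines is removed without joining the lines first.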

Agustin Rivera

----- Original Message -----
From: "Etienne Marcotte" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, November 13, 2001 11:44 AM
Subject: YARQ (Yet Another Regexp Question)


> I saw somewhere on the web a good regexp for removing html tags. Can't
> re-find it and it needed some minor mods.
>
> Let's say the $line is 'this is a <font size="2">large word</font>in
> size 2';
>
> I played around a little, but it always removed everything between the
> first < and the last > (and I know the tutorial on the web said how to
> avoid this)
>
> I'd like to make something like this (I know this one's not good, but
> please help place the parentheses, [] and {} :)
>
>    .*     < (.*) \s    .*    >     .*     </  \1  >    .*
> this is a < font    size="2" > large word </ font > in size 2
>
> the line above shows what each part is meant to match...
>
> thanks for help...
>
> And also, is there a way to specify a list of allowed tags? Or a list of
> disallowed tags?
> Like if the (.*) is foo or bar, delete it; keep it if it's something else...
>
> I don't think it's clear, but I'll try to help if you need more details
> on what I'm trying to accomplish
>
> Etienne
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
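
On the allowed/disallowed list question above: one possible approach is to capture the tag name and decide in the replacement, using the /e modifier. This is only a sketch under stated assumptions. The names filter_tags and %allowed are hypothetical, the tag list is an example, and it assumes no attribute value contains a literal '>'.

```perl
use strict;
use warnings;

# Hypothetical allow-list; these tag names are examples only.
my %allowed = map { $_ => 1 } qw(b i em strong);

# Keep a tag (opening or closing) if its name is in %allowed, otherwise
# delete it. Anything that does not look like <name ...> or </name>,
# such as a comment, is left untouched.
sub filter_tags {
    my ($html) = @_;
    $html =~ s{(<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>)}
              { $allowed{lc $2} ? $1 : '' }ge;
    return $html;
}

print filter_tags('a <b>bold</b> and <font size="2">large</font> word'), "\n";
# prints: a <b>bold</b> and large word
```

For a list of disallowed tags instead, the same pattern works with the test flipped: delete when the name is in the hash and keep the tag otherwise.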


