From: WC -Sx- Jones <[EMAIL PROTECTED]>
> Comments about how to perform these
> 5 checks as ONE TEST are welcome -
> 
> /\s+\w+\<g.*\>\w+\s+/ REJECT Invalid HTML Spam vG
> /\s+\w+\<j.*\>\w+\s+/ REJECT Invalid HTML Spam vJ
> /\s+\w+\<k\w{3,}\>\w+\s+/ REJECT Invalid HTML Spam vK
> /\s+\w+\<y.*\>\w+\s+/ REJECT Invalid HTML Spam vY
> /\s+\w+\<z.*\>\w+\s+/ REJECT Invalid HTML Spam vZ

Are you sure you really want to do it like this? These patterns will 
not reject most incorrect tags, and they do not even try to protect 
you from malformed HTML or from cross-site scripting.
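
As a rough illustration of both points, here is a sketch that folds the 
five checks into one pattern and then runs it over a few strings. The 
variable name $one_test and the first three sample strings are made up 
for this example; the last two samples are in the style of the tags 
mentioned further down. It is written as plain Perl, so the exact 
behaviour inside your pcre table is an assumption.

```perl
use strict;
use warnings;

# The five original checks folded into one pattern: g, j, y and z share
# the ".*" form, while k keeps its stricter "\w{3,}" form.
my $one_test = qr/\s+\w+<(?:[gjyz].*|k\w{3,})>\w+\s+/;

my @samples = (
    ' foo<gxx>bar ',       # caught by the g branch
    ' foo<kbd>bar ',       # "kbd" is k plus only two word characters: passes
    ' foo<kxyz>bar ',      # caught by the k branch
    '<inpt type=text>',    # misspelled tag, not caught
    q{<img src="..." onMouseOver="window.open('http://...')">},  # not caught
);

for my $s (@samples) {
    printf "%-8s %s\n", ($s =~ $one_test ? 'REJECT' : 'pass'), $s;
}
```

The combined pattern is equivalent to the five separate ones, but the 
misspelled tag and the script-carrying attribute still get through, 
because the pattern only looks at tags glued between two words.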

You might want to do something like:

use HTML::Entities;

# Escape everything that is not an entity, a well-formed tag or a comment.
# With $AllowXHTML set, self-closing tags such as <br /> pass through too.
sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
                $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|<!--.*?-->|$)}
                         {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
                $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|<!--.*?-->|$)}
                         {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
}

first to "polish" the HTML and escape the stuff that does not look 
like proper HTML, and then use something like HTML::JFilter or 
HTML::Filter to get rid of nonexistent or dangerous tags and 
attributes.
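
To make the polishing step easier to follow, here is a stripped-down 
sketch of the same idea with the token pattern spelled out under /x. 
The names $token, escape_text and polish_html are made up for this 
example, and escape_text is only a stand-in for HTML::Entities that 
handles &, < and > and nothing else, so treat it as an illustration 
rather than a replacement for the sub above.

```perl
use strict;
use warnings;

# One token that may pass through untouched: an entity, an opening tag
# with optional attributes, a closing tag, or a comment.
my $token = qr{
      &\w+; | &\#\d+;                       # named or numeric entity
    | <\w+                                  # opening tag name
      (?: \s+ \w+                           # attribute name
          (?: \s*=\s*                       # optional value: bare or quoted
              (?: [^"'><\s]+ | (?:'[^']*')+ | (?:"[^"]*")+ )
          )?
      )*
      \s* /? >
    | </\w+>                                # closing tag
    | <!--.*?-->                            # comment
}x;

# Stand-in for HTML::Entities: escape only the three markup characters.
sub escape_text {
    my ($text) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $text =~ s/([&<>])/$ent{$1}/g;
    return $text;
}

sub polish_html {
    my ($str) = @_;
    # The text in front of each token is escaped; the token is kept as is.
    $str =~ s{(.*?)($token|$)}{ escape_text($1) . $2 }gem;
    return $str;
}

# A tag whose quoted value contains > and < survives unchanged.
print polish_html(q{<input type=text name=foo value="100>10 and 20<x">}), "\n";
# Stray markup characters in running text are escaped.
print polish_html('if a < b && c > d then <b>bold</b>'), "\n";
```

Note that this step only decides what looks like a well-formed tag. A 
misspelled or dangerous tag is still a well-formed tag, which is why 
the filtering step is needed afterwards.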
 
> I am not interested in a Perl module
> as the "pcre" environment I am using
> would require huge CPU eating filters.

Doesn't look like it, but if you do happen to need it under Windows 
you may find both the polishing and the filtering in the 
http://Jenda.Krynicky.cz/#Jenda.Rex COM object.

> Maybe something like this working/production code:
> 
> /\s+\w+\<(?=g|j|y|z).*\>\w+\s+/ REJECT Invalid HTML
> 
> But I am not sure how to handle the
> K (kbd) which ultimately could be valid...

You should also be able to handle things like this:
        <input type=text name=foo value="100>10 and 20<x">
and
        <inpt ...>

and most probably be able to remove these

        <img src="..." onMouseOver="window.open('http://...')">
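
A minimal sketch of what the filtering step could look like on input 
that has already been polished. This is not the HTML::JFilter or 
HTML::Filter interface; the %allowed table and the subs filter_tag and 
filter_html are hypothetical names invented for this example, and the 
whitelist itself is just a guess at what you might want to permit.

```perl
use strict;
use warnings;

# Hypothetical whitelist: tag name => attributes allowed on that tag.
my %allowed = (
    b     => [],
    kbd   => [],
    img   => [qw(src alt)],
    input => [qw(type name value)],
);

sub filter_tag {
    my ($name, $attrs) = @_;
    my $ok = $allowed{lc $name} or return '';   # unknown tag: drop it
    my %keep = map { $_ => 1 } @$ok;
    my $out = '';
    # Walk the attributes one by one, keeping only whitelisted names.
    while ($attrs =~ /\G\s+(\w+)(?:\s*=\s*("[^"]*"|'[^']*'|[^"'><\s]+))?/gc) {
        my ($attr, $value) = ($1, $2);
        next unless $keep{lc $attr};            # drops onMouseOver and the like
        $out .= defined $value ? " $attr=$value" : " $attr";
    }
    return "<$name$out>";
}

sub filter_html {
    my ($html) = @_;
    # Opening tags: rebuild each one from its whitelisted attributes.
    $html =~ s{<(\w+)((?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'><\s]+))?)*)\s*/?>}
              {filter_tag($1, $2)}ge;
    # Closing tags: keep only those whose name is whitelisted.
    $html =~ s{</(\w+)>}{ $allowed{lc $1} ? "</$1>" : '' }ge;
    return $html;
}

print filter_html(q{<img src="..." onMouseOver="window.open('http://...')">}), "\n";
print filter_html('<inpt type=text>hello</inpt>'), "\n";
print filter_html(q{<input type=text name=foo value="100>10 and 20<x">}), "\n";
```

With this table the img tag keeps only its src attribute, the 
misspelled inpt tag disappears while its text stays, and the input tag 
with the tricky quoted value comes out unchanged. The sketch drops the 
trailing slash of self-closing tags, which may or may not be what you 
want.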

Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
        -- Terry Pratchett in Sourcery

