RE: Stripping scripts from HTML with regular expressions

Reedick, Andrew Wed, 09 Apr 2008 13:13:02 -0700


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list@python.org
> Subject: Stripping scripts from HTML with regular expressions
> 
> Hey everyone,
> 
> I'm trying to strip all script blocks from an HTML file using regex.
> 
> I tried the following in Python:
> 
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
> 
> This strips away far more than just the script blocks. Am I missing
> something about Python's regex implementation, or am I doing
> something else wrong?
>


[Insert obligatory comment about using an HTML-specific parser
(HTMLParser) instead of regexes.]
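
For what it's worth, here is a rough sketch of what the parser route could look like. It is written against Python 3's html.parser module (in Python 2 the module is named HTMLParser), and the names ScriptStripper and strip_scripts are made up for this example. It re-emits the markup it sees and skips anything inside a script element; it is not a full HTML serializer, so treat it as a starting point only.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the parsed markup, skipping everything inside <script> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _emit(self, text):
        # Only collect output while we are outside a script element
        if not self.in_script:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing script is dropped
        if tag != 'script':
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self._emit('</%s>' % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit('&%s;' % name)

    def handle_charref(self, name):
        self._emit('&#%s;' % name)

    def handle_comment(self, data):
        self._emit('<!--%s-->' % data)

    def handle_decl(self, decl):
        self._emit('<!%s>' % decl)


def strip_scripts(html):
    """Hypothetical helper: return html with its script elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Calling strip_scripts(testhtml) would then take the place of regex.sub('', testhtml). The parser treats the body of a script element as opaque text, so markup-like strings inside the JavaScript do not confuse it the way they can confuse a regex.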

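Actually, your regex doesn't appear to strip anything: in a normal
(non-raw) string literal, '\b' is a backspace character rather than a
regex word boundary, so the pattern will not match an ordinary script
tag.  You probably saw stuff disappear because blaat != testhtml: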
        testhtml = testfile.read()
        result = regex.sub('', blaat)
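
A small sketch of how '\b' behaves in a plain string literal versus a raw one. The sample markup and the variable names are invented for illustration:

```python
import re

# Made-up sample input
html = '<p>keep</p><script type="text/javascript">alert(1);</script>'

# Plain literal: '\b' is a backspace character, so the pattern looks for
# "<script" followed by a literal backspace.
plain = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)

# Raw literal: r'\b' reaches the regex engine as a word boundary.
raw = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)

plain_result = plain.sub('', html)  # no match, so html comes back unchanged
raw_result = raw.sub('', html)      # the script block is removed
```

Writing regex patterns as raw strings avoids this class of surprise.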


Try this:

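import re

# Read the whole file; 'a.html' is just the sample file name used here.
with open('a.html') as testfile:
    testhtml = testfile.read()

# Raw string, so \b is a regex word boundary rather than a backspace;
# [^>]* also matches a bare <script> tag with no attributes.
regex = re.compile(r'<script\b[^>]*>.*?</script>', re.DOTALL | re.IGNORECASE)
result = regex.sub('', testhtml)

print(result)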




-- 
http://mail.python.org/mailman/listinfo/python-list
