> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list@python.org
> Subject: Stripping scripts from HTML with regular expressions
>
> Hey everyone,
>
> I'm trying to strip all script-blocks from a HTML-file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
>
> This strips far more away then just the script-blocks. Am I missing
> something from the regex-implementation from Python or am I doing
> something else wrong?
>
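
[Insert obligatory comment about using an HTML-specific parser (HTMLParser) instead of regexes.]

Actually, your regex didn't appear to strip anything. In a normal (non-raw) string literal, '\b' is a backspace character rather than a regex word boundary, so the pattern will not match an ordinary <script> tag. You probably saw stuff disappear because blaat != testhtml:

    testhtml = testfile.read()
    result = regex.sub('', blaat)

Try this:

    import re

    testfile = open('a.html')
    testhtml = testfile.read()
    # Raw string, so backslash escapes reach the regex engine untouched.
    # Note: \s+ requires whitespace after "script", so this only matches
    # script tags that carry at least one attribute.
    regex = re.compile(r'<script\s+.*?>(.*?)</script>', re.DOTALL)
    result = regex.sub('', testhtml)
    print(result)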
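
For what it's worth, here is roughly what the parser-based route hinted at above could look like. This is only a minimal sketch, assuming Python 3's html.parser module; the names ScriptStripper and strip_scripts are made up for illustration, and the output is re-assembled from parser events, so end tags come back lowercased rather than byte-for-byte identical to the input.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def _emit(self, text):
        # Only keep output while we are outside a script element.
        if not self.in_script:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: emit once, skip any script tag.
        if tag != 'script':
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self._emit('</%s>' % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit('&%s;' % name)

    def handle_charref(self, name):
        self._emit('&#%s;' % name)

    def handle_comment(self, data):
        self._emit('<!--%s-->' % data)

    def handle_decl(self, decl):
        self._emit('<!%s>' % decl)


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = ('<html><head><script type="text/javascript">var a = 1 < 2;'
              '</script></head><body><p>Hi &amp; bye</p>'
              '<script>alert(1)</script></body></html>')
    print(strip_scripts(sample))
```

Unlike the regex, this also drops a bare <script> tag with no attributes, and a stray '<' inside the script body doesn't confuse it, because the parser treats script content as plain data until the closing tag.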