Reedick, Andrew wrote:
>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:python-[EMAIL PROTECTED] On Behalf Of Michel Bouwmans
>> Sent: Wednesday, April 09, 2008 3:38 PM
>> To: python-list@python.org
>> Subject: Stripping scripts from HTML with regular expressions
>>
>> Hey everyone,
>>
>> I'm trying to strip all script-blocks from a HTML-file using regex.
>>
>> I tried the following in Python:
>>
>> testfile = open('testfile')
>> testhtml = testfile.read()
>> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
>
> Aha! \b is being interpreted as a backspace character:
>     \b    ASCII Backspace (BS)
>
> Always use a raw string with regexes:
>     regex = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)
>
> Your regex should now work.
>
Thanks! That did the trick. :)

I was trying to use HTMLParser but that choked on the script-blocks that
didn't contain comment-indicators. Guess I can now move on with this
script, thank you.

MFB
--
http://mail.python.org/mailman/listinfo/python-list
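
For anyone finding this thread later, here is a minimal sketch of the difference between the two patterns quoted above. The `sample_html` string is a made-up stand-in for the contents of 'testfile', and the variable names are only illustrative; the two patterns themselves are the ones from the thread.

```python
import re

# Hypothetical sample input standing in for the contents of 'testfile'.
sample_html = (
    "<html><head>\n"
    "<script type=\"text/javascript\">\n"
    "var x = 1;\n"
    "</script>\n"
    "</head><body><p>Hello</p>\n"
    "<script>alert('hi');</script>\n"
    "</body></html>"
)

# In a normal string literal, '\b' becomes the single backspace character
# (0x08) before the re module ever sees the pattern.
plain_pattern = '<script\b[^>]*>(.*?)</script>'

# In a raw string literal the backslash is kept, so the re module receives
# \b and treats it as a word-boundary assertion.
raw_pattern = r'<script\b[^>]*>(.*?)</script>'

assert '\x08' in plain_pattern
assert '\x08' not in raw_pattern

plain_regex = re.compile(plain_pattern, re.DOTALL)
raw_regex = re.compile(raw_pattern, re.DOTALL)

# The plain-string pattern needs a literal backspace after "<script",
# so it matches nothing and the text comes back untouched.
unchanged = plain_regex.sub('', sample_html)

# The raw-string pattern removes both script blocks.
stripped = raw_regex.sub('', sample_html)

print(unchanged == sample_html)
print(stripped)
```

Note that this pattern is case-sensitive as written; matching tags written as `<SCRIPT>` would additionally need the `re.IGNORECASE` flag.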