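Roman,

Your re works for me. I suspect you have tags spanning lines, which happens more often than not. If so, processing line by line doesn't work, because a tag that opens on one line and closes on the next is never matched. You need to catch the tags across the whole text, like this: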
>>> text = re.sub ('<(.|\n)*?>', '', text)

If your text is reasonably small I would recommend this solution. Otherwise you might want to take a look at SE, which is a stream editor that does the buffering for you:

http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')
>>> print Tag_Stripper (text)
(... your text without tags ...)

The Tag_Stripper is made up of two regexes. The second one catches comments, which may nest tags. The first expression alone would also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example "re.sub ('<(.|\n)*?>', '', text)" above would misperform in this respect.

Your Tag_Stripper takes input from files directly:

>>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')
'name_of_output_file'

Or if you want to view the output:

>>> Tag_Stripper ('name_of_file.htm', '')
(... your text without tags ...)

If you want to keep the definitions for later use, do this:

>>> Tag_Stripper.save ('[your_path/]tag_stripper.se')

Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can make it simply by naming the file:

>>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')

You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (the HTML character entities), you'd simply add the name of the file that defines the ampersand replacements:

>>> Tag_Stripper = SE.SE ('tag_stripper.se htm2iso.se')

'htm2iso.se' comes with the SE package, ready to use and as an example for writing one's own replacement sets.

Frederic

----- Original Message -----
From: "Simon Forman" <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Friday, August 25, 2006 7:09 AM
Subject: Re: RE Module

> Roman wrote:
> > I am trying to filter a column in a list of all html tags.
>
> What?
>
> > To do that, I have setup the following statement.
> >
> > row[0] = re.sub(r'<.*?>', '', row[0])
> >
> > The results I get are sporatic. Sometimes two tags are removed.
> > Sometimes 1 tag is removed. Sometimes no tags are removed. Could
> > somebody tell me where have I gone wrong here?
> >
> > Thanks in advance
>
> I'm no re expert, so I won't try to advise you on your re, but it might
> help those who are if you gave examples of your input and output data.
> What results are you getting for what input strings.
>
> Also, if you're just trying to strip html markup to get plain text from
> a file, "w3m -dump some.html" works great. ;-)
>
> HTH,
> ~Simon
>
> --
> http://mail.python.org/mailman/listinfo/python-list
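
P.S. To make the two failure modes discussed in this reply concrete (tags spanning lines, and comments that nest tags), here is a minimal sketch using only the standard re module, not SE. The sample text and the function names are made up for illustration, and re.DOTALL is used in place of the (.|\n) idiom from the reply, which should match the same strings. Putting the comment alternative first in the pattern is one possible way to get the effect of the two-expression Tag_Stripper; it is an assumption of this sketch, not how SE works internally.

```python
import re

# Hypothetical sample input: the first tag spans two lines, and the
# comment on the last line contains a nested tag.
SAMPLE = (
    'Hello <a\n'
    '  href="x.htm">world</a>!\n'
    '<!-- hidden <b>bold</b> note -->Done.\n'
)

# The pattern from the original question; '.' does not match a newline.
LINE_TAG = re.compile(r'<.*?>')

# Same pattern with re.DOTALL, so '.' also matches newlines.
ANY_TAG = re.compile(r'<.*?>', re.DOTALL)

# The comment alternative is tried first, so a '>' inside a comment
# does not end the match early.
COMMENT_OR_TAG = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)


def strip_linewise(text):
    """Strip tags one line at a time (misses tags that span lines)."""
    lines = text.splitlines(keepends=True)
    return ''.join(LINE_TAG.sub('', line) for line in lines)


def strip_tags_only(text):
    """Strip tags over the whole text (still mishandles comments)."""
    return ANY_TAG.sub('', text)


def strip_tags_and_comments(text):
    """Strip whole comments first, then any remaining tags."""
    return COMMENT_OR_TAG.sub('', text)


if __name__ == '__main__':
    print(repr(strip_linewise(SAMPLE)))
    # 'Hello <a\n  href="x.htm">world!\nbold note -->Done.\n'
    print(repr(strip_tags_only(SAMPLE)))
    # 'Hello world!\nbold note -->Done.\n'
    print(repr(strip_tags_and_comments(SAMPLE)))
    # 'Hello world!\nDone.\n'
```

The first result shows the split tag surviving line-by-line processing, the second shows the comment being cut off at the '>' of its nested tag, and the third removes the comment as a unit. For real-world HTML a proper parser is usually more robust than any regex.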