Thanks for your help. One thing I didn't mention: before the statement row[0] = re.sub(r'<.*?>', '', row[0]), I have the statement row[0]=re.sub(r'[^ 0-9A-Za-z\"\'\.\,[EMAIL PROTECTED](\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0]). Hence the line separators are already gone by the time the tags are stripped. You mentioned that the size of the string could be a factor. If so, what is the maximum size before I see problems?
Thanks again

Anthra Norell wrote:
> Roman,
>
> Your re works for me. I suspect you have tags spanning lines, a thing you get
> more often than not. If so, processing linewise doesn't work. You need to
> catch the tags like this:
>
> >>> text = re.sub ('<(.|\n)*?>', '', text)
>
> If your text is reasonably small I would recommend this solution. Else you
> might want to take a look at SE, which is a stream editor that does the
> buffering for you:
>
> http://cheeseshop.python.org/pypi/SE/2.2%20beta
>
> >>> import SE
> >>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')
> >>> print Tag_Stripper (text)
> (... your text without tags ...)
>
> The Tag_Stripper is made up of two regexes. The second one catches comments,
> which may nest tags. The first expression alone would also catch comments,
> but would mistake the '>' of the first nested tag for the end of the comment
> and quit prematurely. The example "re.sub ('<(.|\n)*?>', '', text)" above
> would misperform in this respect.
>
> Your Tag_Stripper takes input from files directly:
>
> >>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')
> 'name_of_output_file'
>
> Or if you want to view the output:
>
> >>> Tag_Stripper ('name_of_file.htm', '')
> (... your text without tags ...)
>
> If you want to keep the definitions for later use, do this:
>
> >>> Tag_Stripper.save ('[your_path/]tag_stripper.se')
>
> Your definitions are now saved in the file 'tag_stripper.se'. You can edit
> that file. The next time you need a Tag_Stripper you can make it simply by
> naming the file:
>
> >>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')
>
> You can easily expand the capabilities of your Tag_Stripper. If, for
> instance, you want to translate the ampersand escapes ( etc.) you'd simply
> add the name of the file that defines the ampersand replacements:
>
> >>> Tag_Stripper = SE.SE ('tag_stripper.se htm2iso.se')
>
> 'htm2iso.se' comes with the SE package ready to use and as an example for
> writing one's own replacement sets.
>
>
> Frederic
>
>
> ----- Original Message -----
> From: "Simon Forman" <[EMAIL PROTECTED]>
> Newsgroups: comp.lang.python
> To: <python-list@python.org>
> Sent: Friday, August 25, 2006 7:09 AM
> Subject: Re: RE Module
>
>
> > Roman wrote:
> > > I am trying to filter a column in a list of all html tags.
> >
> > What?
> >
> > > To do that, I have set up the following statement.
> > >
> > > row[0] = re.sub(r'<.*?>', '', row[0])
> > >
> > > The results I get are sporadic. Sometimes two tags are removed.
> > > Sometimes 1 tag is removed. Sometimes no tags are removed. Could
> > > somebody tell me where I have gone wrong here?
> > >
> > > Thanks in advance
> >
> > I'm no re expert, so I won't try to advise you on your re, but it might
> > help those who are if you gave examples of your input and output data.
> > What results are you getting for what input strings?
> >
> > Also, if you're just trying to strip html markup to get plain text from
> > a file, "w3m -dump some.html" works great. ;-)
> >
> > HTH,
> > ~Simon
> >
> > --
> > http://mail.python.org/mailman/listinfo/python-list
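
For anyone following the thread, here is a minimal sketch of the two points made in the quoted message: a tag that spans lines is missed by r'<.*?>', and a comment that nests a tag needs its own pattern applied first. The sample string and the helper name strip_tags are made up for illustration, and re.DOTALL is swapped in for the '(.|\n)' alternation; it makes '.' match newlines as well, so it should behave the same way on this input. It uses only the standard re module, not SE.

```python
import re

# Hypothetical sample: one tag spans two lines, and a comment nests a tag.
text = 'Hello <a\nhref="x.html">world</a> <!-- note <b>bold</b> -->done'

# By default '.' stops at a newline, so the split <a ...> tag survives.
linewise = re.sub(r'<.*?>', '', text)

# With re.DOTALL, '.' also matches newlines (same idea as '(.|\n)').
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
TAG = re.compile(r'<.*?>', re.DOTALL)

def strip_tags(s):
    """Remove comments first, then whatever tags remain."""
    return TAG.sub('', COMMENT.sub('', s))

print(repr(linewise))            # split tag and comment tail are left behind
print(repr(TAG.sub('', text)))   # tags gone, but ' -->' of the comment remains
print(repr(strip_tags(text)))    # comments and tags both removed
```

If an earlier filter has already removed every newline, the DOTALL flag makes no difference, but the comment pattern still matters whenever a comment contains a '>'.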