Hi, I'm in the process of cleaning some HTML files with BeautifulSoup and I want to remove all traces of the tables. Here is the part of the code that deals with tables:

    def remove(soup, tagname):
        for tag in soup.findAll(tagname):
            contents = tag.contents
            parent = tag.parent
            tag.extract()
            for tag in contents:
                parent.append(tag)

    remove(soup, "table")
    remove(soup, "tr")
    remove(soup, "td")

It works fine but leaves an empty table structure at the end of the soup, like this:

    <table>
    <tr>
    <td></td>
    </tr>
    <tr>
    <td></td>
    </tr>
    <tr>
    ...

The extract method of BeautifulSoup seems to extract only what is inside the tags. So I'm just looking for a quick and dirty way to remove this table structure at the end of the documents. I'm thinking of using re, but there must be a way to do it with BeautifulSoup; maybe I'm missing something.

Another thing makes me wonder. This code:

    for script in soup("script"):
        soup.script.extract()

works fine and removes the script tags, but:

    for table in soup("table"):
        soup.table.extract()

raises:

    AttributeError: 'NoneType' object has no attribute 'extract'

Oh, and BTW, when I extract script tags this way, the whole tag is gone, which is what I want; it doesn't remove only the content of the tag.

Thanks in advance
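
P.S. One possible explanation for the AttributeError, sketched below under assumptions that are not confirmed by the files in question: soup("table") builds its list once, while soup.table looks up the first table still in the tree on every pass. If the tables are nested, extracting the outer one takes the inner ones with it, so a later pass finds nothing and soup.table is None. Script tags are never nested, which would explain why the same loop works for them. The sample HTML and the helper names (extract_first_each_time, unwrap_all) are made up for illustration, and the sketch uses the newer bs4 package (find_all, unwrap) rather than the findAll-era release used above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: two sibling scripts and one nested table.
HTML = """
<html><body>
<script>var a = 1;</script>
<table><tr><td>outer
  <table><tr><td>inner</td></tr></table>
</td></tr></table>
<script>var b = 2;</script>
</body></html>
"""


def extract_first_each_time(soup, tagname):
    """Repeat the loop from the post; return the error message, or None."""
    try:
        for _ in soup(tagname):            # the list of matches is built once
            # soup.find(tagname) is what soup.<tagname> does: it returns the
            # first match still in the tree, or None if there is none left.
            soup.find(tagname).extract()
    except AttributeError as exc:
        return str(exc)
    return None


def unwrap_all(soup, tagnames):
    """Remove the listed tags but keep their children in place."""
    for tag in soup.find_all(tagnames):
        tag.unwrap()                       # replace the tag with its contents
    return soup


# Sibling scripts: each pass finds a script to extract, so no error.
script_error = extract_first_each_time(BeautifulSoup(HTML, "html.parser"), "script")

# Nested tables: the first pass removes the outer table and the inner one
# along with it, so the second pass calls .extract() on None.
table_error = extract_first_each_time(BeautifulSoup(HTML, "html.parser"), "table")

# Stripping the table markup while keeping the cell text.
cleaned = unwrap_all(BeautifulSoup(HTML, "html.parser"), ["table", "tr", "td"])

print(script_error)
print(table_error)
print(cleaned.find("table"))
```

If this is indeed the cause, unwrapping each tag in place (rather than extracting it and appending its contents to the parent) might also avoid the leftover structure at the end of the soup, since the children stay where they were instead of being moved to the end.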