On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies <jsp...@sun.ac.za> wrote:

Gabriel Genellina wrote:
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jsp...@sun.ac.za> wrote:

How do I get BeautifulSoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the <td>'s with plain text?

Hard to tell if we don't see what's inside those <td>'s - please provide at least a few rows of the original HTML table.

Thanks for your reply. Here are a few lines:

<!------- Rule 1 ------->
<tr style="background-color: #ffffff"><td class=normal>2</td><td><img src=icons/usrgroup.png>&nbsp;All us...@any<br><td><im$ </td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$
&nbsp;</td><td>&nbsp;</td></tr>

I *think* I finally understand what you want (your previous example above confused me).
If you want Rule 1 to generate a line like this:

2,All us...@any,<im$,Any,clientencrypt,,

this code should serve as a starting point:

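# BeautifulSoup 3.x import; `html` is assumed to already hold the page source
from BeautifulSoup import BeautifulSoup
import csv

lines = []
soup = BeautifulSoup(html)
for table in soup.findAll("table"):
    for row in table.findAll("tr"):
        line = []
        for cell in row.findAll("td"):
            # join every text node in the cell, flattening newlines and &nbsp;
            text = ' '.join(
                s.replace('\n', ' ').replace('&nbsp;', ' ')
                for s in cell.findAll(text=True)).strip()
            line.append(text)
        lines.append(line)

# "wb" is the mode the csv module expects on Python 2
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(lines)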

cell.findAll(text=True) returns a list of all text nodes inside a <td> cell; I replace each \n and &nbsp; in every text node with a space, join the nodes with spaces, and strip the result. lines is a list of lists (one inner list per table row, one entry per cell), as expected by the csv module used to write the output file.
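
If BeautifulSoup is not at hand, the same idea (collect the text nodes of each <td>, join them, hand the rows to csv) can be sketched with the standard library alone. This is not the code above, just a rough Python 3 equivalent built on html.parser; CellTextParser, table_to_rows and rows_to_csv are names made up for the illustration, and the sample row reuses only the complete cells from Rule 1. Unlike the BeautifulSoup version, it also collapses runs of whitespace inside a cell.

```python
import csv
import io
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as U+00A0
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            text = " ".join(self._cell).replace("\xa0", " ")
            # split()/join() drops newlines and collapses repeated spaces
            self._row.append(" ".join(text.split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()      # tolerate a missing </tr>
            self._row = []
        elif tag == "td":
            self._close_cell()     # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        # text outside any <td> is ignored
        if self._cell is not None:
            self._cell.append(data)


def table_to_rows(html):
    """Return a list of rows, each a list of plain-text cell values."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


def rows_to_csv(rows):
    """Render the rows as CSV text (csv.writer ends lines with \\r\\n)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


# Sample built from the complete cells of Rule 1 only
sample = (
    '<table><tr style="background-color: #ffffff"><td class=normal>2</td>'
    '<td><img src=icons/any.png>&nbsp;Any<br></td>'
    '<td><img src=icons/clientencrypt.png>&nbsp;clientencrypt</td>'
    '<td>&nbsp;</td></tr></table>'
)

if __name__ == "__main__":
    print(rows_to_csv(table_to_rows(sample)), end="")
```

To write a file instead of a string on Python 3, open it with open("output.csv", "w", newline="") rather than "wb", and pass that file object to csv.writer.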

--
Gabriel Genellina

