Jeremiah Dodds wrote:
On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti
<gabriel.rosse...@arimaz.com> wrote:
Hello everyone,
I am using beautiful soup to parse some HTML and I came across
something strange.
Here is an illustration:
>>> soup = BeautifulSoup(u'<div class="text">hello ça boume<br /></div>')
>>> soup
<div class="text">hello ça boume<br /></div>
>>> soup.find("div", "text")
<div class="text">hello ça boume<br /></div>
>>> soup.find("div", "text").string
>>> soup.find("div", "text").next
u'hello \xe7a boume'
Why does soup.find("div", "text").string not give me the string?
Is it because there is a <br/>?
IIRC, yes it is, and there's not much you can do about it other than
using .next.string or .contents[0], or stripping out the <br> tags. See
http://www.crummy.com/software/BeautifulSoup/documentation.html ,
particularly the "Removing Elements" and "string" sections.
OK, thanks. I also found that I can do this:
soup.find(text=lambda t: isinstance(t, basestring))
or this:
soup.find(text=True)
It seems faster than doing this:
[br.extract() for br in soup.findAll("br")]
soup.string
but I may be wrong.
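
To make the difference between these approaches concrete, here is a toy model of the behaviour discussed in this thread. It is a sketch written for Python 3 with the standard library only, not BeautifulSoup's own code: the names Tag, first_text and remove_tags are hypothetical, and the rule in the string property (text is returned only when the tag has exactly one child and that child is text) is an assumption that matches the session shown earlier.

```python
# Toy stand-in for a parsed tree. Tag, first_text and remove_tags are
# hypothetical names for illustration, not part of BeautifulSoup.

class Tag(object):
    def __init__(self, name, contents=None):
        self.name = name
        self.contents = list(contents or [])  # children: Tag objects or str

    @property
    def string(self):
        # Assumed rule: only defined when the tag has exactly one child
        # and that child is text; otherwise None.
        if len(self.contents) == 1 and isinstance(self.contents[0], str):
            return self.contents[0]
        return None


def first_text(tag):
    """Depth-first search for the first text node; the tree is not modified."""
    for child in tag.contents:
        if isinstance(child, str):
            return child
        found = first_text(child)
        if found is not None:
            return found
    return None


def remove_tags(tag, name):
    """Drop every descendant tag called `name`, modifying the tree in place."""
    kept = []
    for child in tag.contents:
        if isinstance(child, Tag):
            if child.name == name:
                continue
            remove_tags(child, name)
        kept.append(child)
    tag.contents = kept


text = u"hello \xe7a boume"
div = Tag("div", [text, Tag("br")])

print(div.string)               # None: the div has two children, text and br
print(div.contents[0] == text)  # True: the first child is the text
print(first_text(div) == text)  # True: found by searching, br left in place
print(len(div.contents))        # 2

remove_tags(div, "br")
print(div.string == text)       # True, but only because br is now gone
```

The practical difference the sketch highlights is that the search leaves the tree untouched, while stripping the br elements changes it for every later lookup. It says nothing about which is faster in BeautifulSoup itself; that would need to be measured.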
Thanks again!
Gabriel