Jeremiah Dodds wrote:


On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti <gabriel.rosse...@arimaz.com> wrote:

    Hello everyone,

    I am using beautiful soup to parse some HTML and I came across
    something strange.
    Here is an illustration:

    >>> soup = BeautifulSoup(u'<div class="text">hello ça boume<br /></div>')
    >>> soup
    <div class="text">hello ça boume<br /></div>
    >>> soup.find("div", "text")
    <div class="text">hello ça boume<br /></div>
    >>> soup.find("div", "text").string
    >>> soup.find("div", "text").next
    u'hello \xe7a boume'

    why does soup.find("div", "text").string not give me the string?
    Is it because there is a <br/>?


IIRC, yes it is, and there's not much you can do about it other than use .next.string or .contents[0], or strip out the <br> tags. See http://www.crummy.com/software/BeautifulSoup/documentation.html, particularly the "Removing Elements" and "string" sections.
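For anyone reading this later: .string is only set when a tag has a single child, and here the div has two (the text and the <br/>). Below is a minimal sketch of that behaviour. It assumes bs4 (the Python 3 successor to the BeautifulSoup 3 used in this thread) is installed, and it uses bs4's class_= and string= spellings rather than the positional "text" and text= forms shown here, so treat it as an illustration rather than the exact session above.

```python
from bs4 import BeautifulSoup

# Same markup as in the session above, parsed with the stdlib-backed parser.
soup = BeautifulSoup('<div class="text">hello ça boume<br /></div>', "html.parser")
div = soup.find("div", class_="text")

# The div has two children (a text node and <br/>), so .string is None.
print(len(div.contents), div.string)

# Two workarounds that leave the tree untouched.
first_child = div.contents[0]        # first child, which here is the text node
first_text = div.find(string=True)   # first text node anywhere under the div
print(repr(str(first_child)), repr(str(first_text)))
```

Note that .contents[0] only works when the text happens to be the first child, while find(string=True) searches for the first text node wherever it sits.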


OK, thanks. I also found that I can do this:

   soup.find(text=lambda t: isinstance(t, basestring))

or this:

   soup.find(text=True)

it seems faster than doing this:

   [br.extract() for br in soup.findAll("br")]
   soup.string

but I may be wrong.
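One way to check rather than guess is to time both approaches. This is only a rough sketch, again assuming bs4 with html.parser instead of the BeautifulSoup 3 used above. The helper names first_text_node and strip_brs_then_string are made up for the example, and each run re-parses the markup because extract() modifies the tree.

```python
import timeit

from bs4 import BeautifulSoup

HTML = '<div class="text">hello ça boume<br /></div>'


def first_text_node():
    # Non-destructive: return the first text node in the document.
    soup = BeautifulSoup(HTML, "html.parser")
    return soup.find(string=True)


def strip_brs_then_string():
    # Destructive: remove every <br>, after which .string is unambiguous.
    soup = BeautifulSoup(HTML, "html.parser")
    for br in soup.find_all("br"):
        br.extract()
    return soup.string


for fn in (first_text_node, strip_brs_then_string):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s per 1000 runs")
```

Parsing time is included in both measurements, so on a snippet this small the difference may be mostly noise; a larger document with many <br> tags would give a fairer picture.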

Thanks again!
Gabriel
--
http://mail.python.org/mailman/listinfo/python-list
