Re: Beautiful Soup - close tags more promptly?

2022-10-25 Thread Chris Angelico
On Wed, 26 Oct 2022 at 04:59, Tim Delaney wrote: > > On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote: >> >> >> Ah, cool. Thanks. I'm not entirely sure of the various advantages and >> disadvantages of the different parsers; is there a tabulation >> anywhere, or at least a list of recommendatio

Re: Beautiful Soup - close tags more promptly?

2022-10-25 Thread Tim Delaney
On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote: > > Ah, cool. Thanks. I'm not entirely sure of the various advantages and > disadvantages of the different parsers; is there a tabulation > anywhere, or at least a list of recommendations on choosing a suitable > parser? > Coming to this a bit

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer wrote: > > One thing I find quite interesting, though, is the way that browsers > > *differ* in the face of bad nesting of tags. Recently I was struggling > > to figure out a problem with an HTML form, and eventually found that > > there was a spurious

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-25 06:56:58 +1100, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote: > > There may be several reasons: > > > > * Historically, some browsers differed in which end tags were actually > > optional. Since (AFAIK) no mainstream browser ever implemented a real >

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote: > There may be several reasons: > > * Historically, some browsers differed in which end tags were actually > optional. Since (AFAIK) no mainstream browser ever implemented a real > SGML parser (they were always "tag soup" parsers with lots o

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
Jon Ribbens via Python-list wrote on 24/10/2022 at 19:01: On 2022-10-24, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote: >> Adding in the omitted , , , , and >> would make no difference and there's no particular reason to recommend >> doing so as fa

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list > wrote: > > On 2022-10-24, Chris Angelico wrote: > > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: > > >> Yes, I got that. What I wanted to say was that this is indeed a bug

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Jon Ribbens via Python-list
On 2022-10-24, Chris Angelico wrote: > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list > wrote: >> >> On 2022-10-24, Chris Angelico wrote: >> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: >> >> Yes, I got that. What I wanted to say was that this is indeed a bug in >> >> html.p

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote: > > On 2022-10-24, Chris Angelico wrote: > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: > >> Yes, I got that. What I wanted to say was that this is indeed a bug in > >> html.parser and not an error (or sloppiness, as you

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Jon Ribbens via Python-list
On 2022-10-24, Chris Angelico wrote: > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: >> Yes, I got that. What I wanted to say was that this is indeed a bug in >> html.parser and not an error (or sloppiness, as you called it) in the >> input or ambiguity in the HTML standard. > > I describe

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote: > > On 2022-10-24 21:56:13 +1100, Chris Angelico wrote: > > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote: > > > Ron has already noted that the lxml and html5 parser do the right thing, > > > so just for the record: > > > > > > The HTML f

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote: > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote: > > Ron has already noted that the lxml and html5 parser do the right thing, > > so just for the record: > > > > The HTML fragment above is well-formed and contains a number of li > > element

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote: > Ron has already noted that the lxml and html5 parser do the right thing, > so just for the record: > > The HTML fragment above is well-formed and contains a number of li > elements at the same level directly below the ol element, not lots of >
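To make the point about omitted `</li>` end tags concrete, here is a minimal sketch using only the standard library's `html.parser`. The sample markup and the `EventLogger` / `log_events` names are made up for illustration and do not come from the thread. It only logs the events the tokenizer reports; it does not build a tree.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the start/end/data events html.parser reports (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the log stays short.
        if data.strip():
            self.events.append(("data", data.strip()))


def log_events(markup):
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical sample: valid HTML with the optional </li> end tags omitted.
SAMPLE = "<ol><li>one<li>two</ol>"

for event in log_events(SAMPLE):
    print(event)
```

No `("end", "li")` event shows up in the log: the tokenizer reports only what is literally in the input, so deciding where each `li` ends is left to whatever builds the tree on top of it.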

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote: > Ron has already noted that the lxml and html5 parser do the right thing, Oops, sorry. That was Roel. hp

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 13:29:13 +1100, Chris Angelico wrote: > Parsing ancient HTML files is something Beautiful Soup is normally > great at. But I've run into a small problem, caused by this sort of > sloppy HTML: > > from bs4 import BeautifulSoup > # See: https://gsarchive.net/gilbert/plays/princess/tenn

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
(Oops, accidentally only sent to Chris instead of to the list) On 24/10/2022 at 10:02, Chris Angelico wrote: On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote: > Using html5lib (install package html5lib) instead of html.parser seems > to do the trick: it inserts </li> right before the next <li>, and

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
On 24/10/2022 at 9:42, Roel Schroeven wrote: Using html5lib (install package html5lib) instead of html.parser seems to do the trick: it inserts </li> right before the next <li>, and one before the closing </ol>. On my system the same happens when I don't specify a parser, but IIRC that's a bit fragile beca
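A rough way to compare parsers on this kind of markup is to count the `li` elements that sit directly under the `ol`. This is only a sketch: the sample markup and the `direct_li_count` helper are invented here, and it assumes `bs4` is installed (plus `html5lib` if you want to pass that parser name).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a list with the optional </li> end tags omitted.
SAMPLE = "<ol><li>one<li>two<li>three</ol>"


def direct_li_count(markup, parser):
    """Return how many <li> elements are direct children of the first <ol>."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.ol.find_all("li", recursive=False))
```

Calling `direct_li_count(SAMPLE, "html.parser")` should report a single direct child, because each later `li` ends up nested inside the previous one, whereas `direct_li_count(SAMPLE, "html5lib")` should report all three as siblings. Naming the parser explicitly also avoids depending on whichever parser happens to be installed.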

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote: > > On 24/10/2022 at 4:29, Chris Angelico wrote: > > Parsing ancient HTML files is something Beautiful Soup is normally > > great at. But I've run into a small problem, caused by this sort of > > sloppy HTML: > > > > from bs4 import BeautifulSou

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven
On 24/10/2022 at 4:29, Chris Angelico wrote: Parsing ancient HTML files is something Beautiful Soup is normally great at. But I've run into a small problem, caused by this sort of sloppy HTML: from bs4 import BeautifulSoup # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm

Re: beautiful soup get class info

2014-03-12 Thread Peter Otten
Christopher Welborn wrote: > On 03/06/2014 02:22 PM, teddyb...@gmail.com wrote: >> I am using beautifulsoup to get the title and date of the website. >> title is working fine but I am not able to pull the date. Here is the >> code in the url: >> >> October 22, 2011 >> >> In Python, I am using th

Re: beautiful soup get class info

2014-03-11 Thread Christopher Welborn
On 03/06/2014 02:22 PM, teddyb...@gmail.com wrote: I am using beautifulsoup to get the title and date of the website. title is working fine but I am not able to pull the date. Here is the code in the url: October 22, 2011 In Python, I am using the following code: date1 = soup.span.text data=

Re: beautiful soup get class info

2014-03-06 Thread Mark Lawrence
On 07/03/2014 01:37, teddyb...@gmail.com wrote: On Thursday, March 6, 2014 4:28:06 PM UTC-6, John Gordon wrote: In writes: soup.find_all(name="span", class="date") I have python 2.7.2 and it does not like class in the code you provided. Oh right, 'class' is a reserved word. I ima

Re: beautiful soup get class info

2014-03-06 Thread teddybubu
On Thursday, March 6, 2014 4:28:06 PM UTC-6, John Gordon wrote: > In writes: > > > > > > soup.find_all(name="span", class="date") > > > > > I have python 2.7.2 and it does not like class in the code you provided. > > > > Oh right, 'class' is a reserved word. I imagine beautifulsoup has

Re: beautiful soup get class info

2014-03-06 Thread John Gordon
In teddyb...@gmail.com writes: > > soup.find_all(name="span", class="date") > I have python 2.7.2 and it does not like class in the code you provided. Oh right, 'class' is a reserved word. I imagine beautifulsoup has a workaround for that. > Now when I take out [ class="date"], this is retur
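For what it's worth, Beautiful Soup 4 does offer workarounds for the reserved word: a `class_` keyword argument (in reasonably recent bs4 releases), or an `attrs` dictionary, which is also the form older versions accept. The sketch below assumes bs4 is installed and that the date sits in a `<span class="date">`; the archive stripped the original markup, so that HTML is a guess.

```python
from bs4 import BeautifulSoup

# Assumed markup: the original HTML was stripped by the archive.
html = (
    "<h1>Title</h1>"
    '<span class="date">October 22, 2011</span>'
    '<span class="author">someone</span>'
)
soup = BeautifulSoup(html, "html.parser")

# 'class' is a Python keyword, so bs4 accepts 'class_' instead ...
by_keyword = soup.find_all("span", class_="date")
# ... or the attribute can be passed in a plain dict.
by_attrs = soup.find_all("span", attrs={"class": "date"})

dates = [tag.get_text() for tag in by_keyword]
print(dates)
```

Both calls select the same element here; the dict form is the more portable of the two.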

Re: beautiful soup get class info

2014-03-06 Thread teddybubu
On Thursday, March 6, 2014 2:58:12 PM UTC-6, John Gordon wrote: > In teddy writes: > > > > > October 22, 2011 > > > > > date1 = soup.span.text > > > data=soup.find_all(date="value") > > > > Try this: > > > > soup.find_all(name="span", class="date") > > > > -- > > John Gordon

Re: beautiful soup get class info

2014-03-06 Thread John Gordon
In teddyb...@gmail.com writes: > October 22, 2011 > date1 = soup.span.text > data=soup.find_all(date="value") Try this: soup.find_all(name="span", class="date") -- John Gordon

Re: Beautiful Soup Table Parsing

2012-08-09 Thread Andreas Perstinger
On 09.08.2012 01:58, Tom Russell wrote: For instance this code below: soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar')) table = soup.find("table",{"class": "mdcTable"}) for row in table.findAll("tr"): for cell in row.find

Re: Beautiful Soup Table Parsing

2012-08-08 Thread Dieter Maurer
Tom Russell writes: > I am parsing out a web page at > http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar > using BeautifulSoup. > > My problem is that I can parse into the table where the data I want > resides but I cannot seem to figure out how to go about grab

Re: beautiful soup

2009-07-31 Thread John Nagle
xubin.cz wrote: hi, everyone Is there any package or module handling html document like beautiful soup? Try "http://code.google.com/p/html5lib/". That's supposed to be a parser which complies with the HTML 5 specification, including its rules for handling bad HTML.

Re: beautiful soup

2009-07-30 Thread Masklinn
On 30 Jul 2009, at 09:30, Diez B. Roggisch wrote: xubin.cz wrote: hi, everyone Is there any package or module handling html document like beautiful soup? why don't you *use* beautiful soup? It is a module... Or lxml, which works a bit better than BS 3.1 (post parser change) nowadays.

Re: beautiful soup

2009-07-30 Thread Diez B. Roggisch
xubin.cz wrote: hi, everyone Is there any package or module handling html document like beautiful soup? why don't you *use* beautiful soup? It is a module... Diez -- http://mail.python.org/mailman/listinfo/python-list

Re: Beautiful soup : why does "string" not give me the string?

2009-04-01 Thread Gabriel Rossetti
Jeremiah Dodds wrote: On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti <gabriel.rosse...@arimaz.com> wrote: Hello everyone, I am using beautiful soup to parse some HTML and I came across something strange. Here is an illustration: >>> soup = BeautifulSoup(u'hello

Re: Beautiful soup : why does "string" not give me the string?

2009-04-01 Thread Jeremiah Dodds
On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti < gabriel.rosse...@arimaz.com> wrote: > Hello everyone, > > I am using beautiful soup to parse some HTML and I came across something > strange. > Here is an illustration: > > >>> soup = BeautifulSoup(u'hello ça boume >>> soup > hello ça boume > >>>

Re: Beautiful soup tag attributes - Dictionary?

2008-11-30 Thread Chris Rebert
On Sun, Nov 30, 2008 at 8:51 PM, killsto <[EMAIL PROTECTED]> wrote: > The documentation says I can find attributes of tags by using it as a > dictionary. Ex: > > product = p.findAll('dd') .findAll() produces a *list* of tags The example in the docs is: firstPTag, secondPTag = soup.findAll('p') w
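The list-versus-tag distinction can be shown in a few lines. This is a sketch with made-up markup, assuming bs4 is installed (where `findAll` is spelled `find_all`): the dictionary-style lookup works on each tag, not on the list that comes back.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    "<dl>"
    '<dd id="a1" title="first">Apple</dd>'
    '<dd id="b2" title="second">Pear</dd>'
    "</dl>"
)
soup = BeautifulSoup(html, "html.parser")

products = soup.find_all("dd")   # a list-like collection of tags
first_id = products[0]["id"]     # index the list first, then the tag

# .get() returns None instead of raising KeyError when an attribute is missing.
titles = [dd.get("title") for dd in products]
print(first_id, titles)
```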

Re: Beautiful Soup Looping Extraction Question

2008-03-25 Thread Stefan Behnel
Hi, again, not BS related, but still a solution. Tess wrote: > Let's say I have a file that looks at file.html pasted below. > > My goal is to extract all elements where the following is true: align="left"> and . Using lxml: from lxml import html tree = html.parse("file.html") for el in

Re: Beautiful Soup Looping Extraction Question

2008-03-25 Thread Tess
Paul - you are very right. I am back to the drawing board. Tess

Re: Beautiful Soup Looping Extraction Question

2008-03-24 Thread Paul McGuire
On Mar 24, 7:56 pm, Tess <[EMAIL PROTECTED]> wrote: > > Anyhow, a simple regex took care of the issue in BS: > > for i in soup.findAll(re.compile('^p|^div'),align=re.compile('^center| > ^left')): >     print i > But I thought you only wanted certain combinations: "My goal is to extract all elem

Re: Beautiful Soup Looping Extraction Question

2008-03-24 Thread Tess
Paul - thanks for the input, it's interesting to see how pyparsing handles it. Anyhow, a simple regex took care of the issue in BS: for i in soup.findAll(re.compile('^p|^div'),align=re.compile('^center|^left')): print i Thanks again! T

Re: Beautiful Soup Looping Extraction Question

2008-03-24 Thread Paul McGuire
On Mar 24, 6:32 pm, Tess <[EMAIL PROTECTED]> wrote: > Hello All, > > I have a Beautiful Soup question and I'd appreciate any guidance the > forum can provide. > I *know* you're using Beautiful Soup, and I *know* that BS is the de facto HTML parser/processor library. But, I just couldn't help

Re: Beautiful Soup iterator question....

2007-04-20 Thread Paul McGuire
On Apr 20, 2:05 pm, Steve Holden <[EMAIL PROTECTED]> wrote: > > did you try something like (untested) > > cell1, cell2, cell3, cell4, cell5, \ > cell6, cell7, cell8 = row.findAll("td") > > No need for the "for" if you want to handle each cell differently, you > won't be iterating o
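The unpacking idea translates directly to bs4. Below is a hedged sketch with an invented eight-cell row; it checks the cell count first so that a short or malformed row is skipped rather than raising `ValueError` during unpacking.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one row with exactly eight cells, c1 .. c8.
html = (
    "<table><tr>"
    + "".join(f"<td>c{i}</td>" for i in range(1, 9))
    + "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) != 8:
        continue  # skip header or malformed rows
    # Name the first three cells, which need special handling; keep the rest together.
    first, second, third, *rest = (cell.get_text(strip=True) for cell in cells)
    rows.append((first, second, third, rest))

print(rows)
```

`get_text()` also flattens data nested inside other tags, which is one way to deal with the first three cells.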

Re: Beautiful Soup iterator question....

2007-04-20 Thread Steve Holden
cjl wrote: > P: > > I am screen-scraping a table. The table has an unknown number of rows, > but each row has exactly 8 cells. I would like to extract the data > from the cells, but the first three cells in each row have their data > nested inside other tags. > > So I have the following code: >

Re: Beautiful Soup Question: Filtering Images based on their width and height attributes

2006-12-04 Thread David Coffin
> Hello, > > I want to extract some image links from different html pages, in > particular I want to extract those image tags whose height values are > greater than 200. Is there an elegant way in BeautifulSoup to do this? Yes. soup.findAll(lambda tag: tag.name=="img" and tag.has_key("height") an
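In current Beautiful Soup the same filter-function idea still works, although `has_key` has given way to `has_attr` / `get`. A sketch under assumptions: bs4 is installed, the markup is invented, and `is_tall_image` is a hypothetical helper name. It ignores images whose height is missing or not a plain integer (such as a percentage).

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering the awkward cases.
html = (
    '<img src="a.png" height="250">'
    '<img src="b.png" height="100">'
    '<img src="c.png">'               # no height attribute at all
    '<img src="d.png" height="50%">'  # height that is not a plain number
)


def is_tall_image(tag, min_height=200):
    """True for <img> tags whose numeric height exceeds min_height."""
    if tag.name != "img":
        return False
    height = tag.get("height", "")
    return height.isdigit() and int(height) > min_height


soup = BeautifulSoup(html, "html.parser")
tall = [img["src"] for img in soup.find_all(is_tall_image)]
print(tall)
```

As noted elsewhere in the thread, many images in the wild carry no height attribute, so a filter like this only sees what the page author declared.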

Re: Beautiful Soup Question: Filtering Images based on their width and height attributes

2006-11-30 Thread Fredrik Lundh
Chris Mellon wrote: >> I want to extract some image links from different html pages, in >> particular I want to extract those image tags whose height values are >> greater than 200. Is there an elegant way in BeautifulSoup to do this? > > Most image tags "in the wild" don't have height attributes, y

Re: Beautiful Soup Question: Filtering Images based on their width and height attributes

2006-11-30 Thread Chris Mellon
On 30 Nov 2006 12:43:45 -0800, PicURLPy <[EMAIL PROTECTED]> wrote: > Hello, > > I want to extract some image links from different html pages, in > particular I want to extract those image tags whose height values are > greater than 200. Is there an elegant way in BeautifulSoup to do this? > Most imag

Re: beautiful soup library question

2006-03-10 Thread Enigma Curry
Here's how I print each line after the <br>'s: import BeautifulSoup as Soup page=open("test.html").read() soup=Soup.BeautifulSoup(page) for br in soup.fetch('br'): print br.next
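With the current bs4 package the equivalent would look roughly like the sketch below, where `fetch` becomes `find_all` and `br.next` becomes `br.next_sibling`. The markup is invented because the archive stripped the original, and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: text lines separated by <br> tags.
html = "<p>intro<br>this info<br>more info<br>and more info</p>"
soup = BeautifulSoup(html, "html.parser")

lines = []
for br in soup.find_all("br"):
    following = br.next_sibling
    # Keep only non-empty text nodes that directly follow a <br>.
    if isinstance(following, NavigableString) and following.strip():
        lines.append(following.strip())

print(lines)
```

The `isinstance` check matters when a `<br>` is followed by another tag, or by nothing at all, rather than by text.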

Re: beautiful soup library question

2006-03-10 Thread Erik Max Francis
[EMAIL PROTECTED] wrote: > I'm trying to extract some information from an html file using > beautiful soup. The strings I want get are after br tags, eg: > > > this info > more info > and more info > > > I can navigate to the first br tag using find_next_sibling, but how do > I ge