On Wed, 26 Oct 2022 at 04:59, Tim Delaney wrote:
>
> On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote:
>>
>>
>> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
>> disadvantages of the different parsers; is there a tabulation
>> anywhere, or at least a list of recommendations on choosing a suitable
>> parser?
On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote:
>
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?
>
Coming to this a bit
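Briefly, the trade-offs usually cited for Beautiful Soup's tree builders are: html.parser (in the standard library, no install, moderately lenient), lxml (third-party, very fast), and html5lib (third-party, slowest, but recovers from broken markup the way browsers do). A minimal sketch for comparing them on the same tag soup — only the html.parser builder is guaranteed to be available:

```python
from bs4 import BeautifulSoup

sloppy = "<ol><li>one<li>two<li>three</ol>"

# Built-in parser -- no extra install needed:
soup = BeautifulSoup(sloppy, "html.parser")

# All three <li> tags are found regardless of how the builder nests them:
print(len(soup.find_all("li")))  # 3

# If installed, swap in the other tree builders and compare the trees:
#   BeautifulSoup(sloppy, "lxml")      # fast, lenient
#   BeautifulSoup(sloppy, "html5lib")  # slow, parses like a browser
```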
On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer wrote:
> > One thing I find quite interesting, though, is the way that browsers
> > *differ* in the face of bad nesting of tags. Recently I was struggling
> > to figure out a problem with an HTML form, and eventually found that
> > there was a spurious
On 2022-10-25 06:56:58 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote:
> > There may be several reasons:
> >
> > * Historically, some browsers differed in which end tags were actually
> > optional. Since (AFAIK) no mainstream browser ever implemented a real
>
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote:
> There may be several reasons:
>
> * Historically, some browsers differed in which end tags were actually
> optional. Since (AFAIK) no mainstream browser ever implemented a real
> SGML parser (they were always "tag soup" parsers with lots o
Jon Ribbens via Python-list schreef op 24/10/2022 om 19:01:
On 2022-10-24, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
wrote:
>> Adding in the omitted , , , , and
>> would make no difference and there's no particular reason to recommend
>> doing so as fa
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> wrote:
> > On 2022-10-24, Chris Angelico wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug
On 2022-10-24, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> wrote:
>>
>> On 2022-10-24, Chris Angelico wrote:
>> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>> >> Yes, I got that. What I wanted to say was that this is indeed a bug in
>> >> html.parser and not an error (or sloppyness, as you called it) in the
>> >> input or ambiguity in the HTML standard.
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
wrote:
>
> On 2022-10-24, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> >> html.parser and not an error (or sloppyness, as you called it) in the
> >> input or ambiguity in the HTML standard.
On 2022-10-24, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>> Yes, I got that. What I wanted to say was that this is indeed a bug in
>> html.parser and not an error (or sloppyness, as you called it) in the
>> input or ambiguity in the HTML standard.
>
> I describe
On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>
> On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> > > Ron has already noted that the lxml and html5 parser do the right thing,
> > > so just for the record:
> > >
> > > The HTML fragment above is well-formed and contains a number of li
> > > elements at the same level directly below the ol element, not lots of
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> > Ron has already noted that the lxml and html5 parser do the right thing,
> > so just for the record:
> >
> > The HTML fragment above is well-formed and contains a number of li
> > elements at the same level directly below the ol element, not lots of
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
> so just for the record:
>
> The HTML fragment above is well-formed and contains a number of li
> elements at the same level directly below the ol element, not lots of
>
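The well-formedness point can be checked with the standard library tokenizer alone: html.parser reports three li start-tag events and no li end-tag events for such a fragment, so inferring where each li element *ends* is left entirely to the tree builder. A minimal sketch:

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record start- and end-tag events exactly as the tokenizer sees them."""
    def __init__(self):
        super().__init__()
        self.starts, self.ends = [], []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)

    def handle_endtag(self, tag):
        self.ends.append(tag)

p = TagLogger()
p.feed("<ol><li>one<li>two<li>three</ol>")
print(p.starts)  # ['ol', 'li', 'li', 'li']
print(p.ends)    # ['ol'] -- no </li> events: the tokenizer doesn't infer them
```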
On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
^^^
Oops, sorry. That was Roel.
hp
--
Peter J. Holzer | Story must make more sense than reality.
h...@hjp.
On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
(Oops, accidentally only sent to Chris instead of to the list)
Op 24/10/2022 om 10:02 schreef Chris Angelico:
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven
wrote:
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and
Op 24/10/2022 om 9:42 schreef Roel Schroeven:
Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile beca
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote:
>
> Op 24/10/2022 om 4:29 schreef Chris Angelico:
> > Parsing ancient HTML files is something Beautiful Soup is normally
> > great at. But I've run into a small problem, caused by this sort of
> > sloppy HTML:
> >
> > from bs4 import BeautifulSoup
Op 24/10/2022 om 4:29 schreef Chris Angelico:
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:
from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
Christopher Welborn wrote:
> On 03/06/2014 02:22 PM, teddyb...@gmail.com wrote:
>> I am using beautifulsoup to get the title and date of the website.
>> title is working fine but I am not able to pull the date. Here is the
>> code in the url:
>>
>> <span class="date">October 22, 2011</span>
>>
>> In Python, I am using the following code:
On 03/06/2014 02:22 PM, teddyb...@gmail.com wrote:
I am using beautifulsoup to get the title and date of the website.
title is working fine but I am not able to pull the date. Here is the code in
the url:
<span class="date">October 22, 2011</span>
In Python, I am using the following code:
date1 = soup.span.text
data=soup.find_all(date="value")
On 07/03/2014 01:37, teddyb...@gmail.com wrote:
On Thursday, March 6, 2014 4:28:06 PM UTC-6, John Gordon wrote:
In teddyb...@gmail.com writes:
soup.find_all(name="span", class="date")
I have python 2.7.2 and it does not like class in the code you provided.
Oh right, 'class' is a reserved word. I imagine beautifulsoup has
a workaround for that.
On Thursday, March 6, 2014 4:28:06 PM UTC-6, John Gordon wrote:
> In teddyb...@gmail.com writes:
>
> > > soup.find_all(name="span", class="date")
>
> > I have python 2.7.2 and it does not like class in the code you provided.
>
> Oh right, 'class' is a reserved word. I imagine beautifulsoup has
In teddyb...@gmail.com writes:
> > soup.find_all(name="span", class="date")
> I have python 2.7.2 and it does not like class in the code you provided.
Oh right, 'class' is a reserved word. I imagine beautifulsoup has
a workaround for that.
> Now when I take out [ class="date"], this is returned:
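For the record, the reason `class="date"` is a SyntaxError is that `class` is a Python keyword, so it cannot be used as a keyword argument. Beautiful Soup 4 accepts a trailing underscore (`class_`), and the attrs-dict form works as well (in the BS3 era of this thread, the attrs-dict form was the way to do it). A short sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="date">October 22, 2011</span>',
                     "html.parser")

# BS4 keyword workaround: trailing underscore, since 'class' is reserved
date1 = soup.find_all("span", class_="date")[0].text

# Equivalent attrs-dict form (also worked in older Beautiful Soup versions)
date2 = soup.find_all("span", attrs={"class": "date"})[0].text

print(date1)  # October 22, 2011
print(date2)  # October 22, 2011
```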
On Thursday, March 6, 2014 2:58:12 PM UTC-6, John Gordon wrote:
> In teddyb...@gmail.com writes:
>
> > <span class="date">October 22, 2011</span>
>
> > date1 = soup.span.text
> > data=soup.find_all(date="value")
>
> Try this:
>
> soup.find_all(name="span", class="date")
>
> --
> John Gordon
In teddyb...@gmail.com writes:
> <span class="date">October 22, 2011</span>
> date1 = soup.span.text
> data=soup.find_all(date="value")
Try this:
soup.find_all(name="span", class="date")
--
John Gordon        Imagine what it must be like for a real medical doctor to
gor...@panix.com   watch 'House', or a real se
On 09.08.2012 01:58, Tom Russell wrote:
For instance this code below:
soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
table = soup.find("table",{"class": "mdcTable"})
for row in table.findAll("tr"):
    for cell in row.find
Tom Russell writes:
> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grab
xubin.cz wrote:
hi, everyone
Is there any package or module handling HTML documents like Beautiful
Soup?
Try "http://code.google.com/p/html5lib/".
That's supposed to be a parser which complies with the HTML 5 specification,
including its rules for handling bad HTML.
On 30 Jul 2009, at 09:30 , Diez B. Roggisch wrote:
xubin.cz schrieb:
hi, everyone
Is there any package or module handling HTML documents like Beautiful
Soup?
why don't you *use* beautiful soup? It is a module...
Or lxml, which works a bit better than BS 3.1 (post parser change)
nowadays.
xubin.cz schrieb:
hi, everyone
Is there any package or module handling HTML documents like Beautiful
Soup?
why don't you *use* beautiful soup? It is a module...
Diez
--
http://mail.python.org/mailman/listinfo/python-list
Jeremiah Dodds wrote:
On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti
<gabriel.rosse...@arimaz.com> wrote:
Hello everyone,
I am using beautiful soup to parse some HTML and I came across
something strange.
Here is an illustration:
>>> soup = BeautifulSoup(u'hello
On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti <
gabriel.rosse...@arimaz.com> wrote:
> Hello everyone,
>
> I am using beautiful soup to parse some HTML and I came across something
> strange.
> Here is an illustration:
>
> >>> soup = BeautifulSoup(u'hello ça boume
> >>> soup
> hello ça boume
> >>>
On Sun, Nov 30, 2008 at 8:51 PM, killsto <[EMAIL PROTECTED]> wrote:
> The documentation says I can find attributes of tags by using it as a
> dictionary. Ex:
>
> product = p.findAll('dd')
.findAll() produces a *list* of tags
The example in the docs is:
firstPTag, secondPTag = soup.findAll('p')
w
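To make the distinction concrete (a sketch with made-up markup): dict-style attribute access works on a single Tag, not on the list that findAll() returns, so either index into the list first or unpack it when the element count is known:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<dd id="a">x</dd><dd id="b">y</dd>', "html.parser")

products = soup.find_all("dd")    # a list of Tags, not one Tag
first, second = products          # unpacking works when the count matches

print(first["id"], second["id"])  # dict-style lookup on each individual Tag
```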
Hi,
again, not BS related, but still a solution.
Tess wrote:
> Let's say I have a file that looks like the file.html pasted below.
>
> My goal is to extract all elements where the following is true:
> <p align="left"> and <div align="center">.
Using lxml:
from lxml import html
tree = html.parse("file.html")
for el in
Paul - you are very right. I am back to the drawing board. Tess
On Mar 24, 7:56 pm, Tess <[EMAIL PROTECTED]> wrote:
>
> Anyhow, a simple regex took care of the issue in BS:
>
> for i in soup.findAll(re.compile('^p|^div'), align=re.compile('^center|^left')):
>     print i
>
But I thought you only wanted certain combinations:
"My goal is to extract all elem
Paul - thanks for the input, it's interesting to see how pyparser
handles it.
Anyhow, a simple regex took care of the issue in BS:
for i in soup.findAll(re.compile('^p|^div'), align=re.compile('^center|^left')):
    print i
Thanks again!
T
On Mar 24, 6:32 pm, Tess <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> I have a Beautiful Soup question and I'd appreciate any guidance the
> forum can provide.
>
I *know* you're using Beautiful Soup, and I *know* that BS is the de
facto HTML parser/processor library. But, I just couldn't help
On Apr 20, 2:05 pm, Steve Holden <[EMAIL PROTECTED]> wrote:
>
> did you try something like (untested)
>
> cell1, cell2, cell3, cell4, cell5, \
> cell6, cell7, cell8 = row.findAll("td")
>
> No need for the "for" if you want to handle each cell differently; you
> won't be iterating o
cjl wrote:
> P:
>
> I am screen-scraping a table. The table has an unknown number of rows,
> but each row has exactly 8 cells. I would like to extract the data
> from the cells, but the first three cells in each row have their data
> nested inside other tags.
>
> So I have the following code:
>
> Hello,
>
> I want to extract some image links from different html pages, in
> particular i want extract those image tags which height values are
> greater than 200. Is there an elegant way in BeautifulSoup to do this?
Yes.
soup.findAll(lambda tag: tag.name=="img" and tag.has_key("height")
             and int(tag["height"]) > 200)
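That filter uses the old BS2/BS3 API (findAll, has_key). A sketch of the same filter in today's Beautiful Soup 4, with made-up markup:

```python
from bs4 import BeautifulSoup

html = ('<img src="big.png" height="300">'
        '<img src="small.png" height="100">'
        '<img src="nosize.png">')
soup = BeautifulSoup(html, "html.parser")

# has_attr() replaced has_key(); find_all() replaced findAll()
tall = soup.find_all(lambda tag: tag.name == "img"
                     and tag.has_attr("height")
                     and int(tag["height"]) > 200)

print([img["src"] for img in tall])  # ['big.png']
```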
Chris Mellon wrote:
>> I want to extract some image links from different html pages, in
>> particular i want extract those image tags which height values are
>> greater than 200. Is there an elegant way in BeautifulSoup to do this?
>
> Most image tags "in the wild" don't have height attributes, y
On 30 Nov 2006 12:43:45 -0800, PicURLPy <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I want to extract some image links from different html pages, in
> particular i want extract those image tags which height values are
> greater than 200. Is there an elegant way in BeautifulSoup to do this?
>
Most image tags "in the wild" don't have height attributes,
Here's how I print each line after the <br>'s:
import BeautifulSoup as Soup
page=open("test.html").read()
soup=Soup.BeautifulSoup(page)
for br in soup.fetch('br'):
    print br.next
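In Beautiful Soup 4 the same approach still works, with fetch() renamed to find_all() and the node after each <br> reached via .next_sibling (a sketch with made-up markup):

```python
from bs4 import BeautifulSoup

html = "<br>this info<br>more info<br>and more info"
soup = BeautifulSoup(html, "html.parser")

# <br> is a void element, so the text that follows it is its sibling,
# not its child; .next_sibling returns that NavigableString:
for br in soup.find_all("br"):
    print(br.next_sibling)
```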
[EMAIL PROTECTED] wrote:
> I'm trying to extract some information from an html file using
> beautiful soup. The strings I want to get are after br tags, eg:
>
>
> <br>this info
> <br>more info
> <br>and more info
>
>
> I can navigate to the first br tag using find_next_sibling, but how do
> I ge