On Mon, Oct 5, 2015 at 9:14 AM, Skip Montanaro wrote:
> I wouldn't be surprised if there were some small API changes other than the
> name change caused by the move into the xml package. Before I dive into a
> rabbit hole and start to modify elementtidy, is there some other stdlib-only
> way to parse…
Back before Fredrik Lundh's elementtree module was sucked into the Python
stdlib as xml.etree, I used to use his elementtidy extension module to
clean up HTML source so it could be parsed into an ElementTree object.
Elementtidy hasn't been updated in about ten years, and still assumes there
is a module…
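Absent elementtidy, one stdlib-only fallback is html.parser, which tolerates tag soup but yields parse events rather than an ElementTree. A minimal Python 3 sketch (class name and sample input are invented for illustration):

from html.parser import HTMLParser

class TextGrabber(HTMLParser):
    """Collect text chunks from messy HTML; tolerant of unbalanced tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

p = TextGrabber()
p.feed("<p>broken <b>html<p>is fine")   # no exception despite the tag soup
print(p.chunks)                          # ['broken', 'html', 'is fine']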
On Mon, 14 Dec 2009 03:58:34 -0300, Johann Spies
wrote:
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:
cell.findAll(text=True) returns a list of all text nodes inside a
cell; I preprocess all \n and &nbsp; in each text node, and
join them all. lines is a list of lists (each…
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:
> this code should serve as a starting point:
Thank you very much!
> cell.findAll(text=True) returns a list of all text nodes inside a
> cell; I preprocess all \n and &nbsp; in each text node, and
> join them all. lines is a list of…
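A minimal sketch of the approach Gabriel describes, using the BeautifulSoup 3 API of the era (the sample cell markup is invented; BeautifulSoup 3 leaves the &nbsp; entity as literal text, hence the string replace):

from BeautifulSoup import BeautifulSoup  # bs4 spells it find_all

soup = BeautifulSoup("<td>line one<br />line&nbsp;two\n</td>")
cell = soup.td

texts = cell.findAll(text=True)   # every text node inside the cell
# preprocess \n and &nbsp; in each text node, then join them all
clean = [t.replace("\n", " ").replace("&nbsp;", " ") for t in texts]
print(" ".join(clean).strip())    # line one line two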
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies
wrote:
Gabriel Genellina wrote:
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies
wrote:
How do I get BeautifulSoup to render (taking the above line as
example)
sunentint for sunetint
and still provide the text-parts in the…
Gabriel Genellina wrote:
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies
wrote:
How do I get BeautifulSoup to render (taking the above line as
example)
sunentint for sunetint
and still provide the text-parts in the <td>'s with plain text?
Hard to tell if we don't see what's inside…
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies
wrote:
How do I get BeautifulSoup to render (taking the above line as
example)
sunentint for sunetint
and still provide the text-parts in the <td>'s with plain text?
Hard to tell if we don't see what's inside those <td>'s - please provide
at…
I am trying to get CSV output from an HTML file.
With this code I had a little success:
from BeautifulSoup import BeautifulSoup
from string import replace, join   # Python 2 era; str methods do this now
import re

f = open("configuration.html", "r")
g = open("configuration.csv", "w")
soup = BeautifulSoup(f)
t = soup…
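A sketch of one way to finish the job, again with the BeautifulSoup 3 API plus the stdlib csv module; the row and cell handling is assumed, since the original code is cut off here:

import csv
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open("configuration.html", "r"))

rows = []
for tr in soup.findAll("tr"):
    # join all text nodes in each cell, collapsing runs of whitespace
    cells = [" ".join("".join(td.findAll(text=True)).split())
             for td in tr.findAll("td")]
    if cells:
        rows.append(cells)

g = open("configuration.csv", "wb")   # binary mode for csv under Python 2
csv.writer(g).writerows(rows)
g.close()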
…ions should allow
you to do this:
import lxml.etree as et

parser = et.HTMLParser()                    # was "etree.HTMLParser()"; the module was imported as "et"
tree = et.parse("somefile.html", parser)    # was "h.parse", another stray name
text = tree.xpath("string( some/xpath )")   # placeholder; the original XPath was mangled by the archive's address scrubber
lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.
Stefan
On Apr 6, 11:03 pm, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> Benjamin wrote:
> > I'm trying to parse an HTML file. I want to retrieve all of the text
> > inside a certain tag that I find with XPath. The DOM seems to make
> > this available with the innerHTML element, but I haven't found a way
>
On Apr 3, 9:10 pm, 7stud <[EMAIL PROTECTED]> wrote:
> On Apr 3, 12:39 am, [EMAIL PROTECTED] wrote:
>
> > BeautifulSoup does what I need it to. Though, I was hoping to find
> > something that would let me work with the DOM the way JavaScript can
> > work with web browsers' implementations of the DOM…
Benjamin wrote:
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.
import lxml.html as h

tree = h.parse("somefile.html")   # filename assumed; the preview cuts off here
…
On Apr 3, 12:39 am, [EMAIL PROTECTED] wrote:
> BeautifulSoup does what I need it to. Though, I was hoping to find
> something that would let me work with the DOM the way JavaScript can
> work with web browsers' implementations of the DOM. Specifically, I'd
> like to be able to access the innerHTML…
On Wed, 2008-04-02 at 21:59 -0700, Benjamin wrote:
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.
I use ElementTree…
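If the input is well-formed (X)HTML, the stdlib route looks roughly like this; itertext() needs ElementTree 1.3+ (Python 2.7/3.2), and the file and tag names are just examples:

import xml.etree.ElementTree as ET

# ElementTree wants well-formed markup; it will not accept tag soup
tree = ET.parse("somefile.html")
el = tree.find(".//div")              # example target tag
text = "".join(el.itertext())         # all text inside the element, like textContent
print(text)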
On 3 Apr, 06:59, Benjamin <[EMAIL PROTECTED]> wrote:
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.
With libxml…
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser…
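The stdlib HTMLParser can fake innerHTML by tracking depth inside the target tag and re-emitting what it sees. A sketch; the class is invented here, not a library API:

from HTMLParser import HTMLParser   # html.parser in Python 3

class InnerHTML(HTMLParser):
    """Collect a rough innerHTML for the first <tag>...</tag> region."""
    def __init__(self, tag):
        HTMLParser.__init__(self)
        self.tag, self.depth, self.parts = tag, 0, []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth -= 1
        if self.depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

p = InnerHTML("div")
p.feed("<div>a <b>bold</b> word</div>")
print("".join(p.parts))   # a <b>bold</b> word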
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.
Have you tried http://www.google.com/search?q=python+html+parse
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath
Stefan
I see there are a couple of tools I could use, and I also heard of
sgmllib and htmllib. So now there is lxml, Beautiful Soup, sgmllib,
htmllib ...
Do any of those tools do the job I need more easily, and what should
I use? Maybe a combination of those tools; which one
is better fo…
>
> http://codespeak.net/lxml/dev/parsing.html#parsing-html
I stand corrected, I missed that whole part of the LXML documentation :-)
Jay Loden wrote:
> Someone else mentioned lxml but as I understand it lxml will only work if
> it's valid XHTML that they're working with.
No, it was meant as the OP requested. It even has a very good parser for
broken HTML.
http://codespeak.net/lxml/dev/parsing.html#parsing-html
Neil Cerutti wrote:
> You could get good results, and save yourself some effort, using
> links or lynx with the command line options to dump page text to
> a file. Python would still be needed to automate calling links or
> lynx on all your documents.
OP was looking for a way to parse out part of…
On 2007-06-18, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I work at this company and we are re-building our website: http://caslt.org/.
> The new website will be built by an external firm (I could do it
> myself, but since I'm just the summer student worker...). Anyways, to
> help them, they first asked me…
Hi,
I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyways, to
help them, they first asked me to copy all the text from all the pages
of the site…
[EMAIL PROTECTED] wrote:
> I work at this company and we are re-building our website: http://caslt.org/.
> The new website will be built by an external firm (I could do it
> myself, but since I'm just the summer student worker...). Anyways, to
> help them, they first asked me to copy all the text from…
[EMAIL PROTECTED] wrote:
> So, I'm writing this to have your opinion on what tools I should use
> to do this and what technique I should use.
Take a look at the parsing example on this page:
http://wiki.python.org/moin/SimplePrograms
--
HTH,
Rob
Stefan Behnel wrote:
> [EMAIL PROTECTED] wrote:
>> I need to parse real-world HTML/XML documents and I found two nice Python
>> solutions: BeautifulSoup and Tidy.
>
> There's also lxml, in case you want a real XML tool.
> http://codespeak.net/lxml/
> http://codespeak.net/lxml/dev/parsing.html#parsers
[EMAIL PROTECTED] wrote:
> I need to parse real-world HTML/XML documents and I found two nice Python
> solutions: BeautifulSoup and Tidy.
There's also lxml, in case you want a real XML tool.
http://codespeak.net/lxml/
http://codespeak.net/lxml/dev/parsing.html#parsers
> However I found pyXPCOM that…
I need to parse real-world HTML/XML documents and I found two nice Python
solutions: BeautifulSoup and Tidy.
However I found pyXPCOM, which is a wrapper for Gecko. So I was thinking
Gecko surely handles bad HTML in a more consistent and error-proof way
than BS and Tidy.
I'm interested in using Mozilla…
BeautifulSoup does parse HTML well, but there are a few issues:
1. It's rather slow; it can take seconds of CPU time to parse
some larger web pages.
2. There's no error reporting. It tries to do the right thing,
but when it doesn't, you have no idea what went wrong.
BeautifulSoup…
On Feb 8, 11:43 am, "metaperl" <[EMAIL PROTECTED]> wrote:
> On Feb 8, 2:38 pm, "mtuller" <[EMAIL PROTECTED]> wrote:
>
> > I am trying to parse a webpage and extract information.
>
> BeautifulSoup is a great Python module for this purpose:
>
>http://www.crummy.com/software/BeautifulSoup/
>
> Here…
mtuller wrote:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
>
> [table markup eaten by the archive]
> LETTER
> 33,699
> 1.0
>
> What is shown is only a small section.
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable…
On Feb 10, 5:03 pm, "mtuller" <[EMAIL PROTECTED]> wrote:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
>
> [table markup eaten by the archive]
> LETTER
> 33,699
> 1.0
>
> What is shown is only a small section.
>
> I want to extract the 33,699 (w…
"mtuller" <[EMAIL PROTECTED]> on 10 Feb 2007 15:03:36 -0800 didst
step forth and proclaim thus:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
[snip]
> What is shown is only a small section.
>
> I want to extract the 33,699…
Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:
LETTER
33,699
1.0
What is shown is only a small section.
I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database…
On Feb 8, 4:15 pm, "mtuller" <[EMAIL PROTECTED]> wrote:
> I was asking how to escape the quotation marks. I have everything
> working in pyparser except for that. I don't want to drop everything
> and go to a different parser.
>
> Can someone else help?
>
>
Mike -
pyparsing includes a helper for c…
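The helper Paul is presumably pointing at is makeHTMLTags, which builds tag expressions that tolerate attributes and either quoting style, sidestepping the quotation-mark problem. A sketch with invented sample markup:

from pyparsing import makeHTMLTags, SkipTo

# matched open/close expressions for <span ...> / </span>; attributes and
# single- or double-quoted values are handled for you
spanStart, spanEnd = makeHTMLTags("span")
printCount = spanStart + SkipTo(spanEnd).setResultsName("body") + spanEnd

html = '<span class="count">33,699</span>'    # invented sample row
print(printCount.searchString(html)[0].body)  # 33,699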
I was asking how to escape the quotation marks. I have everything
working in pyparser except for that. I don't want to drop everything
and go to a different parser.
Can someone else help?
>
> > I am trying to parse a webpage and extract information.
>
> BeautifulSoup is a great Python module for this purpose…
On Feb 8, 2:38 pm, "mtuller" <[EMAIL PROTECTED]> wrote:
> I am trying to parse a webpage and extract information.
BeautifulSoup is a great Python module for this purpose:
http://www.crummy.com/software/BeautifulSoup/
Here's an article on screen scraping using it:
http://iwiwdsmi.blogsp…
I am trying to parse a webpage and extract information. I am trying to
use pyparsing. Here is what I have:
from pyparsing import *
import urllib

# define basic text pattern; the archive ate the literal tags here,
# which were presumably '<span>' and '</span>'
spanStart = Literal('<span>')
spanEnd = Literal('</span>')
printCount = spanStart + SkipTo(spanEnd) + spanEnd
# get printer addresses
p…
[EMAIL PROTECTED] wrote:
> Hello,
>
> I am having some difficulty creating a regular expression for the
> following string situation in html. I want to find a table that has
> specific text in it and then extract the html just for that immediate
> table.
>
> the string would look something like this…
[EMAIL PROTECTED] wrote:
> Hello,
>
> I am having some difficulty creating a regular expression for the
> following string situation in html. I want to find a table that has
> specific text in it and then extract the html just for that immediate
> table.
>
> the string would look something like this…
Hi Steve,
[EMAIL PROTECTED] wrote:
> I am having some difficulty creating a regular expression for the
> following string situation in html. I want to find a table that has
> specific text in it and then extract the html just for that immediate
> table.
Any reason why you can't use a real HTML parser…
Hello,
I am having some difficulty creating a regular expression for the
following string situation in html. I want to find a table that has
specific text in it and then extract the html just for that immediate
table.
the string would look something like this:
[table markup eaten by the archive]
...stuff here...
...stuff here...
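Since several replies push toward a real parser instead of a regex, here is roughly what that looks like with the BeautifulSoup 3 API; the file name and search text are placeholders:

from BeautifulSoup import BeautifulSoup   # bs4: from bs4 import BeautifulSoup

soup = BeautifulSoup(open("page.html"))   # placeholder input file

# find the text node, then walk up to the table that immediately contains it
target = soup.find(text=lambda t: "specific text" in t)
if target is not None:
    table = target.findParent("table")
    print(str(table))                     # the HTML for just that table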
Fredrik Lundh wrote:
> the only difference between the libs (*) is that HTMLParser is a bit
> stricter
*) "the libs" referring to htmllib and HTMLParser, not htmllib and sgmllib.
Kenneth McDonald wrote:
> The problem I'm having with HTMLParser is simple; I don't seem to be
> getting the actual text in the HTML document. I've implemented the
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
> it never seems to receive any data. Is there another way…
from HTMLParser import HTMLParser   # html.parser in Python 3

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        # handle_data (not do_data) receives the text between tags
        data = data.strip()
        if data:   # non-empty after stripping; the extra len() check was redundant
            self.TokenList.append(data)
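Driving it, under the assumption that the document lives in a local file (name invented):

p = MyHTMLParser()
p.feed(open("somefile.html").read())   # invented input file
p.close()
print(p.TokenList)                     # every non-whitespace text chunk, in order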
I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. The two Python modules I'm aware of to do this are
HTMLParser and htmllib. However, I'm currently experiencing either real
or conceptual difficulty with both, and was wondering if I could get
some advice.
T…
yaffa wrote:
> does anyone have sample code for parsing an html file to get contents
> of a td field to write to a mysql db? even if you have everything but
> the mysql db part I'll take it.
http://www.crummy.com/software/BeautifulSoup/examples.html
On 4 Aug 2005 11:54:38 -0700, yaffa <[EMAIL PROTECTED]> wrote:
> does anyone have sample code for parsing an html file to get contents
> of a td field to write to a mysql db? even if you have everything but
> the mysql db part I'll take it.
>
Do you want something like this?
In [1]: x = "someth…
yaffa <[EMAIL PROTECTED]> wrote:
> does anyone have sample code for parsing an html file to get contents
> of a td field to write to a mysql db? even if you have everything but
> the mysql db part I'll take it.
I usually use Expat XML parser to extract the field.
http://home.eol.ca/~parkw/ind…
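A rough idea of the Expat route with the stdlib's xml.parsers.expat, assuming the page is well-formed enough for a strict XML parser (the input file name is invented):

import xml.parsers.expat

fields = []      # character data found inside <td> elements
in_td = False

def start(name, attrs):
    global in_td
    if name == "td":
        in_td = True

def end(name):
    global in_td
    if name == "td":
        in_td = False

def chars(data):
    if in_td:
        fields.append(data)

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start
p.EndElementHandler = end
p.CharacterDataHandler = chars
p.Parse(open("page.html").read(), True)   # invented input; must be well-formed
print(fields)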
does anyone have sample code for parsing an html file to get contents
of a td field to write to a mysql db? even if you have everything but
the mysql db part I'll take it.
thanks
yaffa
Thanks for the replies, I'll post here when/if I get it finally
working.
So, now I know how to extract the links from the big page, and extract
the text from the individual page. Really what I need to find out is
how to run the script on each individual page automatically, and get the
output in comm…
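The glue is just a fetch loop over the index page. A sketch where the link regex is naive and extract_fields() is a placeholder for the per-page scraping the poster says already works:

import csv
import re
import urllib

# naive relative-link extraction from the full-list page
link_re = re.compile(r'href="(store\.asp[^"]*)"')

def extract_fields(html):
    return [len(html)]   # placeholder: swap in the real field extraction

index = urllib.urlopen("http://www.rentalhq.com/fulllist.asp").read()
out = csv.writer(open("stores.csv", "wb"))

for path in link_re.findall(index):
    page = urllib.urlopen("http://www.rentalhq.com/" + path).read()
    out.writerow(extract_fields(page))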
Pyparsing includes a sample program for extracting URLs from web pages.
You should be able to adapt it to this problem.
Download pyparsing at http://pyparsing.sourceforge.net
-- Paul
samuels <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> I am a total python newbie, and I need help writing a script.
>
> This is what I want to do:
>
> There is a list of links at http://www.rentalhq.com/fulllist.asp. Each
> link goes to a page like,
> http://www.rentalhq.com/store.asp?id=907%2F27…
Hello All,
I am a total python newbie, and I need help writing a script.
This is what I want to do:
There is a list of links at http://www.rentalhq.com/fulllist.asp. Each
link goes to a page like,
http://www.rentalhq.com/store.asp?id=907%2F272%2D4425, that contains a
company name, address, phone…
[EMAIL PROTECTED] writes:
> I am trying to extract some information from a few web pages, and I was
> using the HTMLParser module. It worked fine until it got to the
> javascript, at which it gave a parse error. Is there a good way to work
> around this or should I just preparse the file to remove…
<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> I am trying to extract some information from a few web pages, and I was
> using the HTMLParser module. It worked fine until it got to the
> javascript, at which it gave a parse error.
It's fairly common for pages with Javascript to al…
I am trying to extract some information from a few web pages, and I was
using the HTMLParser module. It worked fine until it got to the
javascript, at which it gave a parse error. Is there a good way to work
around this or should I just preparse the file to remove the javascript
manually? This is my…
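One way to do the preparse step the poster mentions: strip <script> blocks with a regex before feeding HTMLParser. Crude but workable for scraping; the parser subclass and input file here are stand-ins:

import re
from HTMLParser import HTMLParser   # html.parser in Python 3

# drop <script>...</script> blocks so the parser never sees raw javascript
script_re = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)

class TextDumper(HTMLParser):       # stand-in for the poster's real parser
    def handle_data(self, data):
        if data.strip():
            print(data.strip())

html = open("page.html").read()     # placeholder input file
parser = TextDumper()
parser.feed(script_re.sub("", html))
parser.close()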