Re: Clojure HTML Parser (Xpath) + HTTP client?

dmix Sun, 23 Aug 2009 08:17:54 -0700

Thanks DTH.

Fortunately the HTML I am parsing is clean and it's consistently the
same pages being scraped.


Saxon seems to be most in line with what I'm looking for, since it
handles XPath 2.0, so I'll try it out first. Otherwise I might have to
use a Java library directly.
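
For reference, here is a minimal sketch of what that plain-JRE fallback
might look like, using the javax.xml.* route DTH mentions below. It
assumes the page is already well-formed XHTML held in a string (messy
HTML would need something like tagsoup in front of it), and that XPath
1.0 is enough, since the built-in javax.xml.xpath engine does not
implement 2.0. The class name, method name, and sample markup are made
up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse a well-formed page and evaluate one XPath against it.
final class ImgSrcXPath {

    // Returns the string value of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String firstImgSrc(String xhtml, String expression) throws Exception {
        // The default factory is not namespace-aware, so plain names like "img" match.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Sample markup standing in for a page fetched by an HTTP client.
        String page = "<html><body><img id=\"logo\" src=\"logo.png\"/></body></html>";
        System.out.println(firstImgSrc(page, "//img[@id='logo']/@src"));
    }
}
```

From Clojure the same classes should be reachable through ordinary Java
interop, so this would not need a wrapper library.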

-Dan


On Aug 23, 3:53 am, DTH wrote:
> There are a number of options, depending on your needs:
>
> - the standard JRE libraries for xml parsing / xpath (javax.xml.*).
> These have the benefit of having seen wide usage (outside of clojure),
> and would allow you to migrate existing xpaths over unchanged.
>
> - clojure.xml - a more clojuresque way of parsing and working with xml
>
> - clojure.zip - which can take the xml from above (in addition to many
> other things) and provides a functional way of traversing and editing
> the resulting tree of elements.
>
> - clojure.contrib.zip_filter.xml - provides a means to extract data
> from clojure.xml structures using a syntax loosely similar to xpath.
>
> For working with html, I've had good experiences with c.x / c.c.zf.x,
> using tagsoup (http://home.ccil.org/~cowan/XML/tagsoup/) as the
> SAXParser in order to deal with non-xml compliant documents.
>
> If performance is your aim, you might want to investigate the
> clojure/saxon library (http://github.com/pjt/saxon/tree/master),
> possibly combined with tagsoup again to deal with dodgy html. Your
> message implies that you mainly want to retrieve documents and extract
> a set of data from each using relatively static expressions
> (presumably the bulk of your business logic deals with processing this
> data). If this is indeed the case, then you could use saxon to load the
> documents returned by your http client and execute the XPaths, which I
> would imagine will be faster than using zippers. You could also, of
> course, simply use the javax.xml.* libraries above directly to load
> the document and evaluate the xpath.
>
> -DTH
>
> On Aug 23, 2:02 am, dmix wrote:
>
> > I am planning on migrating an app from ruby to clojure (for
> > performance and to learn clojure) and before I proceed I wanted to
> > make sure a few libraries are available.
>
> > One crucial part of the app is fetching a URL to return the page's
> > HTML (<html><body>...etc). Then I need to grab a certain element off
> > the page using an xpath. For example, a specific image's src=" ".
>
> > I found an http client on github but I haven't found any HTML parser,
> > does anyone know if one exists?
