Re: Clojure HTML Parser (Xpath) + HTTP client?

DTH Sun, 23 Aug 2009 06:28:58 -0700

There are a number of options, depending on your needs:

- the standard JRE libraries for xml parsing / xpath (javax.xml.*).
These have the benefit of having seen wide usage (outside of clojure),
and would allow you to migrate existing xpaths over unchanged.

- clojure.xml - a more clojuresque way of parsing and working with xml

- clojure.zip - which can take the xml from above (in addition to many
other things) and provides a functional way of traversing and editing
the resulting tree of elements.

- clojure.contrib.zip-filter.xml - provides a means to extract data
from clojure.xml structures using a syntax loosely similar to xpath.
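
To make the first option concrete, here is a minimal sketch of
evaluating an xpath with javax.xml.xpath alone. It is written in Java
because that is the API's native form; the same calls are reachable
from Clojure through ordinary Java interop. The class name, method
name and sample markup are made up for illustration, and the input is
assumed to be well-formed XML (real-world html usually is not, which
is where tagsoup comes in).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library mentioned in this thread.
class XPathSketch {

    // Returns the string value of the first node matched by the expression,
    // or the empty string when nothing matches.
    // Assumes the markup is well-formed XML.
    static String firstMatch(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page.
        String page = "<html><body><img id=\"logo\" src=\"/images/logo.png\"/></body></html>";
        System.out.println(firstMatch(page, "//img[@id='logo']/@src")); // prints /images/logo.png
    }
}
```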

For working with html, I've had good experiences with clojure.xml and
clojure.contrib.zip-filter.xml, using tagsoup
(http://home.ccil.org/~cowan/XML/tagsoup/) as the SAXParser in order
to deal with documents that are not well-formed xml.

If performance is your aim, you might want to investigate the
clojure/saxon library (http://github.com/pjt/saxon/tree/master),
possibly combined with tagsoup again to deal with dodgy html. Your
message implies that you mainly want to retrieve documents and extract
a set of data from each using relatively static expressions, with the
bulk of your business logic presumably processing that data. If this
is indeed the case, you could use saxon to load the documents returned
by your http client and execute the XPaths, which I would imagine will
be faster than using zippers. You could also, of course, simply use
the javax.xml.* libraries above directly to load the document and
evaluate the xpath.
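
Since the expressions are described as relatively static, one way to
use javax.xml.* directly is to compile each xpath once and reuse it
for every document retrieved. The following is only a sketch under
that assumption: the class and method names and the sample pages are
hypothetical, the input is assumed to be well-formed (run it through
tagsoup first otherwise), and neither the compiled expression nor the
DocumentBuilder should be shared between threads.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical extractor: one compiled xpath applied to many documents.
class CompiledXPathSketch {
    private final XPathExpression expression; // compiled once, reused per document
    private final DocumentBuilder builder;    // reused sequentially, not thread-safe

    CompiledXPathSketch(String xpath) {
        try {
            expression = XPathFactory.newInstance().newXPath().compile(xpath);
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (XPathExpressionException | ParserConfigurationException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Parses one well-formed document and returns the value of every matched node.
    List<String> extract(String xml) {
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) expression.evaluate(doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (SAXException | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        CompiledXPathSketch imageSources = new CompiledXPathSketch("//img/@src");
        // Made-up pages standing in for what the http client would return.
        System.out.println(imageSources.extract(
            "<html><body><img src=\"/a.png\"/><img src=\"/b.png\"/></body></html>"));
        System.out.println(imageSources.extract(
            "<html><body><p>no images here</p></body></html>"));
    }
}
```

Whether this is fast enough compared with saxon is something you
would have to measure on your own documents.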

-DTH

On Aug 23, 2:02 am, dmix <liftedme...@gmail.com> wrote:
> I am planning on migrating an app from ruby to clojure (for
> performance and to learn clojure) and before I proceed I wanted to
> make sure a few libraries are available.
>
> One crucial part of the app is parsing a URL to return the pages HTML
> (<html><body>...etc). Then I need to grab a certain element off the
> page using an xpath. For example a specific images src=" ".
>
> I found an http client on github but I haven't found any HTML parser,
> does anyone know if one exists?
