Dear Clojurians,

I'm happy to announce the availability of the first release of Skyscraper, 
an Enlive-based library for "structural scraping" -- extracting information 
from whole sites in a structural way.

Homepage / GitHub: https://github.com/nathell/skyscraper
Leiningen: [skyscraper "0.1.0"]
Clojars: https://clojars.org/skyscraper

From the README:

What is structural scraping? Think of Enlive. It allows you to parse 
arbitrary HTML and extract various bits of information out of it: subtrees 
or parts of subtrees determined by selectors. You can then convert this 
information to some other format, easier for machine consumption, or 
process it in whatever other way you wish. This is called scraping.

Now imagine that you have to parse a lot of HTML documents. They all come 
from the same site, so most of them are structured in the same way and can 
be scraped using the same sets of selectors. But not all of them. There’s 
an index page, which has a different layout and needs to be treated in its 
own peculiar way, with pagination and all. There are pages that group 
together individual pages into categories. And so on. Treating single pages 
is easy, but with whole collections of pages, you quickly find yourself 
writing a lot of boilerplate code.

In particular, you realize that you can’t just wget -r the whole thing and 
then parse each page in turn. Rather, you want to simulate the workflow of 
a user who tries to “click through” the website to obtain the information 
she’s interested in. Sites have a tree-like structure, and you want to keep 
track of this structure as you traverse the site, and reflect it in your 
output. I call it “structural scraping”.

This is where Skyscraper comes in.

Happy using,
--Daniel Janus

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en