Looks interesting, thank you.

On Tue, Aug 11, 2015 at 5:00 PM, Daniel Janus <nath...@gmail.com> wrote:

> Dear Clojurians,
>
> I'm happy to announce the availability of the first release of Skyscraper,
> an Enlive-based library for "structural scraping" -- extracting information
> from whole sites in a structural way.
>
> Homepage / GitHub: https://github.com/nathell/skyscraper
> Leiningen: [skyscraper "0.1.0"]
> Clojars: https://clojars.org/skyscraper
>
> From the README:
>
> What is structural scraping? Think of Enlive. It allows you to parse
> arbitrary HTML and extract various bits of information out of it: subtrees
> or parts of subtrees determined by selectors. You can then convert this
> information to some other format, easier for machine consumption, or
> process it in whatever other way you wish. This is called scraping.
>
> Now imagine that you have to parse a lot of HTML documents. They all come
> from the same site, so most of them are structured in the same way and can
> be scraped using the same sets of selectors. But not all of them. There’s
> an index page, which has a different layout and needs to be treated in its
> own peculiar way, with pagination and all. There are pages that group
> together individual pages in categories. And so on. Treating single pages
> is easy, but with whole collections of pages, you quickly find yourself
> writing a lot of boilerplate code.
>
> In particular, you realize that you can’t just wget -r the whole thing and
> then parse each page in turn. Rather, you want to simulate the workflow of
> a user who tries to “click through” the website to obtain the information
> she’s interested in. Sites have a tree-like structure, and you want to keep
> track of this structure as you traverse the site, and reflect it in your
> output. I call it “structural scraping”.
>
> This is where Skyscraper comes in.
>
> Happy using,
> --Daniel Janus
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
