Looks interesting, thank you.

On Tue, Aug 11, 2015 at 5:00 PM, Daniel Janus <nath...@gmail.com> wrote:
> Dear Clojurians,
>
> I'm happy to announce the availability of the first release of Skyscraper,
> an Enlive-based library for "structural scraping" -- extracting information
> from whole sites in a structural way.
>
> Homepage / GitHub: https://github.com/nathell/skyscraper
> Leiningen: [skyscraper "0.1.0"]
> Clojars: https://clojars.org/skyscraper
>
> From the README:
>
> What is structural scraping? Think of Enlive. It allows you to parse
> arbitrary HTML and extract various bits of information out of it: subtrees
> or parts of subtrees determined by selectors. You can then convert this
> information to some other format, easier for machine consumption, or
> process it in whatever other way you wish. This is called scraping.
>
> Now imagine that you have to parse a lot of HTML documents. They all come
> from the same site, so most of them are structured in the same way and can
> be scraped using the same sets of selectors. But not all of them. There’s
> an index page, which has a different layout and needs to be treated in its
> own peculiar way, with pagination and all. There are pages that group
> together individual pages in categories. And so on. Treating single pages
> is easy, but with whole collections of pages, you quickly find yourself
> writing a lot of boilerplate code.
>
> In particular, you realize that you can’t just wget -r the whole thing and
> then parse each page in turn. Rather, you want to simulate the workflow of
> a user who tries to “click through” the website to obtain the information
> she’s interested in. Sites have tree-like structure, and you want to keep
> track of this structure as you traverse the site, and reflect it in your
> output. I call it “structural scraping”.
>
> This is where Skyscraper comes in.
>
> Happy using,
> --Daniel Janus
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.