Thanks DJ, I think this is a good approach for scraping hierarchical (or not) sets of webpages and boiling them down to maps.
On Tuesday, August 25, 2015 at 7:55:36 AM UTC+10, Daniel Janus wrote:
>
> [Reusing the relatively new thread to publish information about new
> release:]
>
> Skyscraper 0.1.1 is now out. New in this release:
>
>    - Processors (process-fn functions) can now access current context.
>    - Skyscraper now uses clj-http <https://github.com/dakrone/clj-http>
>    to issue HTTP GET requests.
>    - Skyscraper can now auto-detect page encoding thanks to clj-http’s
>    decode-body-headers feature.
>    - scrape now supports a http-options argument to override HTTP
>    options (e.g., timeouts).
>    - Skyscraper’s output is now fully lazy (i.e., guaranteed to be
>    non-chunking).
>    - Fixed a bug where relative URLs were incorrectly resolved in certain
>    circumstances.
>
> Happy using,
> -dj
>
> On Tuesday, 11 August 2015 at 19:29:03 UTC+2, Sergey Didenko wrote:
>>
>> Looks interesting, thank you.
>>
>> On Tue, Aug 11, 2015 at 5:00 PM, Daniel Janus <nat...@gmail.com> wrote:
>>
>>> Dear Clojurians,
>>>
>>> I'm happy to announce the availability of the first release of
>>> Skyscraper, an Enlive-based library for "structural scraping" -- extracting
>>> information from whole sites in a structural way.
>>>
>>> Homepage / GitHub: https://github.com/nathell/skyscraper
>>> Leiningen: [skyscraper "0.1.0"]
>>> Clojars: https://clojars.org/skyscraper
>>>
>>> From the README:
>>>
>>> What is structural scraping? Think of Enlive. It allows you to parse
>>> arbitrary HTML and extract various bits of information out of it: subtrees
>>> or parts of subtrees determined by selectors. You can then convert this
>>> information to some other format, easier for machine consumption, or
>>> process it in whatever other way you wish. This is called scraping.
>>>
>>> Now imagine that you have to parse a lot of HTML documents. They all
>>> come from the same site, so most of them are structured in the same way and
>>> can be scraped using the same sets of selectors. But not all of them.
>>> There’s an index page, which has a different layout and needs to be treated
>>> in its own peculiar way, with pagination and all. There are pages that
>>> group together individual pages in categories. And so on. Treating single
>>> pages is easy, but with whole collections of pages, you quickly find
>>> yourself writing a lot of boilerplate code.
>>>
>>> In particular, you realize that you can’t just wget -r the whole thing
>>> and then parse each page in turn. Rather, you want to simulate the workflow
>>> of a user who tries to “click through” the website to obtain the
>>> information she’s interested in. Sites have tree-like structure, and you
>>> want to keep track of this structure as you traverse the site, and reflect
>>> it in your output. I call it “structural scraping”.
>>>
>>> This is where Skyscraper comes in.
>>>
>>> Happy using,
>>> --Daniel Janus
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To post to this group, send email to clo...@googlegroups.com
>>> Note that posts from new members are moderated - please be patient with
>>> your first post.
>>> To unsubscribe from this group, send email to
>>> clojure+u...@googlegroups.com
>>> For more options, visit this group at
>>> http://groups.google.com/group/clojure?hl=en
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to clojure+u...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.