[Reusing this relatively new thread to publish information about the new release:]
Skyscraper 0.1.1 is now out. New in this release:

- Processors (process-fn functions) can now access the current context.
- Skyscraper now uses clj-http <https://github.com/dakrone/clj-http> to issue HTTP GET requests.
- Skyscraper can now auto-detect page encoding thanks to clj-http’s decode-body-headers feature.
- scrape now supports an http-options argument to override HTTP options (e.g., timeouts).
- Skyscraper’s output is now fully lazy (i.e., guaranteed to be non-chunking).
- Fixed a bug where relative URLs were incorrectly resolved in certain circumstances.

Happy using,
-dj

On Tuesday, 11 August 2015 19:29:03 UTC+2, Sergey Didenko wrote:
>
> Looks interesting, thank you.
>
> On Tue, Aug 11, 2015 at 5:00 PM, Daniel Janus <nat...@gmail.com> wrote:
>
>> Dear Clojurians,
>>
>> I'm happy to announce the availability of the first release of
>> Skyscraper, an Enlive-based library for "structural scraping" -- extracting
>> information from whole sites in a structural way.
>>
>> Homepage / GitHub: https://github.com/nathell/skyscraper
>> Leiningen: [skyscraper "0.1.0"]
>> Clojars: https://clojars.org/skyscraper
>>
>> From the README:
>>
>> What is structural scraping? Think of Enlive. It allows you to parse
>> arbitrary HTML and extract various bits of information out of it: subtrees
>> or parts of subtrees determined by selectors. You can then convert this
>> information to some other format, easier for machine consumption, or
>> process it in whatever other way you wish. This is called scraping.
>>
>> Now imagine that you have to parse a lot of HTML documents. They all come
>> from the same site, so most of them are structured in the same way and can
>> be scraped using the same sets of selectors. But not all of them. There’s
>> an index page, which has a different layout and needs to be treated in its
>> own peculiar way, with pagination and all. There are pages that group
>> together individual pages in categories. And so on.
>> Treating single pages
>> is easy, but with whole collections of pages, you quickly find yourself
>> writing a lot of boilerplate code.
>>
>> In particular, you realize that you can’t just wget -r the whole thing
>> and then parse each page in turn. Rather, you want to simulate the workflow
>> of a user who tries to “click through” the website to obtain the
>> information she’s interested in. Sites have tree-like structure, and you
>> want to keep track of this structure as you traverse the site, and reflect
>> it in your output. I call it “structural scraping”.
>>
>> This is where Skyscraper comes in.
>>
>> Happy using,
>> --Daniel Janus
>>
>

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.