Re: [ANN] Skyscraper 0.1.0, a library for scraping entire websites

Daniel Janus Mon, 24 Aug 2015 14:56:13 -0700

[Reusing this relatively new thread to announce a new release:]


Skyscraper 0.1.1 is now out.  New in this release:

   - Processors (process-fn functions) can now access current context.
   - Skyscraper now uses clj-http <https://github.com/dakrone/clj-http> to 
   issue HTTP GET requests. 
      - Skyscraper can now auto-detect page encoding thanks to clj-http’s 
      decode-body-headers feature.
      - scrape now supports an http-options argument to override HTTP 
      options (e.g., timeouts).
   - Skyscraper’s output is now fully lazy (i.e., guaranteed to be 
   non-chunking).
   - Fixed a bug where relative URLs were incorrectly resolved in certain 
   circumstances.
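
To illustrate what "non-chunking" means in the laziness item above, here is a
minimal sketch in plain Clojure. It does not use Skyscraper and is not
necessarily how Skyscraper achieves it; unchunk and realized-count are
hypothetical helper names made up for this example. Mapping over a chunked
seq (such as the seq of a vector) processes a whole chunk ahead of the
consumer, while a seq rebuilt one cons cell at a time processes only what is
asked for:

```clojure
;; NOTE: unchunk and realized-count are illustrative helpers,
;; not part of Skyscraper or clojure.core.

(defn unchunk
  "Returns a lazy seq over coll that yields one element at a time,
  hiding any chunking of the underlying seq."
  [coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (cons (first s) (unchunk (rest s))))))

(defn realized-count
  "Maps a counting function over coll, takes only the first result,
  and returns how many elements were actually processed."
  [coll]
  (let [n (atom 0)]
    (first (map (fn [x] (swap! n inc) x) coll))
    @n))

(realized-count (vec (range 100)))           ; => 32 (a whole chunk)
(realized-count (unchunk (vec (range 100)))) ; => 1
```

For a scraper this matters because each element may cost an HTTP request, so
processing 31 extra elements means 31 downloads the consumer never asked for.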

Happy using,
-dj

On Tuesday, 11 August 2015 19:29:03 UTC+2, Sergey Didenko wrote:
>
> Looks interesting, thank you.
>
> On Tue, Aug 11, 2015 at 5:00 PM, Daniel Janus <nat...@gmail.com> wrote:
>
>> Dear Clojurians,
>>
>> I'm happy to announce the availability of the first release of 
>> Skyscraper, an Enlive-based library for "structural scraping" -- extracting 
>> information from whole sites in a structural way.
>>
>> Homepage / GitHub: https://github.com/nathell/skyscraper
>> Leiningen: [skyscraper "0.1.0"]
>> Clojars: https://clojars.org/skyscraper
>>
>> From the README:
>>
>> What is structural scraping? Think of Enlive. It allows you to parse 
>> arbitrary HTML and extract various bits of information out of it: subtrees 
>> or parts of subtrees determined by selectors. You can then convert this 
>> information to some other format, easier for machine consumption, or 
>> process it in whatever other way you wish. This is called scraping.
>>
>> Now imagine that you have to parse a lot of HTML documents. They all come 
>> from the same site, so most of them are structured in the same way and can 
>> be scraped using the same sets of selectors. But not all of them. There’s 
>> an index page, which has a different layout and needs to be treated in its 
>> own peculiar way, with pagination and all. There are pages that group 
>> together individual pages in categories. And so on. Treating single pages 
>> is easy, but with whole collections of pages, you quickly find yourself 
>> writing a lot of boilerplate code.
>>
>> In particular, you realize that you can’t just wget -r the whole thing 
>> and then parse each page in turn. Rather, you want to simulate the workflow 
>> of a user who tries to “click through” the website to obtain the 
>> information she’s interested in. Sites have a tree-like structure, and you 
>> want to keep track of this structure as you traverse the site, and reflect 
>> it in your output. I call it “structural scraping”.
>>
>> This is where Skyscraper comes in.
>>
>> Happy using,
>> --Daniel Janus
>>
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clo...@googlegroups.com
>> Note that posts from new members are moderated - please be patient with 
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+u...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to clojure+u...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

