Re: [ANN] Skyscraper 0.1.0, a library for scraping entire websites

Bryan Maass Mon, 24 Aug 2015 18:10:40 -0700

Thanks DJ,

I think this is a good approach for scraping sets of webpages (hierarchical 
or not) and boiling them down to maps.
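
As a rough illustration of "boiling a hierarchy of pages down to maps", here is a minimal sketch of the traversal idea from the README excerpt quoted further down in this thread. It is written in Python for brevity and does not use Skyscraper's API. The names SITE and scrape_site are hypothetical, and an in-memory dictionary stands in for real HTTP requests and Enlive selectors. Each page contributes some fields to a context that is carried down the tree, and each leaf page yields one flat map.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (fields extracted from the page,
# relative links to follow from it). A real scraper would fetch and parse HTML.
SITE = {
    "http://example.com/": ({}, ["cat/a.html", "cat/b.html"]),
    "http://example.com/cat/a.html": ({"category": "A"}, ["../item/1.html"]),
    "http://example.com/cat/b.html": ({"category": "B"},
                                      ["../item/2.html", "../item/3.html"]),
    "http://example.com/item/1.html": ({"title": "one"}, []),
    "http://example.com/item/2.html": ({"title": "two"}, []),
    "http://example.com/item/3.html": ({"title": "three"}, []),
}


def scrape_site(url, context=None):
    """Lazily yield one flat dict per leaf page, merged with ancestor fields."""
    fields, links = SITE[url]
    # Fields found on this page extend the context inherited from parent pages.
    merged = {**(context or {}), **fields}
    if not links:
        # Leaf page: emit the accumulated context as a single map.
        yield {**merged, "url": url}
        return
    for link in links:
        # Relative links are resolved against the URL of the current page.
        yield from scrape_site(urljoin(url, link), merged)


for row in scrape_site("http://example.com/"):
    print(row)
```

Because scrape_site is a generator, pages are only visited as results are consumed, which loosely mirrors the lazy output mentioned in the release notes below.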


On Tuesday, August 25, 2015 at 7:55:36 AM UTC+10, Daniel Janus wrote:
>
> [Reusing this relatively new thread to publish information about the new 
> release:]
>
> Skyscraper 0.1.1 is now out.  New in this release:
>
>    - Processors (process-fn functions) can now access current context.
>    - Skyscraper now uses clj-http <https://github.com/dakrone/clj-http> 
>    to issue HTTP GET requests. 
>       - Skyscraper can now auto-detect page encoding thanks to clj-http’s 
>       decode-body-headers feature.
>       - scrape now supports an http-options argument to override HTTP 
>       options (e.g., timeouts).
>    - Skyscraper’s output is now fully lazy (i.e., guaranteed to be 
>    non-chunking).
>    - Fixed a bug where relative URLs were incorrectly resolved in certain 
>    circumstances.
>
> Happy using,
> -dj
>
> On Tuesday, 11 August 2015 at 19:29:03 UTC+2, Sergey Didenko wrote:
>>
>> Looks interesting, thank you.
>>
>> On Tue, Aug 11, 2015 at 5:00 PM, Daniel Janus <nat...@gmail.com> wrote:
>>
>>> Dear Clojurians,
>>>
>>> I'm happy to announce the availability of the first release of 
>>> Skyscraper, an Enlive-based library for "structural scraping" -- extracting 
>>> information from whole sites in a structural way.
>>>
>>> Homepage / GitHub: https://github.com/nathell/skyscraper
>>> Leiningen: [skyscraper "0.1.0"]
>>> Clojars: https://clojars.org/skyscraper
>>>
>>> From the README:
>>>
>>> What is structural scraping? Think of Enlive. It allows you to parse 
>>> arbitrary HTML and extract various bits of information out of it: subtrees 
>>> or parts of subtrees determined by selectors. You can then convert this 
>>> information to some other format, easier for machine consumption, or 
>>> process it in whatever other way you wish. This is called scraping.
>>>
>>> Now imagine that you have to parse a lot of HTML documents. They all 
>>> come from the same site, so most of them are structured in the same way and 
>>> can be scraped using the same sets of selectors. But not all of them. 
>>> There’s an index page, which has a different layout and needs to be treated 
>>> in its own peculiar way, with pagination and all. There are pages that 
>>> group together individual pages in categories. And so on. Treating single 
>>> pages is easy, but with whole collections of pages, you quickly find 
>>> yourself writing a lot of boilerplate code.
>>>
>>> In particular, you realize that you can’t just wget -r the whole thing 
>>> and then parse each page in turn. Rather, you want to simulate the workflow 
>>> of a user who tries to “click through” the website to obtain the 
>>> information she’s interested in. Sites have a tree-like structure, and you 
>>> want to keep track of this structure as you traverse the site and reflect 
>>> it in your output. I call it “structural scraping”.
>>>
>>> This is where Skyscraper comes in.
>>>
>>> Happy using,
>>> --Daniel Janus
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To post to this group, send email to clo...@googlegroups.com
>>> Note that posts from new members are moderated - please be patient with 
>>> your first post.
>>> To unsubscribe from this group, send email to
>>> clojure+u...@googlegroups.com
>>> For more options, visit this group at
>>> http://groups.google.com/group/clojure?hl=en
>>>
>>
>>

