The following module was proposed for inclusion in the Module List:

  modid:       WWW::HtmlUnit::Spidey
  DSLIP:       adpho
  description: Web scraping library, scalable, JS support
  userid:      NINUZZO (Antonio Bonifati)
  chapterid:   15 (World_Wide_Web_HTML_HTTP_CGI)
  communities:
  similar:     WWW::HtmlUnit::Sweet

  rationale:

    This module builds upon WWW::HtmlUnit to provide an easy-to-use
    interface to the Java web scraping library HtmlUnit, so it is
    appropriate to put it under the WWW::HtmlUnit namespace. My approach
    was to use multiple programming paradigms (functional, declarative
    and object-based) to devise a Domain-Specific Language for writing
    scalable web crawlers with good JavaScript support, which ATTOW is
    lacking in every other Perl web scraping toolkit except
    WWW::HtmlUnit::Sweet.

    I have asked Brock Wilcox <awwa...@thelackthereof.org> for
    permission to use his namespace prefix WWW::HtmlUnit and he agreed.
    He reckons Spidey is different enough from WWW::HtmlUnit::Sweet and
    a welcome alternative. In fact, I departed from any Mechanize-like
    syntax for good reasons:

    * a multi-paradigm DSL produces spiders that are easier to develop,
      maintain and debug;

    * mimicking the Mechanize interface would be restrictive unless one
      extended it to provide additional features, but then it would
      become incompatible;

    * HtmlUnit is quite different from Mechanize; fitting the interface
      of the former into the latter would be a contortion with no
      advantages;

    * interchangeability with Mechanize is not possible, because spiders
      written with Spidey will usually rely on JavaScript support,
      something that Mechanize does not have and will not have in the
      near future. For example, if JavaScript is needed to submit a
      form, Mechanize cannot handle it directly, while Spidey will,
      without requiring you to write additional code in your spider to
      emulate the JS behaviour.

    Comparison with WWW::HtmlUnit::Sweet: both Sweet and Spidey support
    JS through HtmlUnit, but while the former is targeted at web
    testing, the latter is specific to web harvesting. In fact Spidey is
    not only a headless browser with JS support, but also offers
    facilities for data extraction, conversion, logging and debugging.
    All these features are needed to write robust batch-mode web
    scrapers that harvest data from the currently unstructured WWW.

  enteredby:   NINUZZO (Antonio Bonifati)
  enteredon:   Sat Mar 12 18:54:39 2011 GMT

The resulting entry would be:

  WWW::HtmlUnit::
  ::Spidey          adpho  Web scraping library, scalable, JS support  NINUZZO

Thanks for registering,
--
The PAUSE

PS: The following links are only valid for module list maintainers:

Registration form with editing capabilities:
  https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d6500000_b3ea9e859868b6e2&SUBMIT_pause99_add_mod_preview=1

Immediate (one click) registration:
  https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d6500000_b3ea9e859868b6e2&SUBMIT_pause99_add_mod_insertit=1

Peek at the current permissions:
  https://pause.perl.org/pause/authenquery?pause99_peek_perms_by=me&pause99_peek_perms_query=WWW%3A%3AHtmlUnit%3A%3ASpidey