On Wednesday, December 9, 2015 at 6:33:02 PM UTC-8, Neil Van Dyke wrote:
> David K. Storrs wrote on 12/09/2015 08:50 PM:
> > 1) Is there a web-spidering package that people recommend? I could use
> > wget and then parse things from disk, but I'd like to have something that's
> > easily composable into CLI scripts.
>
> I've done a lot of Web crawling and scraping successfully with Racket
> and Scheme, over the last 14-15 years. I released an HTML parser
> ("http://www.neilvandyke.org/racket-html-parsing/"), which I still use
> today. From that parse, you might then extract the info you need with
> `sxml-match`
> ("http://planet.racket-lang.org/display.ss?package=sxml-match.plt&owner=jim")
> and/or SXPath.

Thank you; I've been rolling through the docs and playing around with these,
and they seem really useful. One question though -- I stumbled across a
mention of the sxml/html module while I was reading, but had no luck
installing it. None of the following worked:

    (require sxml/html)
    $ raco pkg install sxml/html
    $ raco pkg install 'sxml/html'   # Maybe the shell was having trouble with '/'?

I don't know that I need it, but I'd like to know how to deal with modules
like this in future.

> For HTTP, the client modules in Racket are often
> satisfactory, and other times I've used my own packages that implement
> HTTP in pure Racket or that wrap `curl` or `wget` for special
> requirements. For storing pages and links/metadata, there's the
> filesystem, the core Racket RDBMS database support, and cloud stores
> like AWS S3. The un-AJAX-ing and site-specific scraping behavior you
> might have to do yourself, if you need it. (I have a backlog of related
> tools to release someday.)

Great, thank you. Yeah, I'd really like to be able to automate posting to
Patreon. (Every week I publish a chapter of my novel there.) Unfortunately,
their whole site is pointlessly AJAX. I spent some time Firebugging their
code to see what the relevant calls were and then decided to spend my time
on something more useful.

From what you say it sounds like there's no "magically make stupid AJAX /
DOM-manipulating sites easy to deal with" module for Racket? Something that
processed the site and handed back the final HTML as the browser gets it
post-JS would be lovely. It's a bit much to ask for, I realize.

Thanks again for all this -- it's a big help.

Dave