>>>>> "Chris" == Chris Evans <chrish...@psyctc.org> writes:
Chris> I use R a great deal but the huge web-crawling power of it isn't an
Chris> area I've used. I don't want to reinvent a cyberwheel and I suspect
Chris> someone has done what I want. That is, a program that would run once
Chris> a day (easy for me to set up as a cron task) and would crawl a single
Chris> root of a web site (mine) and get the file size and a CRC or some
Chris> similar check value for each page as pulled off the site (and,
Chris> obviously, I'd want it not to follow off-site links). The other key
Chris> thing would be for it to store the values and URLs and be capable of
Chris> being run in "create/update database" mode or in "check pages" mode,
Chris> and for the check-mode run to Email me a warning if a page changes.
Chris> The reason I want this is that two of my sites have recently had
Chris> content "disappear": neither I nor the ISP can see what's happened,
Chris> and we are lacking the very useful diagnostic of the date when the
Chris> change happened, which might have mapped it to some component of
Chris> WordPress, plugins or themes having updated.
Chris>
Chris> I am failing to find any such thing, and all the services that offer
Chris> site checking of this sort are prohibitively expensive for me (my
Chris> sites are zero income and either personal or offering free utilities
Chris> and information). If anyone has done this, or something similar, I'd
Chris> love to hear if you were willing to share it. Failing that, I think I
Chris> will have to create this, but I know it will take me days as this
Chris> isn't my area of R expertise and as, to be brutally honest, I'm a
Chris> pretty poor programmer. If I go that way, I'm sure people may be able
Chris> to point me to things I may (legitimately) be able to recycle in
Chris> parts to help construct this.
Chris> Thanks in advance,
Chris> Chris
Chris> --
Chris> Chris Evans <ch...@psyctc.org>   Skype: chris-psyctc
Chris> Visiting Professor, University of Sheffield <chris.ev...@sheffield.ac.uk>
Chris> I do some consultation work for the University of Roehampton
Chris> <chris.ev...@roehampton.ac.uk> and other places, but this
Chris> <ch...@psyctc.org> remains my main Email address.
Chris> I have "semigrated" to France, see:
Chris> https://www.psyctc.org/pelerinage2016/semigrating-to-france/
Chris> If you want to book to talk, I am trying to keep that to Thursdays,
Chris> and my diary is now available at:
Chris> https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Chris> Beware: French time, generally an hour ahead of the UK. That page
Chris> will also take you to my blog, which started with earlier joys in
Chris> France and Spain!

Not an answer, but perhaps two pointers/ideas:

1) Since you know cron, I suppose you work on a Unix-like system, and you
   likely have a programme called 'wget' either installed or easy to
   install. 'wget' has a '--mirror' option, which allows you to mirror a
   website.

2) There is tools::md5sum() for computing checksums. You could store
   those in a file and check for changes in the files' content (e.g. via
   'diff').

regards
Enrico

--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.