On Wed, Sep 28, 2016 at 02:30:53PM -0700, Ski Kacoroski wrote:
> Hi,
>
> We host our website with Schoolwires - a provider specializing in websites
> for school districts. I have no access to the backend - just a simple CMS
> to modify pages, add users, etc. A teacher lost several years of work, and
> because they do not keep an audit trail or backups for very long, this work
> is gone forever and I have one very, very upset teacher.
>
> So do any of you have any great ideas, wonderful software, etc. that can
> scrape a website on a regular basis, so I could at least have provided the
> content back to the teacher? I will need to get the pages (including pages
> buried behind JavaScript and AJAX buttons and menus) along with attached
> files. This would also be very useful so we can see how the site changed
> over time.
Yes. Have a cron job run something like this:

    DAY=$(date +%Y%m%d)
    DOMAIN_LIST="comma-separated-list-of-domains-to-follow-links-on"
    URL="top-url"
    cd /path/to/web-archive || exit 1
    # -H enables spanning hosts; without it, -D's domain list has no effect
    # and wget stays on the starting host.
    wget --mirror -P "$DAY" -p -H -D "$DOMAIN_LIST" "$URL/"

There you go. If it's linked, it will be copied. (Note that wget only
follows links that appear in the HTML; content reachable only through
JavaScript or AJAX calls won't be picked up this way.) If it isn't
reachable via a link tree from the front page, it probably violates
accessibility standards...

-dsr-
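
For the scheduling side, here is a minimal sketch of a crontab entry; the
02:30 run time, the /usr/local/bin/site-snapshot.sh wrapper path (holding
the snippet above), and the log path are assumptions for illustration, not
from the original post:

    # Mirror the site nightly at 02:30, appending output to a log.
    # m  h  dom mon dow  command
    30  2  *   *   *     /usr/local/bin/site-snapshot.sh >> /var/log/site-snapshot.log 2>&1

Because each run writes into its own dated $DAY directory, comparing two
snapshots (e.g. with diff -r) gives you the how-the-site-changed-over-time
view the original poster asked about.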
