On Wed, Sep 28, 2016 at 02:30:53PM -0700, Ski Kacoroski wrote:
> Hi,
>
> We host our website with Schoolwires - a provider specializing in websites
> for school districts. I have no access to the backend - just a simple CMS
> to modify pages, add users, etc. A teacher lost several years of work, and
> because they do not keep an audit trail or backups for very long, this work
> is gone forever and I have one very, very upset teacher.
>
> So do any of you have any great ideas, wonderful software, etc. that can
> scrape a website on a regular basis, so I could at least have provided the
> content back to the teacher? I will need to get the pages (including pages
> buried behind JavaScript and AJAX buttons and menus) along with attached
> files. This would also be very useful so we can see how the site changed
> over time.
Yes. Have a cron job run something like this:

    DAY=$(date +%Y%m%d)
    DOMAIN_LIST="comma-separated-list-of-domains-to-follow-links-on"
    URL="top-url"
    cd /path/to/web-archive || exit 1
    # -H enables spanning hosts; without it, -D's domain list has no effect
    # and wget stays on the starting host.
    wget --mirror -P "$DAY" -p -H -D "$DOMAIN_LIST" "$URL/"

There you go. If it's linked, it will be copied. (Note that wget only
follows links that appear in the HTML; content reachable only through
JavaScript or AJAX calls won't be picked up this way.) If it isn't
reachable via a link tree from the front page, it probably violates
accessibility standards...

-dsr-
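
For the scheduling side, here is a minimal sketch of a crontab entry; the
02:30 run time, the /usr/local/bin/site-snapshot.sh wrapper path (holding
the snippet above), and the log path are assumptions for illustration, not
from the original post:

    # Mirror the site nightly at 02:30, appending output to a log.
    # m  h  dom mon dow  command
    30  2  *   *   *     /usr/local/bin/site-snapshot.sh >> /var/log/site-snapshot.log 2>&1

Because each run writes into its own dated $DAY directory, comparing two
snapshots (e.g. with diff -r) gives you the how-the-site-changed-over-time
view the original poster asked about.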
