On 27/04/2023 11:02, David Christensen wrote:
Things get more interesting when you approach the problem as a database.  Save the content wherever and put the metadata into a table -- content hash (primary key), URL, download timestamp, author, subject, title, keywords, etc.  Create fully inverted indexes.  Create a search engine.  Create a spider.  Implementation could range from a CSV/TSV flat file and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and beyond (NoSQL, N-tier).  There are distributed file-sharing systems based on such ideas.
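As a minimal sketch of that idea (assuming a hypothetical schema; the column names and sample data here are illustrative, not from any real tool), SQLite plus its FTS5 extension gives you the metadata table, the inverted index, and the search query in a few lines:

```python
import hashlib
import sqlite3

# Hypothetical metadata table: content hash as primary key, plus URL,
# download timestamp, and bibliographic fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    content_hash TEXT PRIMARY KEY,  -- SHA-256 of the saved content
    url          TEXT,
    downloaded   TEXT,              -- ISO-8601 timestamp
    author       TEXT,
    title        TEXT
);
-- FTS5 builds a full-text (inverted) index over the searchable fields.
CREATE VIRTUAL TABLE pages_fts USING fts5(title, author, url);
""")

content = b"<html>...saved page content...</html>"
h = hashlib.sha256(content).hexdigest()
conn.execute("INSERT INTO pages VALUES (?, ?, ?, ?, ?)",
             (h, "https://example.com/", "2023-04-27T11:02:00",
              "Jane Doe", "Example page"))
conn.execute("INSERT INTO pages_fts VALUES (?, ?, ?)",
             ("Example page", "Jane Doe", "https://example.com/"))
conn.commit()

# The "search engine" part: a full-text MATCH against the inverted index.
rows = conn.execute(
    "SELECT title FROM pages_fts WHERE pages_fts MATCH 'example'"
).fetchall()
print(rows)
```

A spider would simply loop over fetched URLs, hash each body, and skip inserts whose hash already exists -- the primary key gives deduplication for free.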

I have never tried: "Open-source self-hosted web archiving"
https://github.com/ArchiveBox/ArchiveBox

This one lets you save a selected part of a page:
https://github.com/danny0838/webscrapbook/
