On 27/04/2023 11:02, David Christensen wrote:
> Things get more interesting when you approach the problem as a database. Save the content wherever and put the metadata into a table -- content hash (primary key), URL, download timestamp, author, subject, title, keywords, etc. Create fully inverted indexes. Create a search engine. Create a spider. Implementation could range from a CSV/TSV flat file and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and beyond (NoSQL, N-tier). There are distributed file-sharing systems based on such ideas.
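The scheme described above can be sketched with SQLite from the Python standard library, using its FTS5 extension as the fully inverted index. All table and column names here are illustrative assumptions, not any particular tool's schema:

```python
import hashlib
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages (
    content_hash TEXT PRIMARY KEY,   -- SHA-256 of the saved content
    url          TEXT,
    downloaded   INTEGER,            -- Unix timestamp
    author       TEXT,
    title        TEXT
);
-- FTS5 virtual table serves as the fully inverted index over the text.
CREATE VIRTUAL TABLE pages_fts USING fts5(content_hash UNINDEXED, body);
""")

def save(url, body, author="", title=""):
    """Store metadata keyed by content hash and index the body for search."""
    h = hashlib.sha256(body.encode()).hexdigest()
    db.execute("INSERT OR IGNORE INTO pages VALUES (?, ?, ?, ?, ?)",
               (h, url, int(time.time()), author, title))
    db.execute("INSERT INTO pages_fts VALUES (?, ?)", (h, body))
    return h

def search(query):
    """Join full-text hits back to the metadata table -- the 'search engine'."""
    return db.execute("""
        SELECT p.url, p.title FROM pages_fts f
        JOIN pages p ON p.content_hash = f.content_hash
        WHERE pages_fts MATCH ?
    """, (query,)).fetchall()

save("https://example.org/a", "archiving web pages with sqlite", title="A")
save("https://example.org/b", "databases and spiders", title="B")
print(search("sqlite"))   # only the first page matches
```

A spider would just be a loop fetching URLs and calling save(); scaling past one machine is where the LAMP/NoSQL variants come in.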
I have never tried ArchiveBox, "open-source self-hosted web archiving":
https://github.com/ArchiveBox/ArchiveBox

This one allows saving a selected part of a page:
https://github.com/danny0838/webscrapbook/