On 4/27/23 01:04, Nicolas George wrote:
> David Christensen (12023-04-26):
>> My suggestion assumes that the URL => hash => content mapping is saved
>> somehow.
> That is an assumption that needed to be made explicit from the start.
> For example, save the content in a file named after the hash and
> save the URL in a file whose name is the hash plus a suffix. Finding a
> document by URL then becomes a grep(1) invocation.
> This is not very efficient.
Please see the OP, step (d).
You are free to propose better solutions.
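For concreteness, the file-per-hash scheme quoted above might look like the
following sketch. The store/ directory, the function names, and the choice of
sha256sum are all my own assumptions, not anything specified in the thread:

```shell
#!/bin/sh
# Sketch of the quoted scheme: content saved under its hash, URL saved
# in a sidecar file named hash-plus-suffix, lookup done with grep(1).
# "store/", save(), and lookup() are hypothetical names for illustration.
set -e
mkdir -p store

save() {  # save <url> <file>
    url=$1; file=$2
    hash=$(sha256sum "$file" | cut -d' ' -f1)
    cp "$file" "store/$hash"                   # content, named after its hash
    printf '%s\n' "$url" > "store/$hash.url"   # URL sidecar
}

lookup() {  # lookup <url>  -> prints the path of the stored content
    # grep(1) over every sidecar, as in the quoted example; -x matches
    # the whole line, -l prints the matching file name
    grep -l -F -x "$1" store/*.url | sed 's/\.url$//'
}

printf 'hello\n' > doc.txt
save "https://example.com/doc" doc.txt
lookup "https://example.com/doc"
```

Note that lookup() scans every sidecar file on each call, which is the
linear-cost behavior being criticized above.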
On 4/26/23 21:02, David Christensen wrote:
> Things get more interesting when you approach the problem as a database.
> Save the content wherever and put the metadata into a table -- content
> hash (primary key), URL, download timestamp, author, subject, title,
> keywords, etc.. Create fully inverted indexes. Create a search engine.
> Create a spider. Implementation could range from a CSV/TSV flat-file
> and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and
> beyond (NoSQL, N-tier). There are distributed file sharing systems
> based on such ideas.
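As a minimal sketch of the metadata-table idea, assuming sqlite3(1) is
installed -- the database name, table, and columns below are illustrative,
not anything prescribed in the thread:

```shell
#!/bin/sh
# Hypothetical metadata table: content hash as primary key, plus URL,
# timestamp, and descriptive fields. "meta.db" and all column names
# are made up for this example.
set -e
rm -f meta.db
sqlite3 meta.db <<'SQL'
CREATE TABLE doc (
    hash    TEXT PRIMARY KEY,   -- content hash
    url     TEXT NOT NULL,
    fetched TEXT NOT NULL,      -- download timestamp
    author  TEXT,
    title   TEXT
);
CREATE INDEX doc_url ON doc(url);  -- URL lookup without a full scan
INSERT INTO doc VALUES
    ('5891b5b5', 'https://example.com/doc',
     '2023-04-26T21:02:00Z', NULL, 'Example');
SQL
# Finding a document by URL is now an indexed query rather than a grep:
sqlite3 meta.db \
    "SELECT hash FROM doc WHERE url = 'https://example.com/doc';"
```

A fully inverted index over keywords or content could be layered on top of
this, for instance with SQLite's FTS5 module.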
David