On 4/26/23 16:21, Albretch Mueller wrote:
On 4/26/23, David Christensen <dpchr...@holgerdanske.com> wrote:
I suggest hashing the document content rather than the URL. This would
work nicely for static documents.
What do you mean by "hashing the document content"?
2023-04-26 21:03:08 dpchrist@taz ~
$ touch foo
2023-04-26 21:03:12 dpchrist@taz ~
$ sha256sum foo
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 foo
In this case, the content is an empty string and the hexadecimal
encoding of the SHA256 hash is
"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855".
How would that help when what you are trying to do is cleanse and
canonicalize texts as best you can to find relationships among their
text segments?
lbrtchx
* Each unique text would be stored once regardless of how many URLs
link to it.
* If the content at a URL changes, the new content will have a new hash.
So, the new content will be saved and the old content will be
preserved (instead of the new content overwriting the old content).
* With regard to my response to the post by Nicolas George, a database
of metadata could benefit analysis regardless of the scheme used to name
content files.
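
To make the points above concrete, here is a rough sketch in Python of
naming content files by their SHA256 hash. It is only an illustration
under my own assumptions: the names save(), demo(), store, and index,
and the example.com URLs are made up, and the in-memory dict stands in
for whatever metadata database you would actually use.

```python
import hashlib
import tempfile
from pathlib import Path


def save(store: Path, index: dict, url: str, content: bytes) -> str:
    """Store content under its SHA256 hash and record the URL in the index."""
    digest = hashlib.sha256(content).hexdigest()
    path = store / digest
    if not path.exists():
        # Identical content is written only once, whatever URL it came from.
        path.write_bytes(content)
    history = index.setdefault(url, [])
    if not history or history[-1] != digest:
        # Changed content gets a new entry; the old file is left in place.
        history.append(digest)
    return digest


def demo():
    """Save the same text from two URLs, then a revision of one of them."""
    store = Path(tempfile.mkdtemp())  # hypothetical storage directory
    index = {}                        # hypothetical metadata: URL -> hashes seen
    a = save(store, index, "http://example.com/a", b"same text")
    b = save(store, index, "http://example.com/b", b"same text")
    c = save(store, index, "http://example.com/a", b"revised text")
    files = sorted(p.name for p in store.iterdir())
    return a, b, c, files, index


if __name__ == "__main__":
    a, b, c, files, index = demo()
    print(len(files), "files stored for 3 saves")
    print(index)
```

The two URLs with the same text share one file, and the revised text
is saved beside the original rather than over it. The index is where
the URL-to-content relationships live, so analysis does not depend on
the file names.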
David