On 4/27/23, David Christensen <dpchr...@holgerdanske.com> wrote:
> Please see the OP, step (d).
> On 4/26/23, Albretch Mueller <lbrt...@gmail.com> wrote:
>> a) encode the string name as base64
>> b) calculate the sha256sum of §a
>> c) use §b as file name (of course, leaving the original extension as it
>> is)
>> d) include a "§b_file_name.txt" plain text file descriptor whose only
>> content is the actual prehash name of that file.

I do that because base64 would (must?) work on any OS, and the conversion to and from any other encoding is straightforward.

As you suggested, I am friendlier to the idea of including hashes of the data payload, even though I do not think it is that important: the actual big problem that corpora research people have is files with exactly the same look and feel and the same content, but different hashes (for example, PDF files).

I have been thinking about a way to compute hashes that reflect more faithfully both the structural and the content similarity among files. Do you know of any way to do such a thing? The structural aspect should be "easy": it could be handled as DAGs of some sort of XPaths.

I was actually going to show you what I meant, but I was happy to see that I was wrong. I even waited to try it from some other access point. I have used this one-liner to show how google/youtube/NSA/"Vladimir Putin"/... was watermarking files for whatever reason, but it worked fine when I was trying to show it to you ;-)

  _YT_URI=EngW7tLk6R8; _OFL="${_YT_URI}_"$(date +%Y%m%d%H%M%S)".mp4";
  ./yt-dlp --verbose --format "mp4" --output "${_OFL}" -- "${_YT_URI}";
  ls -l "${_OFL}"; file --brief "${_OFL}"; time sha256sum "${_OFL}"

  -rwxrwxrwx 1 user user 828540 Aug 15 2022 EngW7tLk6R8_20230501185618.mp4
  ISO Media, MP4 v2 [ISO 14496-14]
  0b950b88667b5fec35f3dd54005c16e5e742c703a0c776ec6da11b60a4775ae6 EngW7tLk6R8_20230501185618.mp4

  -rwxrwxrwx 1 user user 828540 Aug 15 2022 EngW7tLk6R8_20230501185657.mp4
  ISO Media, MP4 v2 [ISO 14496-14]
  0b950b88667b5fec35f3dd54005c16e5e742c703a0c776ec6da11b60a4775ae6 EngW7tLk6R8_20230501185657.mp4

Both downloads have the same size and the same sha256sum, so this time there was no per-download difference.

Max Nikulin (12023-04-28):
> And you will quickly face servers that sends incorrectly Content-Type or
> intentionally put application/octet-stream with no sniff header to force
> browser to save the file instead of opening it e.g. in built-in PDF
> reader.

Even if it is not totally syntactic (so you can't functionally solve it with some code), this is a relatively manageable problem. You would:

 a) take note of the sites that do such things;
 b) sniff not only the HTTP headers, but also look at the file extension; and
 c) save the file to a temp repository for the Linux util "file" to be run on it ...

Out of those heuristics you should be able to strategize around such problems.

lbrtchx
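
P.S. A minimal sketch of the naming scheme in steps a) to d), in case it helps to see it spelled out. The function name hashed_rename is made up for illustration, and two details are my assumptions, not something fixed by the steps: the "string name" is taken to be the file name without its extension, and the descriptor stores the full original name.

```shell
# Hypothetical helper: rename FILE following steps a) to d).
# Assumption: the hashed "string name" is the name without its extension.
hashed_rename() {
    src=$1
    dir=$(dirname -- "$src")
    name=$(basename -- "$src")
    case $name in
        ?*.*) stem=${name%.*}; ext=.${name##*.} ;;   # "a.tar.gz" -> "a.tar" + ".gz"
        *)    stem=$name;      ext= ;;               # no extension (or a dotfile)
    esac
    # a) base64-encode the name; tr removes the line wraps base64 adds to long input
    b64=$(printf '%s' "$stem" | base64 | tr -d '\n')
    # b) sha256sum of the base64 string, keeping only the hex digest
    digest=$(printf '%s' "$b64" | sha256sum | cut -d ' ' -f 1)
    # c) use the digest as file name, leaving the original extension as it is
    mv -- "$src" "$dir/$digest$ext"
    # d) descriptor file whose only content is the pre-hash name (assumed: full name)
    printf '%s\n' "$name" > "$dir/${digest}_file_name.txt"
    # print the new path so callers can pick it up
    printf '%s\n' "$dir/$digest$ext"
}
```

Something like new_path=$(hashed_rename "some file.pdf") would then give back the renamed path.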
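
The by-eye comparison of the two sha256sum lines above could also be scripted. This is only a sketch; same_payload is a hypothetical name, and it compares whole-file digests, so it says nothing about files that merely look alike.

```shell
# Hypothetical helper: succeed (exit 0) when both files have the same sha256 digest.
# Returns 2 when either file cannot be read.
same_payload() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(sha256sum < "$1" | cut -d ' ' -f 1)
    b=$(sha256sum < "$2" | cut -d ' ' -f 1)
    [ "$a" = "$b" ]
}
```

For the two downloads listed above, same_payload EngW7tLk6R8_20230501185618.mp4 EngW7tLk6R8_20230501185657.mp4 would succeed, since their digests match.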
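
The heuristics a) to c) could be combined roughly as below. This is a sketch under assumptions: guess_type and its argument order are invented here, the extension table is a two-entry placeholder, and the file has already been saved to some temp location. It trusts the "file" util first, falls back to the extension, and only then to what the server declared.

```shell
# Hypothetical helper: guess_type HOST DECLARED_CONTENT_TYPE SAVED_FILE
# Prints the MIME type to trust; notes on stderr when the server's claim differs.
guess_type() {
    host=$1
    declared=$2
    path=$3
    # c) let the "file" util look at the actual bytes
    result=$(file --brief --mime-type "$path")
    if [ "$result" = "application/octet-stream" ]; then
        # b) content was inconclusive: fall back to the file extension
        case $path in
            *.pdf) result=application/pdf ;;
            *.mp4) result=video/mp4 ;;
            *)     result=${declared:-application/octet-stream} ;;
        esac
    fi
    # a) take note of the sites whose headers do not match the content
    if [ "$declared" != "$result" ]; then
        printf 'note: %s declared "%s" but content looks like "%s"\n' \
            "$host" "$declared" "$result" >&2
    fi
    printf '%s\n' "$result"
}
```

The stderr notes could be appended to a list of misbehaving sites, which is what step a) amounts to.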