On Wed, Apr 26, 2023 at 3:42 AM Albretch Mueller <lbrt...@gmail.com> wrote:
>
> This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
>
> Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
>
> I need to somehow map URL on the web to a local file, but you can't
> do that for two main reasons:
>
> 1) URLs are free text
> 2) which people take to their heart's content.
>
> Take for example:
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
> that file and the pdf you would download I need to map to a local
> directory looking like: ... /pub/dokumen/qdownload/ ...
>
> but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS would not swallow. There may be also funky rules
> regarding character sets and where in a string certain chars may be
> used; so, as a way to work around those kinds of problems I:
>
> a) encode the string name as base64
> b) calculate the sha256sum of §a
> c) use §b as file name (of course, leaving the original extension as it is)
> d) include a "§b_file_name.txt" plain text file decriptor which only
> content is the actual prehash name of that file.
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
> _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
> _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
> echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
> _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
> echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
> if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
>   echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
>   _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
>   echo "// __ \$_SHA256: |${_SHA256}|"
> fi
>
> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -|
>
> I am trying to avoid funky characters and sha256sum --text still
> generates them!?!
>
> I work like this because I need replicate the original URL as a local
> path in a way that would be compatible any file system.
>
> Do you know of a better way to deal with such issues?
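
On the "funky characters": the trailing " -" is not part of the digest. sha256sum prints the digest followed by the name of the file it read, and "-" stands for standard input. A minimal sketch of keeping only the hex field follows; digest_hex is a name made up here for illustration, and "abc" is just a short stand-in for the long file name.

```shell
#!/usr/bin/env bash
# digest_hex is a hypothetical helper name, not part of the original script.
# It prints only the 64-character hex digest of its first argument.
digest_hex() {
  local out
  out=$(printf '%s' "$1" | sha256sum)
  # sha256sum prints the digest, then whitespace and "-" for stdin;
  # drop everything from the first space onward.
  printf '%s\n' "${out%% *}"
}

_SHA256=$(digest_hex "abc")
echo "// __ \$_SHA256: |${_SHA256}|"
```

The hex digest uses only the characters 0-9 and a-f, so it should be usable as a file name as-is.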
There's no guarantee a URL will map onto a filesystem. I seem to
recall Stunnel tried to do that in a caching mode, but it had weird
corner cases (in addition to problems with filesystems that had
character set and path limitations).

I think your best bet is to digest the URL into a compact
representation. I suggest using SipHash+Base64 or Base64URL. SipHash
provides collision resistance, a uniform distribution, and it's fast.
SipHash has a very good pedigree since it was designed by
Jean-Philippe Aumasson and Daniel J. Bernstein. The final encoding
keeps you within the printable character range; prefer Base64URL for
file names, since plain Base64 can emit "/", which is a path
separator.

Jeff
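
A rough sketch of the digest-then-Base64URL idea, under some assumptions: it needs bash, it uses SHA-256 in place of SipHash because coreutils ships no SipHash tool, and both url_to_name and the example.org URL are placeholders invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: SHA-256 stands in for SipHash here.
# url_to_name is a hypothetical helper name.
url_to_name() {
  local hex i
  hex=$(printf '%s' "$1" | sha256sum)
  hex=${hex%% *}   # keep the hex digest, drop the stdin marker
  # Turn each pair of hex digits back into a raw byte, then encode.
  # Base64URL differs from Base64 in using '-' and '_' for '+' and '/';
  # the '=' padding and the trailing newline are removed as well.
  for ((i = 0; i < ${#hex}; i += 2)); do
    printf "\\x${hex:i:2}"
  done | base64 | tr '+/' '-_' | tr -d '=\n'
  printf '\n'
}

# Placeholder URL, not taken from the thread.
url_to_name "https://example.org/some/very-long-name.html"
```

This gives a 43-character name instead of 64 hex characters. The trade-off is that Base64URL mixes upper and lower case, so on a case-insensitive file system two different names could be treated as the same file; the plain hex digest does not have that problem.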