On Wed, Apr 26, 2023 at 3:42 AM Albretch Mueller <lbrt...@gmail.com> wrote:
>
>  This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
>
>  Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
>
>  I need to somehow map URL on the web to a local file, but you can't
> do that for two main reasons:
>
>  1) URLs are free text
>  2) which people take to their heart's content.
>
>  Take for example:
>
>  
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
>  that file and the pdf you would download I need to map to a local
> directory looking like: ... /pub/dokumen/qdownload/ ...
>
>  but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS would not swallow. There may be also funky rules
> regarding character sets and where in a string certain chars may be
> used; so, as a way to work around those kinds of problems I:
>
>  a) encode the string name as base64
>  b) calculate the sha256sum of §a
>  c) use §b as file name (of course, leaving the original extension as it is)
>  d) include a "§b_file_name.txt" plain text file descriptor whose only
> content is the actual prehash name of that file.
>
>
>  
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>  
> _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
>  _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
>  echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
>  _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
>  echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
>  if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
>   echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
>   _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
>   echo "// __ \$_SHA256: |${_SHA256}|"
>  fi
>
> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb  -|
>
>  I am trying to avoid funky characters and sha256sum --text still
> generates them!?!
>
>  I work like this because I need to replicate the original URL as a local
> path in a way that would be compatible with any file system.
>
>  Do you know of a better way to deal with such issues?
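
On the "funky characters" in the output quoted above: the trailing
"  -" is not part of the digest. sha256sum prints the hex digest, then a
separator, then the file name, and "-" stands for standard input. A
minimal sketch of keeping only the digest, assuming GNU coreutils; the
short _TXT value is a made-up placeholder, not the real file name:

```shell
#!/bin/sh
# Keep only the hex digest from the sha256sum output.
_TXT="example-file-name"   # hypothetical placeholder for the long name

# Reading from standard input, sha256sum prints "<64 hex chars>  -".
_SHA256_LINE=$(printf '%s' "${_TXT}" | sha256sum)

# Remove everything from the first space onward, leaving the digest.
_SHA256=${_SHA256_LINE%% *}

echo "// __ \$_SHA256: |${_SHA256}|"
```

The same result can be had with cut or awk; the parameter expansion
just avoids starting another process.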

There's no guarantee a URL will map onto a filesystem. I seem to
recall Stunnel tried to do that in a caching mode, but it had weird
corner cases (in addition to problems with filesystems that have
character set and path length limitations).

I think your best bet is to digest the URL into a short representation.
I suggest SipHash followed by Base64 or Base64URL encoding. SipHash
provides collision resistance and a uniform distribution, and it's fast.
SipHash has a very good pedigree since it was designed by Jean-Philippe
Aumasson and Daniel J. Bernstein. The final encoding keeps the name
within the printable character range. Base64URL is the safer choice for
file names, because plain Base64 can emit '/', which is a path
separator.
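
A rough sketch of the digest-then-encode idea in plain shell follows. As
far as I know coreutils has no SipHash command, so the sketch
substitutes sha256sum; that is a stand-in, not SipHash itself. The
function name url_to_name and the example.org URL are made up for
illustration.

```shell
#!/bin/sh
# Sketch: derive a short, file-system-safe name from a URL.
# SHA-256 is an assumption made for convenience (it ships with
# coreutils); it stands in for SipHash here.

# url_to_name is a hypothetical helper name.
url_to_name() {
  # Hex digest of the URL, without the trailing "  -" from sha256sum.
  hex=$(printf '%s' "$1" | sha256sum)
  hex=${hex%% *}

  # Turn each pair of hex digits back into one raw byte, Base64-encode
  # the 32 bytes, then switch to the Base64URL alphabet ('-' and '_'
  # instead of '+' and '/') and drop the '=' padding.
  printf '%s\n' "$hex" | fold -w 2 | while read -r byte; do
    printf "\\$(printf '%03o' "0x$byte")"
  done | base64 | tr '+/' '-_' | tr -d '=\n'
}

name=$(url_to_name "https://example.org/some/very/long/path.html")
echo "${name}.html"
```

The result is 43 characters from [A-Za-z0-9_-] regardless of how long
the URL is. Since the digest cannot be reversed, you would still keep
the original name in a side file, as in your step d).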

Jeff
