Re: Loosening mailcap filename sanitization to allow unicode filenames

Cameron Simpson Tue, 27 Apr 2021 15:48:27 -0700

On 27Apr2021 10:17, Kevin J. McCarthy <ke...@8t8.us> wrote:
>Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351) 
>noted that an attachment 中文名称.txt, when launched via a mailcap 
>viewer, created a tempfile "____________.txt".


Ouch.

>This is because of the sanitize_filename() functions, which have an 
>allow-list of 
>"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/" 
>(with the '/' disabled for filenames).
>
>I'd be reluctant to change sanitization for the %{<parameter>} or %t 
>expandos, but this does seem to be a bit strict for the filename. 
>Oswald notes in the ticket that 8-bit characters are harmless at the 
>system level (Oswald, feel free to reply/clarify - I'm not trying to 
>put words in your mouth).

First remark:

I think we should make clear that this only makes sense when you're 
encoding filenames as UTF-8, where every byte of a multibyte sequence has 
its high bit set. This isn't necessarily the case with other encodings.
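
A minimal sketch of that distinction, in case it helps. The helper and 
array names here are invented for illustration and are not anything in 
Mutt. It checks whether every byte of a sequence has the high bit set, 
using one UTF-8 character and one Shift-JIS character whose trailing byte 
falls in the ASCII range:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte in buf[0..len) has bit 7 set. */
static bool sketch_all_high_bit(const unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
  {
    if ((buf[i] & 0x80) == 0)
      return false;
  }
  return true;
}

/* U+4E2D encoded as UTF-8: all three bytes are >= 0x80. */
static const unsigned char utf8_zhong[] = { 0xE4, 0xB8, 0xAD };

/* U+8868 encoded as Shift-JIS: the trailing byte 0x5C is ASCII backslash. */
static const unsigned char sjis_hyou[] = { 0x95, 0x5C };
```

With the UTF-8 bytes the check passes; with the Shift-JIS bytes it fails, 
so an "8th bit set" test alone would still see a bare backslash there.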

Second remark:

As one who has long been less than enthused by sanitising filenames, 
what exactly are we trying to accomplish when we sanitise a filename?

- avoiding trickiness like whitespace and quote characters, which cause a 
  little pain for users of the files in scripting settings?

- avoiding $ and ` et al, which cause hazards for the very careless 
  script author? (but only if injected blindly)

- avoiding other shell punctuation like redirections? same issue

- avoiding escape paths such as absolute paths (/etc/passwd, oh root-run 
  mutt user?) or ../blah to get out of the scratch area?

Without qualifying these objectives, "sanitisation" means little (or too 
much, depending where you stand).
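
To illustrate how narrow one of these objectives can be when stated on its 
own, here is a rough sketch of a check for the last item only (escaping 
the scratch area). The function name is hypothetical and this is not how 
Mutt does it; it flags absolute paths and any ".." path component, and 
says nothing about the other objectives:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical check: does this name try to leave the scratch directory? */
static bool sketch_escapes_scratch_dir(const char *name)
{
  /* Absolute paths such as /etc/passwd. */
  if (name[0] == '/')
    return true;

  /* Walk the '/'-separated components looking for an exact "..". */
  const char *p = name;
  while (*p != '\0')
  {
    size_t len = strcspn(p, "/");
    if (len == 2 && p[0] == '.' && p[1] == '.')
      return true;
    p += len;
    if (*p == '/')
      p++;
  }
  return false;
}
```

Note that this operates purely on bytes, so it behaves the same whatever 
encoding the rest of the name is in.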

>On the one hand these are temp files, but Mutt already tries to 
>preserve the filename to make for a nicer user interaction.  It seems 
>if we can preserve unicode filenames better we ought to do that too.

"Unicode filenames" isn't a meaningful term in UNIX, as the API is C 
strings - byte sequences with NUL terminators. I suspect you mean "UTF-8 
encoded names", which is the common modern default.

>What if we added an allow_8bit parameter to the function, that also 
>passed through bytes with the 8th bit set?  I'd keep this set off in 
>all other invocations except the mailcap invocations.

Of course, the trickiness is that header things like filenames are, 
IIRC, "bytes". Without a charset, do we inherently know anything about 
them _as characters_?

I'm +1 for allow_8bit if we make it clear in the docs (and implement it 
correctly in the code) that this refers to the in-filesystem byte 
encoding of the filename. _Not_ hypothetical "Unicode". One person's 
"Unicode" is another's Shift-JIS :-)

    https://en.wikipedia.org/wiki/Mojibake

One has but to shift one's shell locale to see this play out.
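
For concreteness, a rough sketch of what I understand the allow_8bit 
proposal to mean. This is not Mutt's actual code: the function name and 
signature are made up, and only the allow-list string is taken from the 
quoted text above (with '/' left out, as for filenames). It works on bytes 
and makes no claim about what characters those bytes represent:

```c
#include <stdbool.h>
#include <string.h>

/* Allow-list as quoted in this thread, minus '/'. */
static const char sketch_safe_chars[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%";

/* Hypothetical sanitiser: replaces disallowed bytes with '_' in place. */
static void sketch_sanitize_filename(char *name, bool allow_8bit)
{
  for (unsigned char *p = (unsigned char *) name; *p != '\0'; p++)
  {
    /* Pass through any byte with the 8th bit set, if asked to. */
    if (allow_8bit && (*p & 0x80))
      continue;
    if (strchr(sketch_safe_chars, *p) == NULL)
      *p = '_';
  }
}
```

With allow_8bit off, each of the twelve UTF-8 bytes in the ticket's 
filename becomes an underscore, which matches the reported tempfile name. 
With it on, those bytes survive untouched, whatever encoding they are in.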

Cheers,
Cameron Simpson <c...@cskk.id.au>
