On 27Apr2021 10:17, Kevin J. McCarthy <ke...@8t8.us> wrote:
>Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351)
>noted that an attachment 中文名称.txt, when launched via a mailcap
>viewer, created a tempfile "____________.txt".

Ouch.

>This is because of the sanitize_filename() functions, which have an
>allow-list of
>"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/"
>(with the '/' disabled for filenames).
>
>I'd be reluctant to change sanitization for the %{<parameter>} or %t
>expandos, but this does seem to be a bit strict for the filename.
>Oswald notes in the ticket that 8-bit characters are harmless at the
>system level (Oswald, feel free to reply/clarify - I'm not trying to
>put words in your mouth).

First remark: I think we should make clear that this only makes sense
when you're encoding filenames as UTF-8, where every byte of a multibyte
sequence has the high bit set. This isn't necessarily the case with
other encodings: Shift-JIS trail bytes, for example, can fall in the
ASCII range.

Second remark: As one who has long been less than enthused by sanitising
filenames, what exactly are we trying to accomplish when we sanitise a
filename?

- avoiding trickiness like whitespace and quote characters, which cause
  a little pain for users of the files in scripting settings?
- avoiding $ and ` et al, which cause hazards for the very careless
  script author? (but only if injected blindly)
- avoiding other shell punctuation like redirections? same issue
- avoiding escape paths such as absolute paths (/etc/passwd, oh root-run
  mutt user?) or ../blah to get out of the scratch area?

Without qualifying these objectives, "sanitisation" means little (or too
much, depending where you stand).

>On the one hand these are temp files, but Mutt already tries to
>preserve the filename to make for a nicer user interaction. It seems
>if we can preserve unicode filenames better we ought to do that too.

"Unicode filenames" isn't a meaningful term in UNIX, as the API is C
strings - byte sequences with NUL terminators. I suspect you mean "UTF-8
encoded names", which is the common modern default.

>What if we added an allow_8bit parameter to the function, that also
>passed through bytes with the 8th bit set?
>I'd keep this set off in all other invocations except the mailcap
>invocations.

Of course, the trickiness is that header things like filenames are,
IIRC, "bytes". Without a charset, do we inherently know anything about
them _as characters_?

I'm +1 for allow_8bit if we make it clear in the docs (and implement it
correctly in the code) that this refers to the in-filesystem byte
encoding of the filename. _Not_ hypothetical "Unicode". One person's
"Unicode" is another's Shift-JIS :-)

https://en.wikipedia.org/wiki/Mojibake

One has but to shift one's shell locale to see this play out.

Cheers,
Cameron Simpson <c...@cskk.id.au>
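
For illustration, here is a minimal sketch of the byte-level behaviour discussed above. It is written in Python for brevity and is not Mutt's actual C code: the function signature, the ALLOWED constant and the sample names are assumptions made for the example, modelled only on the allow-list quoted in the thread. It shows that a high-bit pass-through preserves a UTF-8 name intact, but can still damage a Shift-JIS name whose trail byte happens to be an ASCII character outside the allow-list.

```python
# Hypothetical sketch only; Mutt's real sanitize_filename() is C and may differ.

# The allow-list quoted in the thread, with '/' left out as for filenames.
ALLOWED = (
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    b"0123456789+@{}._-:%"
)


def sanitize_filename(name: bytes, allow_8bit: bool = False) -> bytes:
    """Replace every byte outside the allow-list with '_'.

    With allow_8bit=True, bytes with the high bit set (>= 0x80) are
    passed through as well. The function works on raw bytes and knows
    nothing about the encoding those bytes are in.
    """
    return bytes(
        b if (b in ALLOWED or (allow_8bit and b >= 0x80)) else ord("_")
        for b in name
    )


# UTF-8: every byte of a multibyte sequence has the high bit set.
utf8_name = "中文名称.txt".encode("utf-8")

# Shift-JIS: the second byte of this character is 0x5C, an ASCII backslash.
sjis_name = "表.txt".encode("shift_jis")

strict = sanitize_filename(utf8_name)                    # 12 underscores + ".txt"
relaxed = sanitize_filename(utf8_name, allow_8bit=True)  # unchanged
mangled = sanitize_filename(sjis_name, allow_8bit=True)  # trail byte replaced

print(strict)
print(relaxed == utf8_name)
print(sjis_name, "->", mangled)
```

The strict call reproduces the all-underscore tempfile name from the ticket, while the Shift-JIS case shows why the docs would need to describe allow_8bit in terms of bytes rather than characters.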