On 2017-06-06 01:34:37 +0200, Andries E. Brouwer wrote:
> It is also very easy to check for well-formed UTF-8. Well-formed UTF-8
> with short lines should perhaps be classified as "text".
As a UTF-8 zealot, I would like to see this stricter heuristic applied, with all other files attached without text interpretation. Rejecting any invalid UTF-8 by applying RFC 3629 strictly would be reasonable for this use case. It would also be reasonable to follow file/libmagic's stance in file_looks_utf8() (at least in file-4.26 and later) and conclude that any "odd" control characters are enough to disqualify a file from being text.

ftp://ftp.astron.com/pub/file/

The (hibin+ascii)/lobin >= 9 hack that has been in the code since at least 2002 needs to go.

The question is what to do with a text file that is correctly encoded for its locale but fails to be valid UTF-8. Even a Windows-125[0-8] or GB2312 encoded text file that is not valid UTF-8 is "clearly" binary to me, but there are probably many mutt users who would object to such semantics.

A general solution would be to make the semantics of "text" depend on the locale, or to use a conservative heuristic but allow it to be overridden as an option.

-- 
Andras Salamon <and...@dns.net>