On 2017-06-06 01:34:37 +0200, Andries E. Brouwer wrote:
> It is also very easy to check for well-formed UTF-8. Well-formed UTF-8
> with short lines should perhaps be classified as "text".

As a UTF-8 zealot, I would like to see this stricter heuristic
applied, with all other files attached without text interpretation.
Rejecting any invalid UTF-8 by applying RFC 3629 strictly would be
reasonable for this use case.  It would also be reasonable to follow
file/libmagic's stance in file_looks_utf8() (at least in file-4.26 and
later) and conclude that any "odd" control characters are enough to
disqualify the file from being text.
   ftp://ftp.astron.com/pub/file/
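
For concreteness, here is a minimal sketch of what such a check could
look like in C.  It is not the code from mutt or from file; the
function name and the exact set of permitted control characters (TAB,
LF, CR, FF and ESC) are my own assumptions, chosen in the spirit of
file_looks_utf8().  The UTF-8 state machine itself follows the RFC
3629 grammar, rejecting overlong forms, surrogates and anything above
U+10FFFF:

#include <stddef.h>
#include <stdbool.h>

/* Sketch only: strict RFC 3629 validation plus an "odd control
 * character" test.  Returns true iff buf[0..len) is well-formed
 * UTF-8 and the only control bytes are TAB, LF, CR, FF or ESC. */
static bool looks_like_utf8_text(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len)
    {
        unsigned char c = buf[i];
        if (c < 0x80)                           /* ASCII */
        {
            if (c == 0x7f ||                    /* DEL */
                (c < 0x20 && c != '\t' && c != '\n' &&
                 c != '\r' && c != '\f' && c != 0x1b))
                return false;                   /* odd control byte */
            i += 1;
        }
        else if (c >= 0xc2 && c <= 0xdf)        /* two-byte sequence */
        {
            if (i + 1 >= len || (buf[i + 1] & 0xc0) != 0x80)
                return false;
            i += 2;
        }
        else if (c >= 0xe0 && c <= 0xef)        /* three-byte sequence */
        {
            if (i + 2 >= len ||
                (buf[i + 1] & 0xc0) != 0x80 ||
                (buf[i + 2] & 0xc0) != 0x80)
                return false;
            if (c == 0xe0 && buf[i + 1] < 0xa0)
                return false;                   /* overlong */
            if (c == 0xed && buf[i + 1] > 0x9f)
                return false;                   /* UTF-16 surrogate */
            i += 3;
        }
        else if (c >= 0xf0 && c <= 0xf4)        /* four-byte sequence */
        {
            if (i + 3 >= len ||
                (buf[i + 1] & 0xc0) != 0x80 ||
                (buf[i + 2] & 0xc0) != 0x80 ||
                (buf[i + 3] & 0xc0) != 0x80)
                return false;
            if (c == 0xf0 && buf[i + 1] < 0x90)
                return false;                   /* overlong */
            if (c == 0xf4 && buf[i + 1] > 0x8f)
                return false;                   /* above U+10FFFF */
            i += 4;
        }
        else
            return false;   /* 0x80-0xC1, 0xF5-0xFF never start a sequence */
    }
    return true;
}

Note that 0xC0/0xC1 and 0xF5-0xFF are rejected outright, which is
exactly what makes this stricter than pre-RFC-3629 decoders.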

The (hibin+ascii)/lobin >= 9 hack that has been in the code since at
least 2002 needs to go.  The question is what to do with a text file
that is correctly encoded for its locale but fails to be valid UTF-8.
Even a Windows-125[0-8]- or GB2312-encoded text file that is not valid
UTF-8 is "clearly" binary to me, but there are probably many mutt
users who would object to such semantics.  A general solution would be
to make the semantics of "text" depend on the locale, or to use a
conservative heuristic but allow it to be overridden via an option.
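
For reference, the ratio test I am objecting to amounts to something
like the following.  The struct and field names here are hypothetical,
not mutt's actual source; only the >= 9 ratio comes from the code:

#include <stdbool.h>

struct content_info
{
    long ascii;   /* printable 7-bit bytes              */
    long hibin;   /* bytes with the high bit set        */
    long lobin;   /* control bytes other than TAB/LF/CR */
};

/* The 2002-era hack: at least nine "plausible text" bytes per
 * control byte is taken as evidence that the file is text. */
static bool ratio_says_text(const struct content_info *info)
{
    if (info->lobin == 0)
        return true;
    return (info->hibin + info->ascii) / info->lobin >= 9;
}

The problem is visible at a glance: a binary file passes as long as
control bytes are rare, regardless of whether its high-bit bytes form
a valid sequence in any encoding at all.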

-- Andras Salamon                   and...@dns.net
