On Mon, Jun 05, 2017 at 03:35:28PM -0700, Kevin J. McCarthy wrote:
> On Mon, Jun 05, 2017 at 02:44:51PM -0700, Kevin J. McCarthy wrote:
> > On Mon, Jun 05, 2017 at 10:12:47PM +0200, Andries E. Brouwer wrote:
> > > Clearly, this 10% test is completely bogus.
> > > 
> > > More in particular, I think that a file is binary if it contains even
> > > a single NUL byte.
> > > 
> > > Should I propose a patch?
> > 
> > I proposed a patch quite a while ago for this same problem.  Let me see
> > if I can dig it up.
> 
> It was originally for ticket 2933.  I'm attaching it here.  I think I
> didn't push it because one of the other committers suggested looking
> into libmagic instead, but I'd be interested if this fixes the problem
> for you.


> # HG changeset patch
> # User Kevin McCarthy <ke...@8t8.us>
> # Date 1496701910 25200
> #      Mon Jun 05 15:31:50 2017 -0700
> # Node ID 0053fd3b5296e024ff5821a0a697c1f445c7e85a
> # Parent  a11770c2137b4973efe77b4e9d7356f22d2ae5f7
> Make attachment type guessing more conservative. (closes #2933).
> 
> When guessing the type of an attachment (with no mime type or
> extension), mutt currently considers 8-bit characters as "text" when
> calculating the percentage of text characters in the file.
> 
> In some cases, such as for the sample attachment in this ticket, it leads
> to a binary executable being labelled as text/plain, which results in
> the attachment being corrupted when mailed.
> 
> This patch considers 8-bit characters as binary for the calculation.  In
> general, it's probably better to guess wrong on the conservative side
> than possibly corrupt attachments.

> -    if (info->lobin == 0 || (info->lobin + info->hibin + info->ascii) / info->lobin >= 10)
> +    if ((info->lobin == 0 && info->hibin == 0) ||
> +        (info->lobin + info->hibin + info->ascii) / (info->lobin + info->hibin) >= 10)

Yes, this fixes my problem.
On the other hand, some simple UTF-8 text files are now also treated as binary.
Perhaps "hibin" is a misnomer, the world is no longer ascii-only.

Slowly, UTF-8 is becoming the most common encoding one meets.
My favorite solution would be to always guess on the conservative side
and never guess text/plain.

But if one wants to make a reasonable guess, one needs more data than is
collected right now. It is very easy to count NUL bytes separately
and make the occurrence of a single NUL byte sufficient for the choice "binary".
(That would have been the patch I planned to propose.)
It is also very easy to check for well-formed UTF-8; well-formed UTF-8
with short lines should perhaps be classified as "text".
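
To make that concrete, here is a minimal sketch of the kind of check I mean.
The names and the buffer-based interface are made up, not mutt's actual code,
and it skips the overlong-sequence and short-line checks:

#include <stddef.h>

enum guess { GUESS_BINARY, GUESS_TEXT, GUESS_UNDECIDED };

/* Length of a UTF-8 sequence given its lead byte, or 0 if invalid. */
static int utf8_seq_len (unsigned char c)
{
  if (c < 0x80)           return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

/* Classify a buffer: any NUL byte means binary; NUL-free, well-formed
 * UTF-8 (which includes plain ASCII) means text; everything else is
 * left undecided so the existing percentage heuristic can run. */
enum guess guess_content (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
  {
    unsigned char c = buf[i];
    if (c == 0)
      return GUESS_BINARY;          /* a single NUL byte is enough */
    int n = utf8_seq_len (c);
    if (n == 0)
      return GUESS_UNDECIDED;       /* invalid lead byte: not UTF-8 */
    if (i + n > len)
      return GUESS_UNDECIDED;       /* truncated sequence at end of buffer */
    for (int k = 1; k < n; k++)
      if ((buf[i + k] & 0xC0) != 0x80)
        return GUESS_UNDECIDED;     /* missing continuation byte */
    i += n;
  }
  return GUESS_TEXT;
}

Anything this cannot decide would simply fall through to the existing
percentage test.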

Andries
