On Mon, Jun 05, 2017 at 03:35:28PM -0700, Kevin J. McCarthy wrote:
> On Mon, Jun 05, 2017 at 02:44:51PM -0700, Kevin J. McCarthy wrote:
> > On Mon, Jun 05, 2017 at 10:12:47PM +0200, Andries E. Brouwer wrote:
> > > Clearly, this 10% test is completely bogus.
> > >
> > > More in particular, I think that a file is binary if it contains even
> > > a single NUL byte.
> > >
> > > Should I propose a patch?
> >
> > I proposed a patch quite awhile ago for this same problem.  Let me see
> > if I can dig it up.
>
> It was originally for ticket 2933.  I'm attaching it here.  I think I
> didn't push it because one of the other committers suggested looking
> into libmagic instead, but I'd be interested if this fixes the problem
> for you.
> # HG changeset patch
> # User Kevin McCarthy <ke...@8t8.us>
> # Date 1496701910 25200
> #      Mon Jun 05 15:31:50 2017 -0700
> # Node ID 0053fd3b5296e024ff5821a0a697c1f445c7e85a
> # Parent  a11770c2137b4973efe77b4e9d7356f22d2ae5f7
> Make attachment type guessing more conservative. (closes #2933).
>
> When guessing the type of an attachment (with no mime type or
> extension), mutt currently considers 8-bit characters as "text" when
> calculating the percentage of text characters in the file.
>
> In some cases, such as for the sample attachment in this ticket, it leads
> to a binary executable being labelled as text/plain, which results in
> the attachment being corrupted when mailed.
>
> This patch considers 8-bit characters as binary for the calculation.  In
> general, it's probably better to guess wrong on the conservative side
> than possibly corrupt attachments.
>
> -    if (info->lobin == 0 || (info->lobin + info->hibin + info->ascii)/ info->lobin >= 10)
> +    if ((info->lobin == 0 && info->hibin == 0) ||
> +        (info->lobin + info->hibin + info->ascii) / (info->lobin + info->hibin) >= 10)

Yes, this fixes my problem.

On the other hand, some simple UTF-8 text files are now also treated
as binary. Perhaps "hibin" is a misnomer; the world is no longer
ASCII-only. Slowly, UTF-8 is becoming the most common encoding one meets.

My favorite solution would be to always guess on the conservative side
and never guess text/plain. But if one wants a reasonable guess, one
needs more data than is collected right now.

It is very easy to count NUL bytes separately and make the occurrence
of a single NUL byte sufficient for the choice "binary".
(That would have been the patch I planned to propose.)

It is also very easy to check for well-formed UTF-8.
Well-formed UTF-8 with short lines should perhaps be classified as "text".

Andries
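
For illustration, the two checks suggested above (any NUL byte means
"binary"; well-formed UTF-8 with short lines means "text") could be
sketched roughly as follows. This is not mutt code: guess_content(),
utf8_seq_len(), the enum and MAX_TEXT_LINE are hypothetical names, the
998-byte limit is only an assumption about what "short" should mean
(borrowed from the RFC 5322 line length limit), and the sketch assumes
the whole attachment is in one buffer.

```c
#include <stddef.h>

/* Hypothetical names throughout; none of this exists in mutt. */
enum guess { GUESS_BINARY, GUESS_TEXT };

/* Assumed meaning of a "short" line, in bytes, excluding the newline. */
#define MAX_TEXT_LINE 998

/* Length (1..4) of the well-formed UTF-8 sequence starting at p,
 * or 0 if the bytes are malformed or truncated; n = bytes available. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char c = p[0];
    unsigned long cp;
    size_t len, i;

    if (c < 0x80)
        return 1;                               /* plain ASCII */
    else if (c >= 0xC2 && c <= 0xDF) { len = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
    else
        return 0;   /* stray continuation byte, overlong C0/C1, or > F4 */

    if (n < len)
        return 0;                               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                           /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800)
        return 0;                               /* overlong encoding */
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return 0;                               /* overlong or out of range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                               /* UTF-16 surrogate */
    return len;
}

/* Guess "binary" on any NUL byte, malformed UTF-8, or overlong line;
 * otherwise guess "text". Other control characters are not examined. */
enum guess guess_content(const unsigned char *buf, size_t n)
{
    size_t i = 0, line_len = 0;

    while (i < n) {
        size_t len;

        if (buf[i] == 0)
            return GUESS_BINARY;                /* one NUL byte is enough */
        len = utf8_seq_len(buf + i, n - i);
        if (len == 0)
            return GUESS_BINARY;                /* not well-formed UTF-8 */
        if (buf[i] == '\n') {
            line_len = 0;
        } else {
            line_len += len;
            if (line_len > MAX_TEXT_LINE)
                return GUESS_BINARY;            /* line too long for text */
        }
        i += len;
    }
    return GUESS_TEXT;
}
```

Under these assumptions a Latin-1 or other 8-bit file is still guessed
as binary, which errs on the conservative side; only pure ASCII and
valid UTF-8 pass as text.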