Hi there,

Thank you for the detailed and considered response. I have replied
inline to your questions/comments.

On Tue, Apr 26, 2022 at 04:18:51PM -0500, Derek Martin wrote:
> On Mon, Apr 25, 2022 at 11:08:41AM +0000, Joel Buckley wrote:
> > Hi all,
> >
> > I have been using mutt for some time on a VT510 terminal (similar to
> > https://en.wikipedia.org/wiki/VT520), and enjoying it.
>
> An actual serial hardware terminal? Those are getting to be rare
> beasts indeed... ;-)

Indeed, it was very hard to track down a working unit for sale in the
past year or so. I'm really happy with how much I've been able to
integrate it into my daily workflow --- more information available at
https://blog.joelbuckley.com.au/2021/07/os-x-vt220-part-1 if you're
interested in the setup.

> > The display does not support UTF-8, so I had
> > LC_ALL="en_US.ISO8859-1" in my ~/.profile. This worked well for
> > mutt.
>
> So here you say the terminal doesn't support UTF-8...

Perhaps it's more accurate for me to say that the terminal doesn't
have a setting for UTF-8, and won't support the full range of
characters that could be sent along the serial port to the terminal. I
have the terminal set to ISO8859-1, and would like to restrict all
terminal applications to this character set if possible.

> > I then discovered that by changing mutt to load with
> > LC_ALL="en_US.UTF-8" that all was well.
>
> Huh? These two things seem to be contradictory...
>
> Also, I'm assuming this message was sent from Mutt NOT using your
> non-UTF-8-supporting terminal, since it is indeed encoded in UTF-8 and
> contains actual UTF-8 characters...

Yes, I sent the last email (and this one) from a modern machine which
supports UTF-8, and allows me to deal with longer emails more
comfortably. A 24-line screen has its limitations!

The idea behind me saying 'when I tell the machine to make outputs
UTF-8' is that somewhere, in the depths of the machine before I see my
output, it is converting characters in a better way than when I tell
the machine that it should limit itself to ISO8859-1.
The terminal then renders what the machine gives it over serial, and
most of it works either way, but these issues with quotes and ticks
remain.

> Anyway, getting back to the normal order of things...
>
> > However today I received an email with the string "Don=E2=80=99t
> > know when I will be there next.". This should display as something
> > like "Don't know where I will be there next.". In my mutt terminal,
> > it displayed:
>
> > "Don???t know when I will be there next".
>
> The issue is that there are no curly quotes in iso8859-1. Both
> Windows and Mac support a modified version of iso8859-1 that includes
> curly quotes, but unfortunately use different character codes for
> them. These character sets have their own names, but frequently mail
> applications are misconfigured to label them iso8859-1, because
> they're mostly identical and it works most places--as long as you're
> on the same platform as the sender.

Agree, and that's why I had made this display filter script. However,
and I think you really hit the nail on the head a little lower down,
it seems that the content is changed even before display_filter has a
chance to perform a search for a replacement.

> > Thinking this was odd, I dove into my filter.sh script, and
> > discovered that no end of hacking would enable me to filter out the
> > '=E2=80=99' before display --- there seemed to be some amount of
> > parsing before my filter got ahold of it. All that I could match on
> > was '???', despite being able to edit the content of the mail
> > itself, and see the string '=E2=80=99'. My filter line of
> > significance is:
>
> > output=`echo "$output" | sed "s/[’‘]/$(echo "27" | xxd -p -r)/g"`
> > This replaces 'smart quotes' with their ASCII equivalents.
>
> Given that you already have a display filter script, this isn't a
> horrible solution--assuming it actually worked. Note that you have a
> couple of harmless bugs though:
>
> 1. You've doubled up your double quotes, so actually 27 is not quoted.
>    It's harmless, but you don't need this anyway:
> 2. You needlessly fork two additional processes--one for the subshell
>    for echo, another for xxd. This can be greatly simplified to:
>
>      echo "$output" | sed "s/[’‘]/'/g"
>
>    Presumably you avoided this because the single quote is "special"
>    to the shell, but since in this case it is enclosed in double
>    quotes it loses its specialness.

Good pick-up, thanks. This was a recent change because I was tearing
my hair out when the smart quotes were appearing as '???' no matter
what I had in the script. I was concerned that I had written a
non-ASCII character, and wanted to be absolutely sure of the ASCII code
it was outputting during debugging. I will fix this script up once I
resolve the underlying charset issue.

> > Thinking that this would be a matter of ensuring that the filter
> > script had the right character support, I added "export
> > LC_ALL="en_US.UTF-8"" to the top of my filter script, however this
> > did nothing for me.
>
> Your filter script will run with the same locale as mutt, since it is
> a subprocess--it inherits the locale from its parent. So if mutt were
> indeed started with LC_ALL=en_US.UTF-8 then so too will your display
> filter. But you shouldn't need to do any of this...
>
> > After some messing around, it seemed that the
> > only way to get mutt to support the filtering of my problematic
> > string was to call mutt itself with the required character encoding
> > (UTF-8).
>
> What character set is the message itself encoded with (according to
> its headers)? If your terminal is set up right, and the charset on
> the message is correct, then Mutt should be taking care of this
> already for you by running iconv on the message. Basically, except in
> rare cases, if your terminal is set up properly, you shouldn't ever
> need to deal with character sets explicitly.
> > Is this correct and best-practice, or have I missed something here?
> > My installation is currently working by using the 'export
> > LC_ALL="en_US.UTF-8"' line in my ~/.profile, however this feels like
> > bad practice
>
> Because it is.
>
> But I think you may have one of the rare cases. I think what's
> happening is Mutt is correctly running iconv to convert your message
> from UTF-8 (which it most likely is in) to iso8859-1, which partially
> fails due to the annoying curly quotes, and then passes it to your
> filter script, which runs on that but it is already converted to '?'
> due to the character not having an equivalent in iso8859-1.
>
> Assuming that's true, the only thing I can think of is an old trick
> that iconv supports, which I vaguely remember using in Mutt *ages*
> ago. Try explicitly setting $charset *IN MUTT* to
> ISO-8859-1//TRANSLIT, which might or might not help. But it's likely
> to have other negative effects...

This absolutely rings true for me. During debugging of my
display_filter, it seemed that my sticking point was that changing the
$LC_ALL shell variable changed the input to the display_filter script.
This is ultimately what I am confused about and am hoping for a way
around --- understanding the black box that sits between a) mutt
opening the file from the disk and b) display_filter being called with
some input, whose characteristics seem to change based on the shell
value of $LC_ALL.

I have tried `set charset=ISO-8859-1//TRANSLIT`, thank you for the
suggestion. I was entirely unaware of this option. The result removed
one apostrophe entirely, converted a single opening smart quote to a
backtick, and left a closing single smart quote as '???'. Very mixed
results.

I think to crack this nut, it would be worthwhile to better understand
the iconv call that you mentioned. The existence of the iconv call is
news to me, and I wonder if there are any configurable parameters for
that? For example, is there a pre-iconv filter script?
This could be one way to solve my problems (by filtering out the
characters/strings that iconv seems to be choking on).

Thanks for your thorough reply, Derek. It is much appreciated and I
will continue to mull over the ideas above.

> --
> Derek D. Martin http://www.pizzashack.org/ GPG Key ID: 0xDFBEAD02
> -=-=-=-=-
> This message is posted from an invalid address. Replying to it will result in
> undeliverable mail due to spam prevention. Sorry for the inconvenience.
>

Regards,

--
Joel Buckley
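
P.S. For anyone following the thread, here is a minimal sketch of the
simplified filter step Derek suggested. The function name
normalize_quotes is my own placeholder and not anything mutt provides.
It assumes the filter really does receive the UTF-8 bytes of the smart
quotes, which so far only seems to happen when mutt is started under a
UTF-8 locale. I have written one sed expression per quote instead of a
bracket expression, on the assumption that matching each quote as a
literal byte sequence is less sensitive to the locale sed runs under.

```shell
#!/bin/sh
# Sketch only: normalize_quotes is a hypothetical helper name, not part
# of mutt. It reads text on stdin and writes it to stdout with the
# UTF-8 single smart quotes (U+2018 and U+2019) replaced by an ASCII
# apostrophe.
normalize_quotes() {
    # One expression per quote, so sed matches the literal byte
    # sequence of each character rather than a bracket expression.
    sed -e "s/’/'/g" -e "s/‘/'/g"
}

# Example: run a sample line through the helper.
printf '%s\n' "Don’t know when I will be there next." | normalize_quotes
```

In a display filter this would take the place of the echo/xxd subshell,
for example: output=$(printf '%s\n' "$output" | normalize_quotes). It
does nothing for the case discussed above where the quotes have already
been turned into '?' before the filter is called.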