Re: Q: View as Windows-1252?
On Friday, August 3 at 03:37 PM, quoth Kai Grossjohann:
> I find that I (fairly) often get messages with no charset specified, or
> with the wrong charset specified, so I do Ctrl-E on them and edit the
> charset parameter to windows-1252, which seems to work well for most
> cases.

What you probably want to do is set up some charset-hooks. For example:

    charset-hook none windows-1252
    charset-hook unknown windows-1252
    charset-hook x-unknown windows-1252
    charset-hook unknown-8bit windows-1252
    charset-hook windows-1251 windows-1252

I have a bunch of hooks like this to fix known bad charsets. The
'assumed_charset' feature is also really really useful:

    set assumed_charset=us-ascii:windows-1252:utf-8

And finally, windows-1252 is a superset of iso-8859-1 (as far as
printable characters go), and windows-1252 text is often incorrectly
labelled iso-8859-1 by brain-dead webmail applications:

    charset-hook iso-8859-1 windows-1252

> Is it possible to automate this, so that I only have to press a
> single key?

The way you'd have to do it is to pipe the message to a script that
replaces that header.

~Kyle
--
Families are like fudge... mostly sweet with a few nuts.
                                                 -- Unknown
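[Editor's sketch] The "script that replaces that header" could look roughly like the following. This is only a minimal sketch using Python's standard email package, not anything taken from the thread: the function name relabel_charset is hypothetical, it touches only the top-level Content-Type header (parts of a multipart message are left alone), and how it would be wired to a single key in mutt is left open here.

```python
import email


def relabel_charset(raw_message: bytes, charset: str = "windows-1252") -> bytes:
    """Return the message with its top-level Content-Type charset replaced.

    Hypothetical helper: only the outermost header is rewritten, the body
    bytes are passed through untouched.
    """
    msg = email.message_from_bytes(raw_message)
    # set_param() overwrites an existing charset parameter; if there is no
    # Content-Type header at all it starts from "text/plain" and adds one.
    msg.set_param("charset", charset)
    return msg.as_bytes()


# A message labelled us-ascii whose body is really windows-1252 ("café").
sample = b"Content-Type: text/plain; charset=us-ascii\nSubject: test\n\ncaf\xe9\n"
fixed = relabel_charset(sample)
print(email.message_from_bytes(fixed).get_content_charset())
```

To use it as a filter, one would read the message from sys.stdin.buffer and write the result to sys.stdout.buffer.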
Q: View as Windows-1252?
I find that I (fairly) often get messages with no charset specified, or
with the wrong charset specified, so I do Ctrl-E on them and edit the
charset parameter to windows-1252, which seems to work well for most
cases.

Is it possible to automate this, so that I only have to press a
single key?

What makes it difficult (for me) is that Ctrl-E displays a line that
contains more information than just the charset, and I can't see how to
automatically go to the right spot in that line.

tia,
Kai
Re: Q: View as Windows-1252?
On Fri, Aug 03, 2007 at 08:56:57AM -0500, Kyle Wheeler wrote:
> On Friday, August 3 at 03:37 PM, quoth Kai Grossjohann:
> > I find that I (fairly) often get messages with no charset specified, or
> > with the wrong charset specified, so I do Ctrl-E on them and edit the
> > charset parameter to windows-1252, which seems to work well for most
> > cases.
>
> What you probably want to do is set up some charset-hooks. For
> example:
>
>     charset-hook none windows-1252
>     charset-hook unknown windows-1252
>     charset-hook x-unknown windows-1252
>     charset-hook unknown-8bit windows-1252
>     charset-hook windows-1251 windows-1252

I think this applies to bad charset specifications. But in my case I
notice that Ctrl-E either shows me charset=utf-8 (where the message is
in Windows-1252), or charset=us-ascii (msg also in Windows-1252).

> I have a bunch of hooks like this to fix known bad charsets. The
> 'assumed_charset' feature is also really really useful:
>
>     set assumed_charset=us-ascii:windows-1252:utf-8

I didn't use this because it says "only the first content is valid for
the message body". But I guess it doesn't hurt to try.

Thanks,
Kai
Re: Q: View as Windows-1252?
On Friday, August 3 at 04:55 PM, quoth Kai Grossjohann:
> I think this applies to bad charset specifications. But in my case
> I notice that Ctrl-E either shows me charset=utf-8 (where the
> message is in Windows-1252), or charset=us-ascii (msg also in
> Windows-1252).

Ahh, interesting. Well, the latter is easily remedied; windows-1252 is
*also* a superset of us-ascii (so this hook won't harm anything):

    charset-hook us-ascii windows-1252

The other one is... well, downright malicious! Out of curiosity, what
mail client is composing messages mislabelled utf-8 like that?

>> I have a bunch of hooks like this to fix known bad charsets. The
>> 'assumed_charset' feature is also really really useful:
>>
>>     set assumed_charset=us-ascii:windows-1252:utf-8
>
> I didn't use this because it says "only the first content is valid for
> the message body". But I guess it doesn't hurt to try.

Hmm, that's a badly worded man-page entry. I think it means one of two
things (both of which are, I think, true): either it's saying that
only the first charset that is valid for the message will be used
(i.e. if windows-1252 is a valid way of interpreting the message,
utf-8 will not be tried---this is especially important for asian
charsets, where in most cases there's no way to tell if the charset
produced random garbage or not), OR it's saying that if your message
comes in multiple parts, the charset that is found to be acceptable
for the first part will be used for all subsequent parts.

But this won't work at all for you, I think, because it only applies
to parts of the message without any charset indication, and your
problem is incorrect charset labelling.

~Kyle
--
Only a mediocre person is always at his best.
                                                 -- Somerset Maugham
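[Editor's sketch] The "superset" claims behind the us-ascii and iso-8859-1 hooks can be checked byte by byte. The snippet below is a sketch that assumes Python's built-in codec tables are a fair stand-in for whatever conversion library mutt uses; the helper name agrees_with_cp1252 is made up for illustration.

```python
def agrees_with_cp1252(charset: str, byte_values: range) -> bool:
    """True if every byte in byte_values decodes identically in `charset`
    and in windows-1252 (Python's cp1252 codec)."""
    for value in byte_values:
        single = bytes([value])
        try:
            if single.decode(charset) != single.decode("windows-1252"):
                return False
        except UnicodeDecodeError:
            # The byte is undefined in one of the two charsets.
            return False
    return True


# All of us-ascii (0x00-0x7F) means the same thing in windows-1252.
print(agrees_with_cp1252("us-ascii", range(0x00, 0x80)))      # True
# So does the printable upper half of iso-8859-1 (0xA0-0xFF).
print(agrees_with_cp1252("iso-8859-1", range(0xA0, 0x100)))   # True
# The two differ in 0x80-0x9F: control codes in iso-8859-1, printable
# characters such as the euro sign and curly quotes in windows-1252.
print(agrees_with_cp1252("iso-8859-1", range(0x80, 0xA0)))    # False
```

Since the 0x80-0x9F range is almost never used deliberately in iso-8859-1 mail, treating such mail as windows-1252 is harmless in practice, which is the point of the hooks.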
Re: Q: View as Windows-1252?
On Fri, Aug 03, 2007 at 10:21:52AM -0500, Kyle Wheeler wrote:
> On Friday, August 3 at 04:55 PM, quoth Kai Grossjohann:
> > I think this applies to bad charset specifications. But in my case
> > I notice that Ctrl-E either shows me charset=utf-8 (where the
> > message is in Windows-1252), or charset=us-ascii (msg also in
> > Windows-1252).
>
> Ahh, interesting. Well, the latter is easily remedied; windows-1252 is
> *also* a superset of us-ascii (so this hook won't harm anything):
>
>     charset-hook us-ascii windows-1252
>
> The other one is... well, downright malicious! Out of curiosity, what
> mail client is composing messages mislabelled utf8 like that?

I confess that I have no idea. Actually, I already had a value of
assumed_charset and of charset, perhaps that did it. I had:

    set charset=utf8
    set assumed_charset=utf-8:windows-1252:iso-8859-1

Perhaps the order of windows-1252 and iso-8859-1 was reversed. I
thought that this was a smart move, because if decoding as UTF-8 works,
then it's probably going to be UTF-8.

> >> I have a bunch of hooks like this to fix known bad charsets. The
> >> 'assumed_charset' feature is also really really useful:
> >>
> >>     set assumed_charset=us-ascii:windows-1252:utf-8
> >
> > I didn't use this because it says "only the first content is valid for
> > the message body". But I guess it doesn't hurt to try.
>
> Hmm, that's a badly worded man-page entry. I think it means one of two
> things (both of which are, I think, true): either it's saying that
> only the first charset that is valid for the message will be used
> (i.e. if windows-1252 is a valid way of interpreting the message,
> utf-8 will not be tried---this is especially important for asian
> charsets, where in most cases there's no way to tell if the charset
> produced random garbage or not),

Hm. But surely the same thing applies to the header? So why was it
explicitly talking about the message body?

It seems strange to me to say that it tries all charsets for decoding
the header, even after finding a charset that works. For then, if more
than one charset works, how would Mutt select one?

> OR it's saying that if your message
> comes in multiple parts, the charset that is found to be acceptable
> for the first part will be used for all subsequent parts.

Sounds plausible.

> But this won't work at all for you, I think, because it only applies
> to parts of the message without any charset indication, and your
> problem is incorrect charset labelling.

I think I am confused. Perhaps the situation is this: The message is
sent without a charset indication. But when I hit Ctrl-E, a charset is
included in the Content-Type header that I can edit.

And perhaps Mutt was putting utf-8 there after Ctrl-E because that was
the first entry in assumed_charset.

But then, why didn't it try the whole list in the first place? Then it
would have discovered the correct charset and wouldn't have displayed
question marks for the non-ascii characters.

Very strange situation. Apologies for not investigating the situation
fully before asking here.

Kai
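[Editor's sketch] The symptom described here, placeholder characters where the non-ASCII letters should be, is what any decoder produces when Windows-1252 bytes are read under the wrong label. The snippet below only illustrates that effect with Python's codecs; it makes no claim about how mutt itself arrives at its question marks, and decode_as is a hypothetical helper name.

```python
def decode_as(raw: bytes, label: str) -> str:
    """Decode raw bytes under the given charset label, substituting
    U+FFFD (the replacement character) for anything that does not fit."""
    return raw.decode(label, errors="replace")


raw = b"caf\xe9"  # "café" as encoded in windows-1252

for label in ("utf-8", "us-ascii", "windows-1252"):
    print(f"{label:>12}: {decode_as(raw, label)}")
```

Only the windows-1252 label recovers the accented letter; under utf-8 or us-ascii the byte 0xE9 is not valid on its own and is replaced.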
Re: Q: View as Windows-1252?
On Friday, August 3 at 11:29 PM, quoth Kai Grossjohann:
>> The other one is... well, downright malicious! Out of curiosity,
>> what mail client is composing messages mislabelled utf8 like that?
>
> I confess that I have no idea. Actually, I already had a value of
> assumed_charset and of charset, perhaps that did it. I had:
>
>     set charset=utf8
>     set assumed_charset=utf-8:windows-1252:iso-8859-1
>
> Perhaps the order of windows-1252 and iso-8859-1 was reversed. I
> thought that this was a smart move, because if decoding as UTF-8 works,
> then it's probably going to be UTF-8.

Ahhh, no, you're misunderstanding. Think of it this way: the computer
sees email as just an array of numbers. We like to think of them as
letters, but they're just numbers. The trick, of course, is that the
computer has to decide what to display on screen for each number, and
the problem is that the same number means different things in
different charsets. So it can do a test and see "does this number mean
something in this charset?". Better yet, "do all the numbers in this
email mean something in this charset?" Thus, if there's a number that
doesn't mean something in the charset (or means something
undisplayable), it can say "aha! this is the wrong charset".

UTF-8 uses almost the entire set of numbers. In other words, almost
*any* possible number is valid in UTF-8, and virtually every
unlabelled email you get will thus be treated as if it was UTF-8, even
though chances are most of them aren't UTF-8. Now, there are caveats
to that, because UTF-8 requires specific sequences of numbers in some
cases (so a message can be detected as not being UTF-8 in some cases),
but most of the time, most English-speaking folks don't use characters
that require specific sequences of numbers.

What you want to do instead with assumed_charset is to have it go in
order of restriction. Start with us-ascii---that's what most English
emails are sent in anyway, and it's also the most restrictive charset.
If your email contains a number that's not in that charset, then mutt
will know to try a different charset. Windows-1252 is a superset of
us-ascii, so next it will try that. If that works, great, if not, then
it gets to be time to check for utf-8. Obviously, this isn't perfect,
but the whole point of assumed_charset is to be somewhat better at
guessing the *correct* charset for unlabeled emails.

It's also worth considering what the most common cases are. In the
English speaking world, MOST email is sent in either us-ascii or
windows-1252. Sometimes it's iso-8859-1, and sometimes it's been
mislabelled as iso-8859-1. People using good email clients will label
their charset, but those who use very old or very poorly-written
clients might not (or might mislabel their charsets). These mail
clients are unlikely to be doing anything complicated with their
charsets, and are most likely to be assuming that everyone in the
world uses some basic charset (such as windows-1252, or us-ascii).
Mail clients that have put the time and effort into actually
supporting utf-8 tend to be aware of the problem of unlabelled
charsets, so it's highly unlikely that you'd find a UTF8-encoded
message that was not labelled as UTF-8.

There are exceptions everywhere, of course, but those are your common
cases.

>> Hmm, that's a badly worded man-page entry. I think it means one of
>> two things (both of which are, I think, true): either it's saying
>> that only the first charset that is valid for the message will be
>> used (i.e. if windows-1252 is a valid way of interpreting the
>> message, utf-8 will not be tried---this is especially important for
>> asian charsets, where in most cases there's no way to tell if the
>> charset produced random garbage or not),
>
> Hm. But surely the same thing applies to the header? So why was it
> explicitly talking about the message body?

Like I said: it's badly worded. The same thing applies to the header
as well.

> And perhaps Mutt was putting utf-8 there after Ctrl-E because that
> was the first entry in assumed_charset.

Huh. Possible. I've never paid enough attention to that detail of the
magic mutt pulls for bad email.

> But then, why didn't it try the whole list in the first place? Then
> it would have discovered the correct charset and wouldn't have
> displayed question marks for the non-ascii characters.

Indeed. Well, you may be dealing with a slightly different problem
then. Sometimes the question marks are mutt's doing, and sometimes
they're from your terminal (i.e. mutt told it to display character X,
but the terminal's font doesn't have a picture of that character, so
the terminal puts up an "I have no idea" character). If mutt is having
trouble, it will do one of two things: either it'll replace the
trouble character with three question marks (rare), or it'll display
the octal value of the character preceded by a backslash,
Re: Q: View as Windows-1252?
On Fri, Aug 03, 2007 at 05:10:57PM -0500, Kyle Wheeler wrote:
> Mail clients that have put the time and effort into actually
> supporting utf-8 tend to be aware of the problem of unlabelled
> charsets, so it's highly unlikely that you'd find a UTF8-encoded
> message that was not labelled as UTF-8.

I think that this is a very good point. So I don't really need to list
UTF-8 in assumed_charset at all.

Kai
Re: Q: View as Windows-1252?
On Saturday, August 4 at 01:37 AM, quoth Kai Grossjohann:
> On Fri, Aug 03, 2007 at 05:10:57PM -0500, Kyle Wheeler wrote:
>
>> Mail clients that have put the time and effort into actually
>> supporting utf-8 tend to be aware of the problem of unlabelled
>> charsets, so it's highly unlikely that you'd find a UTF8-encoded
>> message that was not labelled as UTF-8.
>
> I think that this is a very good point. So I don't really need to list
> UTF-8 in assumed_charset at all.

True... on the other hand, it can't hurt anything, to be the last one
in that list.

~Kyle
--
Families are like fudge... mostly sweet with a few nuts.
                                                 -- Unknown
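[Editor's sketch] The "first charset that is valid wins" reading of assumed_charset discussed in this thread can be sketched as below. This is an illustration under assumptions, not mutt's implementation: validity is approximated by strict decoding with Python's codecs, and the names first_valid_charset, KYLE_ORDER and KAI_ORDER are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def first_valid_charset(
    raw: bytes, candidates: Sequence[str]
) -> Optional[Tuple[str, str]]:
    """Try each candidate in order and return (charset, text) for the
    first one that decodes every byte; later candidates are never tried."""
    for charset in candidates:
        try:
            return charset, raw.decode(charset)
        except UnicodeDecodeError:
            continue
    return None


KYLE_ORDER = ("us-ascii", "windows-1252", "utf-8")      # most restrictive first
KAI_ORDER = ("utf-8", "windows-1252", "iso-8859-1")     # the earlier setting

print(first_valid_charset(b"hello", KYLE_ORDER))         # plain ASCII
print(first_valid_charset(b"caf\xe9", KYLE_ORDER))       # windows-1252 bytes
print(first_valid_charset(b"caf\xc3\xa9", KYLE_ORDER))   # UTF-8 bytes
print(first_valid_charset(b"caf\xc3\xa9", KAI_ORDER))
```

The third call shows the trade-off of the restrictive-first order: unlabelled UTF-8 text is accepted as windows-1252 and utf-8 is never reached, which is acceptable only if, as argued above, unlabelled UTF-8 mail is rare.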