Re: Q: View as Windows-1252?

2007-08-03 Thread Kyle Wheeler

On Friday, August  3 at 03:37 PM, quoth Kai Grossjohann:
> I find that I (fairly) often get messages with no charset specified, or 
> with the wrong charset specified, so I do Ctrl-E on them and edit the 
> charset parameter to windows-1252, which seems to work well for most 
> cases.

What you probably want to do is set up some charset-hooks. For 
example:

charset-hook none windows-1252
charset-hook unknown windows-1252
charset-hook x-unknown windows-1252
charset-hook unknown-8bit windows-1252
charset-hook windows-1251 windows-1252

I have a bunch of hooks like this to fix known bad charsets. The 
'assumed_charset' feature is also really really useful:

set assumed_charset=us-ascii:windows-1252:utf-8

And finally, windows-1252 is a superset of iso-8859-1, and iso-8859-1 
is often incorrectly labelled by brain-dead webmail applications:

charset-hook iso-8859-1 windows-1252
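
The superset claim can be checked byte by byte. Here is a minimal Python sketch, assuming Python's codec tables are a fair stand-in for whatever conversion library mutt uses (`differing_bytes` is a name made up for this illustration):

```python
def differing_bytes(charset_a, charset_b, limit=256):
    """Return the byte values below `limit` that the two charsets decode differently."""

    def decode_or_none(value, charset):
        # A byte that the charset leaves undefined is reported as None.
        try:
            return bytes([value]).decode(charset)
        except UnicodeDecodeError:
            return None

    return [
        value
        for value in range(limit)
        if decode_or_none(value, charset_a) != decode_or_none(value, charset_b)
    ]


# The two charsets disagree only on 0x80-0x9F, where iso-8859-1 has control
# codes and windows-1252 has printable characters (euro sign, curly quotes).
print([hex(v) for v in differing_bytes("iso-8859-1", "windows-1252")])
```

Real-world text almost never contains the 0x80-0x9F control codes on purpose, so treating iso-8859-1 as windows-1252 is normally harmless.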

> Is it possible to automate this, so that I only have to press a 
> single key?

The way you'd have to do it is to pipe the message to a script that 
replaces that header.
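
A minimal sketch of such a script in Python, using the standard `email` package. The function name `force_charset` and the default of windows-1252 are choices made for this illustration, and how to bind it to a key in mutt is left open:

```python
import email
from email import policy


def force_charset(raw_message, charset="windows-1252"):
    """Return the message bytes with every text part relabelled as `charset`."""
    # compat32 keeps headers and 8-bit bodies as close to the input as possible.
    msg = email.message_from_bytes(raw_message, policy=policy.compat32)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # Adds the charset parameter, or overwrites an existing one.
            # A part with no Content-Type header at all becomes text/plain.
            part.set_param("charset", charset)
    return msg.as_bytes()
```

Used as a filter, it would read the message from `sys.stdin.buffer`, pass the bytes through `force_charset`, and write the result to `sys.stdout.buffer`.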

~Kyle
-- 
Families are like fudge... mostly sweet with a few nuts.
-- Unknown


Q: View as Windows-1252?

2007-08-03 Thread Kai Grossjohann
I find that I (fairly) often get messages with no charset specified, or
with the wrong charset specified, so I do Ctrl-E on them and edit the
charset parameter to windows-1252, which seems to work well for most
cases.

Is it possible to automate this, so that I only have to press a single
key?

What makes it difficult (for me) is that Ctrl-E displays a line that
contains more information than just the charset, and I can't see how to
automatically go to the right spot in that line.

tia,
Kai


Re: Q: View as Windows-1252?

2007-08-03 Thread Kai Grossjohann
On Fri, Aug 03, 2007 at 08:56:57AM -0500, Kyle Wheeler wrote:

> On Friday, August  3 at 03:37 PM, quoth Kai Grossjohann:
> > I find that I (fairly) often get messages with no charset specified, or 
> > with the wrong charset specified, so I do Ctrl-E on them and edit the 
> > charset parameter to windows-1252, which seems to work well for most 
> > cases.
> 
> What you probably want to do is set up some charset-hooks. For 
> example:
> 
> charset-hook none windows-1252
> charset-hook unknown windows-1252
> charset-hook x-unknown windows-1252
> charset-hook unknown-8bit windows-1252
> charset-hook windows-1251 windows-1252

I think this applies to bad charset specifications.  But in my case I
notice that Ctrl-E either shows me charset=utf-8 (where the message is
in Windows-1252), or charset=us-ascii (msg also in Windows-1252).

> I have a bunch of hooks like this to fix known bad charsets. The 
> 'assumed_charset' feature is also really really useful:
> 
> set assumed_charset=us-ascii:windows-1252:utf-8

I didn't use this because it says "only the first content is valid for
the message body".  But I guess it doesn't hurt to try.

Thanks,
Kai


Re: Q: View as Windows-1252?

2007-08-03 Thread Kyle Wheeler

On Friday, August  3 at 04:55 PM, quoth Kai Grossjohann:
> I think this applies to bad charset specifications.  But in my case 
> I notice that Ctrl-E either shows me charset=utf-8 (where the 
> message is in Windows-1252), or charset=us-ascii (msg also in 
> Windows-1252).

Ahh, interesting. Well, the latter is easily remedied; windows-1252 is 
*also* a superset of us-ascii (so this hook won't harm anything):

charset-hook us-ascii windows-1252

The other one is... well, downright malicious! Out of curiosity, what 
mail client is composing messages mislabelled utf8 like that?

>> I have a bunch of hooks like this to fix known bad charsets. The 
>> 'assumed_charset' feature is also really really useful:
>> 
>> set assumed_charset=us-ascii:windows-1252:utf-8
>
> I didn't use this because it says "only the first content is valid for 
> the message body".  But I guess it doesn't hurt to try.

Hmm, that's a badly worded man-page entry. I think it means one of two 
things (both of which are, I think, true): either it's saying that 
only the first charset that is valid for the message will be used 
(i.e. if windows-1252 is a valid way of interpreting the message, 
utf-8 will not be tried---this is especially important for Asian 
charsets, where in most cases there's no way to tell if the charset 
produced random garbage or not), OR it's saying that if your message 
comes in multiple parts, the charset that is found to be acceptable 
for the first part will be used for all subsequent parts.

But this won't work at all for you, I think, because it only applies 
to parts of the message without any charset indication, and your 
problem is incorrect charset labelling.

~Kyle
-- 
Only a mediocre person is always at his best.
   -- Somerset Maugham


Re: Q: View as Windows-1252?

2007-08-03 Thread Kai Grossjohann
On Fri, Aug 03, 2007 at 10:21:52AM -0500, Kyle Wheeler wrote:

> On Friday, August  3 at 04:55 PM, quoth Kai Grossjohann:
> > I think this applies to bad charset specifications.  But in my case 
> > I notice that Ctrl-E either shows me charset=utf-8 (where the 
> > message is in Windows-1252), or charset=us-ascii (msg also in 
> > Windows-1252).
> 
> Ahh, interesting. Well, the latter is easily remedied; windows-1252 is 
> *also* a superset of us-ascii (so this hook won't harm anything):
> 
> charset-hook us-ascii windows-1252
> 
> The other one is... well, downright malicious! Out of curiosity, what 
> mail client is composing messages mislabelled utf8 like that?

I confess that I have no idea.  Actually, I already had a value of
assumed_charset and of charset, perhaps that did it.  I had:

set charset=utf8
set assumed_charset=utf-8:windows-1252:iso-8859-1

Perhaps the order of windows-1252 and iso-8859-1 was reversed.  I
thought that this was a smart move, because if decoding as UTF-8 works,
then it's probably going to be UTF-8.

> >> I have a bunch of hooks like this to fix known bad charsets. The 
> >> 'assumed_charset' feature is also really really useful:
> >> 
> >> set assumed_charset=us-ascii:windows-1252:utf-8
> >
> > I didn't use this because it says "only the first content is valid for 
> > the message body".  But I guess it doesn't hurt to try.
> 
> Hmm, that's a badly worded man-page entry. I think it means one of two 
> things (both of which are, I think, true): either it's saying that 
> only the first charset that is valid for the message will be used 
> (i.e. if windows-1252 is a valid way of interpreting the message, 
> utf-8 will not be tried---this is especially important for Asian 
> charsets, where in most cases there's no way to tell if the charset 
> produced random garbage or not),

Hm.  But surely the same thing applies to the header?  So why was it
explicitly talking about the message body?

It seems strange to me to say that it tries all charsets for decoding
the header, even after finding a charset that works.  For then, if more
than one charset works, how would Mutt select one?

> OR it's saying that if your message 
> comes in multiple parts, the charset that is found to be acceptable 
> for the first part will be used for all subsequent parts.

Sounds plausible.

> But this won't work at all for you, I think, because it only applies 
> to parts of the message without any charset indication, and your 
> problem is incorrect charset labelling.

I think I am confused.  Perhaps the situation is this:

The message is sent without a charset indication.  But when I hit
Ctrl-E, a charset is included in the Content-Type header that I can
edit.

And perhaps Mutt was putting utf-8 there after Ctrl-E because that was
the first entry in assumed_charset.

But then, why didn't it try the whole list in the first place?  Then it
would have discovered the correct charset and wouldn't have displayed
question marks for the non-ascii characters.

Very strange situation.  Apologies for not investigating the situation
fully before asking here.

Kai


Re: Q: View as Windows-1252?

2007-08-03 Thread Kyle Wheeler

On Friday, August  3 at 11:29 PM, quoth Kai Grossjohann:
>> The other one is... well, downright malicious! Out of curiosity, 
>> what mail client is composing messages mislabelled utf8 like that?
>
> I confess that I have no idea.  Actually, I already had a value of 
> assumed_charset and of charset, perhaps that did it.  I had:
>
> set charset=utf8 
> set assumed_charset=utf-8:windows-1252:iso-8859-1
>
> Perhaps the order of windows-1252 and iso-8859-1 was reversed.  I 
> thought that this was a smart move, because if decoding as UTF-8 works, 
> then it's probably going to be UTF-8.

Ahhh, no, you're misunderstanding. Think of it this way: the computer 
sees email as just an array of numbers. We like to think of them as 
letters, but they're just numbers. The trick, of course, is that the 
computer has to decide what to display on screen for each number, and 
the problem is that the same number means different things in 
different charsets. So it can do a test and see "does this number mean 
something in this charset?". Better yet, "do all the numbers in this 
email mean something in this charset?" Thus, if there's a number that 
doesn't mean something in the charset (or means something 
undisplayable), it can say "aha! this is the wrong charset".
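
To make that concrete, here is a small Python sketch that asks several charsets what one and the same number means. It is only an illustration: `interpretations` is a made-up helper, and Python's codecs are assumed to be a reasonable stand-in for what mutt does internally.

```python
def interpretations(byte_value, charsets):
    """Map each charset name to the character it assigns to one byte value."""
    raw = bytes([byte_value])
    result = {}
    for name in charsets:
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The number means nothing (on its own) in this charset.
            result[name] = None
    return result


# The number 0xE4 is a different letter, or no letter at all,
# depending on which charset is asked.
print(interpretations(0xE4, ["us-ascii", "windows-1252", "windows-1251", "utf-8"]))
```

Here 0xE4 is "ä" in windows-1252 and "д" in windows-1251, while us-ascii and utf-8 both reject the lone byte.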

UTF-8 uses almost the entire set of numbers. In other words, almost 
*any* possible number is valid in UTF-8, and virtually every 
unlabelled email you get will thus be treated as if it was UTF-8, even 
though chances are most of them aren't UTF-8. Now, there are caveats to 
that, because UTF-8 requires specific sequences of numbers in some 
cases (so a message can be detected as not being UTF-8 in some cases), 
but most of the time, most English-speaking folks don't use characters 
that require specific sequences of numbers.

What you want to do instead with assumed_charset is to have it go in 
order of restriction. Start with us-ascii---that's what most English 
emails are sent in anyway, and it's also the most restrictive charset. 
If your email contains a number that's not in that charset, then mutt 
will know to try a different charset. Windows-1252 is a superset of 
us-ascii, so next it will try that. If that works, great, if not, then 
it gets to be time to check for utf-8.

Obviously, this isn't perfect, but the whole point of assumed_charset 
is to be somewhat better at guessing the *correct* charset for 
unlabeled emails.
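
The "first charset that fits wins" idea can be sketched in a few lines of Python. This is only an illustration: `first_valid_charset` is a made-up name, and Python's strict decoders are assumed to be a reasonable stand-in for whatever validity check mutt really performs.

```python
def first_valid_charset(data, assumed_charset):
    """Return the first charset in the colon-separated list that decodes `data`."""
    for name in assumed_charset.split(":"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # some byte has no meaning here; try the next candidate
        return name
    return None  # nothing in the list fits


order = "us-ascii:windows-1252:utf-8"
print(first_valid_charset(b"plain text", order))  # us-ascii
print(first_valid_charset(b"caf\xe9", order))     # windows-1252
```

In this sketch the position in the list matters a great deal: Python's windows-1252 table assigns a character to all but five byte values, so anything listed after it is reached only when one of those few undefined bytes shows up.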

It's also worth considering what the most common cases are. In the 
English speaking world, MOST email is sent in either us-ascii or 
windows-1252. Sometimes it's iso-8859-1, and sometimes it's been 
mislabelled as iso-8859-1. People using good email clients will label 
their charset, but those who use very old or very poorly-written 
clients might not (or might mislabel their charsets). These mail 
clients are unlikely to be doing anything complicated with their 
charsets, and are most likely to be assuming that everyone in the 
world uses some basic charset (such as windows-1252, or us-ascii). 
Mail clients that have put the time and effort into actually 
supporting utf-8 tend to be aware of the problem of unlabelled 
charsets, so it's highly unlikely that you'd find a UTF8-encoded 
message that was not labelled as UTF-8.

There are exceptions everywhere, of course, but those are your common 
cases.

>> Hmm, that's a badly worded man-page entry. I think it means one of 
>> two things (both of which are, I think, true): either it's saying 
>> that only the first charset that is valid for the message will be 
>> used (i.e. if windows-1252 is a valid way of interpreting the 
>> message, utf-8 will not be tried---this is especially important for 
>> Asian charsets, where in most cases there's no way to tell if the 
>> charset produced random garbage or not),
>
> Hm.  But surely the same thing applies to the header?  So why was it 
> explicitly talking about the message body?

Like I said: it's badly worded. The same thing applies to the header 
as well.

> And perhaps Mutt was putting utf-8 there after Ctrl-E because that 
> was the first entry in assumed_charset.

Huh. Possible. I've never paid enough attention to that detail of the 
magic mutt pulls for bad email.

> But then, why didn't it try the whole list in the first place?  Then 
> it would have discovered the correct charset and wouldn't have 
> displayed question marks for the non-ascii characters.

Indeed. Well, you may be dealing with a slightly different problem 
then. Sometimes the question marks are mutt's doing, and sometimes 
they're from your terminal (i.e. mutt told it to display character X, 
but the terminal's font doesn't have a picture of that character, so 
the terminal puts up an "I have no idea" character).

If mutt is having trouble, it will do one of two things: either it'll 
replace the trouble character with three question marks (rare), or 
it'll display the octal value of the character preceded by a 
backslash,

Re: Q: View as Windows-1252?

2007-08-03 Thread Kai Grossjohann
On Fri, Aug 03, 2007 at 05:10:57PM -0500, Kyle Wheeler wrote:

> Mail clients that have put the time and effort into actually 
> supporting utf-8 tend to be aware of the problem of unlabelled 
> charsets, so it's highly unlikely that you'd find a UTF8-encoded 
> message that was not labelled as UTF-8.

I think that this is a very good point.  So I don't really need to list
UTF-8 in assumed_charset at all.

Kai


Re: Q: View as Windows-1252?

2007-08-03 Thread Kyle Wheeler

On Saturday, August  4 at 01:37 AM, quoth Kai Grossjohann:
> On Fri, Aug 03, 2007 at 05:10:57PM -0500, Kyle Wheeler wrote:
>
>> Mail clients that have put the time and effort into actually 
>> supporting utf-8 tend to be aware of the problem of unlabelled 
>> charsets, so it's highly unlikely that you'd find a UTF8-encoded 
>> message that was not labelled as UTF-8.
>
> I think that this is a very good point.  So I don't really need to list 
> UTF-8 in assumed_charset at all.

True... on the other hand, it can't hurt anything, to be the last one 
in that list.

~Kyle
-- 
Families are like fudge... mostly sweet with a few nuts.
-- Unknown