Re: mailbox misbehavior with non-ASCII

Peter J. Holzer Sat, 30 Jul 2022 13:22:22 -0700

On 2022-07-29 23:24:57 +0000, Peter Pearson wrote:
> The following code produces a nonsense result with the input 
> described below:
> 
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> # <class 'email.header.Header'>
> 
> The output is the desired "str" when the message file contains this:
> 
> To: [email protected]
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +0000
> Subject: Blah blah
> From: [email protected]
> X-DSPAM-Factors: a'b
> 
> xxx
> 
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type 
> "email.header.Header", and seems to contain inscrutable garbage.


It's not inscrutable to me, but then I remember when RFC 1522 was the
relevant RFC.

Calling h.encode() returns

=?unknown-8bit?b?YeKAmWI=?=

which is about the best result you can get. The character set is unknown
and the content (when decoded) is the bytes

61 e2 80 99 62

which is what your file contained (assuming you used UTF-8).
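
To make the decoding step concrete, here is a minimal sketch that takes the
encoded word apart by hand. It assumes the payload really is UTF-8, which
"unknown-8bit" by definition does not promise; the variable names are only
illustrative.

```python
import base64

encoded_word = "=?unknown-8bit?b?YeKAmWI=?="

# An encoded word has the shape =?charset?encoding?payload?=
# Splitting on "?" yields: "=", charset, encoding, payload, "="
_, charset, encoding, payload, _ = encoded_word.split("?")

# Encoding "b" means the payload is base64
raw = base64.b64decode(payload)

print(charset)               # unknown-8bit
print(raw.hex(" "))          # 61 e2 80 99 62
print(raw.decode("utf-8"))   # a’b  (assumption: the bytes are UTF-8)
```

The library cannot make that last step for you, because nothing in the
header says which character set the bytes are in.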

What would be nice is if you could get at that content directly. There
doesn't seem to be a documented method to do that. You can use h._chunks,
but as the _ in the name implies, that's an implementation detail which
might change in future versions (and it's not quite straightforward
either, although consistent with other parts of Python, I think).
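
A rough sketch of the h._chunks route, with the caveats above: _chunks is
private, and what is described here is only what current CPython appears to
do. The message is a hypothetical in-memory stand-in for your Maildir file,
parsed with email.message_from_bytes on the assumption that mailbox uses the
same default policy. Each chunk seems to be a (str, Charset) pair in which
the non-ASCII bytes are smuggled through as lone surrogates, so they have to
be encoded back with the "surrogateescape" error handler.

```python
import email
from email.header import Header

# Hypothetical stand-in for the message file, with U+2019 encoded as UTF-8
raw_message = (
    b"Subject: Blah blah\n"
    b"X-DSPAM-Factors: a\xe2\x80\x99b\n"
    b"\n"
    b"xxx\n"
)

msg = email.message_from_bytes(raw_message)
h = msg.get("X-DSPAM-Factors")

if isinstance(h, Header):
    # Private attribute: may change or disappear in future versions
    text, charset = h._chunks[0]
    # Turn the escaped surrogates back into the original bytes
    raw = text.encode("ascii", "surrogateescape")
else:
    # Pure-ASCII headers come back as a plain str
    raw = h.encode("ascii")

print(raw.hex(" "))
```

If you know your files are UTF-8, raw.decode("utf-8") then gives you the
text you expected in the first place.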

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

