Re: python-list@python.org

Steven D'Aprano Wed, 15 Jan 2014 16:42:04 -0800

On Wed, 15 Jan 2014 02:25:34 +0100, Florian Lindner wrote:

> Am Dienstag, 14. Januar 2014, 17:00:48 schrieb MRAB:
>> On 2014-01-14 16:37, Florian Lindner wrote:
>> > Hello!
>> >
>> > I'm using python 3.2.3 on debian wheezy. My script is called from my
>> > mail delivery agent (MDA) maildrop (like procmail) through it's
>> > xfilter directive.
>> >
>> > Script works fine when used interactively, e.g. ./script.py <
>> > testmail but when called from maildrop it's producing an infamous
>> > UnicodeDecodeError:


What's maildrop? When using third-party tools or libraries, it's often 
helpful to give some detail on what they are and where they are from.


>> > File "/home/flindner/flofify.py", line 171, in main
>> >       mail = sys.stdin.read()

What's the value of sys.stdin? If you call this from your script:
 
print(sys.stdin)

what do you get? Is it possible that the mysterious maildrop is messing 
stdin up?
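A minimal sketch of that check, in case it helps. The helper name 
describe_stream and the in-memory stand-in stream are made up for 
illustration; the report goes to stderr on the assumption that stdout is 
what the filter hands back to maildrop.

```python
import io
import sys

def describe_stream(stream):
    """Summarise how a (text) stream decodes its input."""
    return {
        "type": type(stream).__name__,
        "encoding": getattr(stream, "encoding", None),
        "errors": getattr(stream, "errors", None),
    }

# Stand-in for a stdin that was opened with the ASCII codec.
fake_stdin = io.TextIOWrapper(io.BytesIO(b""), encoding="ascii")
info = describe_stream(fake_stdin)

# In the real script, report on the actual stdin instead.
print(describe_stream(sys.stdin), file=sys.stderr)
```

If the type reported is not TextIOWrapper, something has replaced stdin.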


>> > File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
>> >       return codecs.ascii_decode(input, self.errors)[0]
>> >
>> > Exception for example is always like
>> >
>> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position
>> > 869: ordinal not in range(128) 

That makes perfect sense: byte 0x82 is not in the ASCII range. ASCII is 
limited to byte values 0 through 127, and 0x82 is hex for 130. So the 
error message is telling you *exactly* what the problem is: your email 
contains a non-ASCII byte, with value 0x82.
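To see the same failure outside of maildrop, here is a small sketch. The 
helper first_non_ascii and the sample bytes are invented for illustration; 
only the decode call and the exception attributes are standard.

```python
def first_non_ascii(data):
    """Return (offset, byte value) of the first non-ASCII byte, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte.
        return exc.start, data[exc.start]
    return None

# Hypothetical sample: one byte with value 0x82 (130) among ASCII text.
sample = b"abc\x82def"
result = first_non_ascii(sample)
```

Running this over the raw bytes of the failing email should point at the 
same position as the traceback.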

How can you deal with this?

(1) "Oh gods, I can't deal with this, I wish the whole world was America 
in 1965 (except even back then, there were English characters in common 
use that can't be represented in ASCII)! I'm going to just drop anything 
that isn't ASCII and hope it doesn't mangle the message *too* badly!"

You need to set the error handler to 'ignore'. How you do that may depend 
on whether or not maildrop is monkeypatching stdin.


(2) "Likewise, but instead of dropping the offending bytes, I'll replace 
them with something that makes it obvious that an error has occurred."

Set the error handler to "replace". You'll still mangle the email, but it 
will be more obvious that you mangled it.
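Options (1) and (2) could look roughly like this. It is only a sketch: 
read_text is a made-up helper, the sample bytes are invented, and it 
assumes stdin still has its underlying byte stream (sys.stdin.buffer), 
which may not hold if something has replaced stdin.

```python
import io

def read_text(byte_stream, errors):
    """Decode a byte stream as ASCII using the given error handler."""
    wrapper = io.TextIOWrapper(byte_stream, encoding="ascii", errors=errors)
    return wrapper.read()

# Hypothetical input with two non-ASCII bytes in the middle.
sample = b"Gr\x82\x82tings"

ignored = read_text(io.BytesIO(sample), "ignore")    # bad bytes dropped
replaced = read_text(io.BytesIO(sample), "replace")  # bad bytes -> U+FFFD

# In the real script the byte stream would be sys.stdin.buffer, e.g.
#     mail = read_text(sys.stdin.buffer, "replace")
```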


(3) "ASCII? Why am I trying to read email as ASCII? That's not right. 
Email can contain arbitrary bytes, and is not limited to pure ASCII. I 
need to work out which encoding the email is using, but even that is not 
enough, since emails sometimes contain the wrong encoding information or 
invalid bytes. Especially spam, that's particularly poor. (What a 
surprise, that spammers don't bother to spend the time to get their code 
right?) Hmmm... maybe I ought to use an email library that actually gets 
these issues *right*?"

What does the maildrop documentation say about encodings and/or malformed 
email?


>> > I read mail from stdin "mail = sys.stdin.read()"
>> >
>> > Environment when called is:
>> >
>> > locale.getpreferredencoding(): ANSI_X3.4-1968 environ["LANG"]: C

For a modern Linux system to be using the C locale (which implies ASCII) 
is not a good sign. It's not 1970 anymore. I would expect it to be using 
UTF-8. But I don't think that's relevant to your problem (although a 
mis-configured system may make it worse).


>> > System environment when using shell is:
>> >
>> > ~ % echo $LANG
>> > en_US.UTF-8

That's looking more promising.


>> > As far as I know when reading from stdin I don't need an decode(...)
>> > call, since stdin has a decoding. 

That depends on what stdin actually is. Please print it and show us.

Also, can you do a visual inspection of the email that is failing? If 
it's spam, perhaps you can just drop it from the queue and deal with this 
issue later.


>> > I also tried some decoding/encoding
>> > stuff but changed nothing.

Ah, but did you try the right stuff? (Randomly perturbing your code in 
the hope that the error will go away is not a winning strategy.)


>> > Any ideas to help me?
>> >
>> When run from maildrop it thinks that the encoding of stdin is ASCII.
> 
> Well, true. But what encoding does maildrop actually gives me? It
> obviously does not inherit LANG or is called from the MTA that way.

Who knows? What's maildrop? What does its documentation say about 
encodings? The fact that it is using ASCII apparently by default does not 
give me confidence that it knows how to deal with 8-bit emails, but I 
might be completely wrong.


> I also tried:
> 
>         inData = codecs.getreader('utf-8')(sys.stdin) 
>         mail = inData.read()
> 
> Failed also. But I'm not exactly an encoding expert.

Failed how? Please copy and paste your exact exception traceback, in full.
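Without the traceback I can only guess, but one possible cause is that 
sys.stdin is already a text stream decoding as ASCII, so the error happens 
before the UTF-8 reader ever sees the data. A sketch of wrapping the 
underlying bytes instead (read_utf8 is a made-up name, and the sample bytes 
are invented):

```python
import codecs
import io

def read_utf8(byte_stream):
    """Decode a *byte* stream as UTF-8, marking undecodable bytes."""
    reader = codecs.getreader("utf-8")(byte_stream, errors="replace")
    return reader.read()

# Hypothetical input: valid UTF-8 for "café", then a stray 0x82 byte.
sample = b"caf\xc3\xa9 \x82 ok"
text = read_utf8(io.BytesIO(sample))

# In the real script the byte stream would be sys.stdin.buffer:
#     mail = read_utf8(sys.stdin.buffer)
```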

Ultimately, dealing with email is a hard problem. So long as you only 
receive 7-bit ASCII mail, you don't realise how hard it is. But the 
people who write the mail libraries -- at least the good ones -- know 
just how hard it really is. You can have 8-bit emails with no encoding 
set, or the wrong encoding, or the right encoding but with contents that 
include invalid bytes. It's not just spammers who get it wrong; 
legitimate programmers sending email also screw up.

Email is worse than the 90/10 rule. 90% of the effort is needed to deal 
with 1% of the emails. (More if you have a really bad spam problem.) You 
should look at a good email library, like the one in the standard library, 
which I believe gets most of these issues right.
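For what it's worth, a rough sketch of that approach with the std lib 
email package: parse the raw bytes, then decode the body using the charset 
the message declares. The sample message, the helper name body_text and 
the fallback to ASCII with "replace" are my own choices for illustration, 
and it only handles a simple non-multipart message.

```python
import email

def body_text(raw_bytes):
    """Parse a raw message and decode its (non-multipart) body."""
    msg = email.message_from_bytes(raw_bytes)
    payload = msg.get_payload(decode=True)          # body as bytes
    charset = msg.get_content_charset() or "ascii"  # declared charset, if any
    return msg, payload.decode(charset, errors="replace")

# Hypothetical 8-bit message declaring ISO-8859-1.
raw = (
    b"From: a@example.com\n"
    b"Subject: test\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Gr\xfc\xdfe\n"
)
msg, text = body_text(raw)

# In the real script, read bytes rather than text from stdin, e.g.
#     msg = email.message_from_binary_file(sys.stdin.buffer)
```

A declared charset that is wrong or unknown would still need handling; the 
sketch does not attempt that.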


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
