On Wed, 15 Jan 2014 02:25:34 +0100, Florian Lindner wrote:

> On Tuesday, 14 January 2014, 17:00:48, MRAB wrote:
>> On 2014-01-14 16:37, Florian Lindner wrote:
>> > Hello!
>> >
>> > I'm using python 3.2.3 on debian wheezy. My script is called from my
>> > mail delivery agent (MDA) maildrop (like procmail) through it's
>> > xfilter directive.
>> >
>> > Script works fine when used interactively, e.g. ./script.py <
>> > testmail but when called from maildrop it's producing an infamous
>> > UnicodeDecodeError:

What's maildrop? When using third party libraries, it's often helpful
to give some detail on what they are and where they are from.

>> >   File "/home/flindner/flofify.py", line 171, in main
>> >     mail = sys.stdin.read()

What's the value of sys.stdin? If you call this from your script:

    print(sys.stdin)

what do you get? Is it possible that the mysterious maildrop is messing
stdin up?

>> >   File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
>> >     return codecs.ascii_decode(input, self.errors)[0]
>> >
>> > Exception for example is always like
>> >
>> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position
>> > 869: ordinal not in range(128)

That makes perfect sense: byte 0x82 is not in the ASCII range. ASCII is
limited to byte values 0 through 127, and 0x82 is hex for 130. So the
error message is telling you *exactly* what the problem is: your email
contains a non-ASCII byte, with value 0x82.

How can you deal with this?

(1) "Oh gods, I can't deal with this, I wish the whole world was
America in 1965 (except even back then, there were English characters
in common use that can't be represented in ASCII)! I'm going to just
drop anything that isn't ASCII and hope it doesn't mangle the message
*too* badly!"

You need to set the error handler to 'ignore'. How you do that may
depend on whether or not maildrop is monkeypatching stdin.

(2) "Likewise, but instead of dropping the offending bytes, I'll
replace them with something that makes it obvious that an error has
occurred."

Set the error handler to 'replace'. You'll still mangle the email, but
it will be more obvious that you mangled it.

(3) "ASCII? Why am I trying to read email as ASCII? That's not right.
Email can contain arbitrary bytes, and is not limited to pure ASCII. I
need to work out which encoding the email is using, but even that is
not enough, since emails sometimes contain the wrong encoding
information or invalid bytes. Especially spam, that's particularly poor.
(What a surprise, that spammers don't bother to spend the time to get
their code right?) Hmmm... maybe I ought to use an email library that
actually gets these issues *right*?"

What does the maildrop documentation say about encodings and/or
malformed email?

>> > I read mail from stdin "mail = sys.stdin.read()"
>> >
>> > Environment when called is:
>> >
>> > locale.getpreferredencoding(): ANSI_X3.4-1968
>> > environ["LANG"]: C

For a modern Linux system to be using the C locale is not a good sign.
It's not 1970 anymore. I would expect it to be using UTF-8. But I don't
think that's relevant to your problem (although a mis-configured system
may make it worse).

>> > System environment when using shell is:
>> >
>> > ~ % echo $LANG
>> > en_US.UTF-8

That's looking more promising.

>> > As far as I know when reading from stdin I don't need an decode(...)
>> > call, since stdin has a decoding.

That depends on what stdin actually is. Please print it and show us.

Also, can you do a visual inspection of the email that is failing? If
it's spam, perhaps you can just drop it from the queue and deal with
this issue later.

>> > I also tried some decoding/encoding
>> > stuff but changed nothing.

Ah, but did you try the right stuff? (Randomly perturbing your code in
the hope that the error will go away is not a winning strategy.)

>> > Any ideas to help me?
>> >
>> When run from maildrop it thinks that the encoding of stdin is ASCII.
>
> Well, true. But what encoding does maildrop actually gives me? It
> obviously does not inherit LANG or is called from the MTA that way.

Who knows? What's maildrop? What does its documentation say about
encodings? The fact that it is apparently using ASCII by default does
not give me confidence that it knows how to deal with 8-bit emails, but
I might be completely wrong.

> I also tried:
>
> inData = codecs.getreader('utf-8')(sys.stdin)
> mail = inData.read()
>
> Failed also. But I'm not exactly an encoding expert.

Failed how?
Please copy and paste your exact exception traceback, in full.

Ultimately, dealing with email is a hard problem. So long as you only
receive 7-bit ASCII mail, you don't realise how hard it is. But the
people who write the mail libraries -- at least the good ones -- know
just how hard it really is. You can have 8-bit emails with no encoding
set, or the wrong encoding, or the right encoding but contents that
include invalid bytes. It's not just spammers who get it wrong;
legitimate programmers sending email also screw up.

Email is worse than the 90/10 rule. 90% of the effort is needed to deal
with 1% of the emails. (More if you have a really bad spam problem.) You
should look at a good email library, like the one in the std lib, which
I believe gets most of these issues right.


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
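
P.S. To make options (1) to (3) a little more concrete, here is a rough
sketch, not a tested fix for your setup. It assumes maildrop hands the
raw message to the script on stdin and that stdin is an ordinary file
object with a .buffer attribute. The helper names (decode_lossy,
parse_mail, read_mail_from_stdin) and the sample message are made up
for illustration only.

```python
import email
import sys


def decode_lossy(raw, errors="replace"):
    """Options (1) and (2): decode as ASCII with an explicit error handler.

    errors="ignore" silently drops non-ASCII bytes; errors="replace"
    substitutes a replacement character so the damage stays visible.
    """
    return raw.decode("ascii", errors=errors)


def parse_mail(raw):
    """Option (3): hand the undecoded bytes to the std lib email package."""
    return email.message_from_bytes(raw)


def read_mail_from_stdin():
    # sys.stdin.buffer is the underlying binary stream, so the locale's
    # text encoding (ASCII under LANG=C) is never applied to the input.
    return parse_mail(sys.stdin.buffer.read())


# Hypothetical message containing the same 0x82 byte as in the traceback.
sample = b"Subject: test\n\nprice: \x82 units\n"

print(repr(decode_lossy(sample, "ignore")))
print(repr(decode_lossy(sample, "replace")))

msg = parse_mail(sample)
print(msg["Subject"])
# decode=True asks for the body as bytes rather than as decoded text.
print(repr(msg.get_payload(decode=True)))
```

In the real script you would call read_mail_from_stdin() in place of
sys.stdin.read(). Whether that is enough depends on what maildrop
actually passes in, which is why printing sys.stdin is still the first
thing to check.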