On Mar 18, 9:09 pm, Laszlo Nagy <[EMAIL PROTECTED]> wrote:
> Sorry, meanwhile i found that "email.Headers.decode_header" can be used
> to convert the subject into unicode:
>
> > def decode_header(self,headervalue):
> >     val,encoding = decode_header(headervalue)[0]
> >     if encoding:
> >         return val.decode(encoding)
> >     else:
> >         return val
>
> However, there are malformed emails and I have to put them into the
> database. What should I do with this:
>
> Return-Path: <[EMAIL PROTECTED]>
> X-Original-To: [EMAIL PROTECTED]
> Delivered-To: [EMAIL PROTECTED]
> Received: from 195.228.74.135 (unknown [122.46.173.89])
>     by shopzeus.com (Postfix) with SMTP id F1C071DD438;
>     Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
> Date: Tue, 18 Mar 2008 12:43:45 +0200
> Message-ID: <[EMAIL PROTECTED]>
> From: "Euro Dice Casino" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: With 2'500 Euro of Welcome Bonus you can't miss the chance!
> MIME-Version: 1.0
> Content-Type: text/html; charset=iso-8859-1
> Content-Transfer-Encoding: 7bit
>
> There is no encoding given in the subject but it contains 0x92. When I
> try to insert this into the database, I get:
>
> ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92
>
> All right, this probably was a spam email and I should simply discard
> it. Probably the spammer used this special character in order to prevent
> mail filters detecting "can't" and "2500". But I guess there will be
> other important (ham) emails with bad encodings. How should I handle this?

Maybe with some heuristics about the types of mistakes made by
do-it-yourself e-mail header constructors. For example, 'iso-8859-1'
often should be construed as 'cp1252':

>>> import unicodedata as ucd
>>> ucd.name('\x92'.decode('iso-8859-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ucd.name('\x92'.decode('cp1252'))
'RIGHT SINGLE QUOTATION MARK'
>>>
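
One way that heuristic might be packaged is sketched below. It is written in Python 3 syntax (the session above is Python 2), and the names decode_lenient and LIKELY_ACTUAL are hypothetical, as is the choice of which declared charsets to reinterpret. Treat it as a starting point under those assumptions, not a complete answer to malformed mail.

```python
import unicodedata

# Assumed alias table (hypothetical): charsets that do-it-yourself mailers
# often declare when the bytes are really in the Windows cp1252 code page.
LIKELY_ACTUAL = {
    "iso-8859-1": "cp1252",
    "latin-1": "cp1252",
    "us-ascii": "cp1252",
}


def decode_lenient(raw, declared="iso-8859-1"):
    """Decode raw header bytes to text, never raising on bad input."""
    name = (declared or "us-ascii").lower()
    codec = LIKELY_ACTUAL.get(name, name)
    try:
        return raw.decode(codec)
    except (UnicodeDecodeError, LookupError):
        # cp1252 leaves a few bytes (e.g. 0x81) undefined, and the declared
        # charset may be unknown to Python. latin-1 maps every byte to a
        # code point, so this last resort always succeeds.
        return raw.decode("latin-1")


subject = decode_lenient(b"you can\x92t miss the chance!")
print(unicodedata.name(subject[7]))   # name of the character decoded from 0x92
utf8_for_database = subject.encode("utf-8")
```

Once the header is text rather than raw bytes, encoding it as UTF-8 before the insert should avoid the "invalid byte sequence" error, because the stray 0x92 byte is never sent to the database as-is.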