Victor Duchovni:
> On Wed, Apr 14, 2010 at 12:54:47PM -0400, Wietse Venema wrote:
>
> > > I am a bit reluctant at this time to assume that untyped data coming in
> > > that looks like UTF-8, really is UTF-8. Even if the LDAP lookup returns
> > > plausibly useful results, will the UTF-8 envelope survive related
> > > processing in Postfix?
> > >
> > > - PCRE lookups don't currently request UTF-8 support
> >
> > Meaning it will blow up, or what?
>
> When passing UTF-8 data to a regexp engine, we need to tell the engine
> that it is handling UTF-8 data, or it may produce match sub-expressions
> that consist of pieces of characters. Should "a.b" match a Unicode string
> where there is a multibyte character between "a" and "b"? What should ${1}
> be for "(a*.)" when "a" is followed by a multi-byte character?
>
> More generally, the issue is that we need a larger design in which we
> have a canonical data representation inside all the pieces of Postfix,
> and conversion logic at all system boundaries. This is much bigger than
> LDAP lookups.

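The two questions in the quoted text can be tried out directly. The
sketch below is only an illustration: it uses Python's re module as a
stand-in for a regexp engine, not PCRE or the Postfix pcre table code,
and the variable names and the is_valid_utf8 helper are made up for
the example. Matching on bytes corresponds to an engine that has not
been told the data is UTF-8; matching on str corresponds to an engine
that works on whole characters.

```python
import re

# "é" is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.
text = "a\u00e9b"
raw = text.encode("utf-8")  # b"a\xc3\xa9b"

# Byte-oriented matching: "." consumes exactly one byte.
byte_full = re.fullmatch(rb"a.b", raw)           # no match: two bytes sit between a and b
byte_group = re.match(rb"(a*.)", raw).group(1)   # captures "a" plus half of a character

# Character-oriented matching: "." consumes one code point.
char_full = re.fullmatch(r"a.b", text)           # matches
char_group = re.match(r"(a*.)", text).group(1)   # captures "a" plus the whole character


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

In the byte-oriented case the captured group is b"a\xc3", which is not
valid UTF-8 on its own; that is the "pieces of characters" problem
described above.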
Speaking of canonical representation, Postfix by design strips off
the encapsulation on input (CRLF in SMTP, newline in local submission,
and length+value in QMQP) and adds the encapsulation back upon
delivery. This is sufficient for 7BIT or 8BITMIME content as we
know it today.

Note that by doing this, Postfix normalizes only the end-of-line
convention, not the payload of the message. This means that with
well-formed mail, the SMTP input is guaranteed to be identical to
the SMTP output (ignoring the extra Received: header), and so on.

I don't think it is necessarily a good idea to "normalize" message
and envelope content into a canonical format (UTF-8 or otherwise),
do all processing in the canonical domain, and then do another
transformation on delivery.

More likely, one would transform a non-ASCII lookup string into the
character set of the lookup table mechanism and back, whatever that
character set might be, and return "not found" when the transformation
is not possible or when it is not implemented.

Although gateway MTAs have a choice to either downgrade 8BITMIME
to 7BIT or return mail as undeliverable, there is no equivalent
choice for envelope addresses with non-ASCII localparts. A gateway
into today's SMTP world would have to return envelopes with non-ASCII
localparts as undeliverable.

I would not be surprised if someone comes up with the equivalent
of RFC 2047 for SMTP envelope localparts, so that mail can be
tunneled through a legacy SMTP infrastructure, between systems that
support 8-bit usernames.

	Wietse
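
The per-table transformation suggested above (convert the lookup
string into the table's character set and back, and answer "not
found" when that cannot be done) could look roughly like the sketch
below. This is not Postfix code. The lookup function, its parameters,
and the sample table are hypothetical, and it assumes the query has
already been decoded to a Python str, which is itself one of the open
questions in this thread.

```python
from typing import Mapping, Optional


def lookup(table: Mapping[bytes, bytes], table_charset: str, key: str) -> Optional[str]:
    """Look up key in a table whose keys and values are stored in table_charset.

    Returns None ("not found") when the key cannot be expressed in the
    table's character set, when that character set is not implemented,
    or when the table simply has no such entry.
    """
    try:
        encoded_key = key.encode(table_charset)
    except (UnicodeEncodeError, LookupError):
        # UnicodeEncodeError: the transformation is not possible.
        # LookupError: the character set is unknown to this Python build.
        return None
    value = table.get(encoded_key)
    if value is None:
        return None
    try:
        # Transform the result back out of the table's character set.
        return value.decode(table_charset)
    except UnicodeDecodeError:
        return None


# Hypothetical table that stores its keys in ISO 8859-1 (Latin-1).
latin1_table = {"jos\u00e9@example.com".encode("latin-1"): b"jose-mailbox"}

found = lookup(latin1_table, "latin-1", "jos\u00e9@example.com")
not_representable = lookup(latin1_table, "latin-1", "\u5f20@example.com")
```

Here the first query finds its entry, while the second has a localpart
that Latin-1 cannot represent, so it comes back as "not found" instead
of raising an error or matching on a mangled key.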