On Wed, Apr 14, 2010 at 12:54:47PM -0400, Wietse Venema wrote:

> > I am a bit reluctant at this time to assume that untyped data coming in
> > that looks like UTF-8, really is UTF-8. Even if the LDAP lookup returns
> > plausibly useful results, will the UTF-8 envelope survive related
> > processing in Postfix?
> >
> > - PCRE lookups don't currently request UTF-8 support
>
> Meaning it will blow up, or what?

When passing UTF-8 data to a regexp engine, we need to tell the engine
that it is handling UTF-8 data, or it may produce match sub-expressions
that consist of pieces of characters. Should "a.b" match a Unicode
string where there is a multi-byte character between "a" and "b"? What
should ${1} be for "(a*.)" when "a" is followed by a multi-byte
character?

More generally, the issue is that we need a larger design in which we
have a canonical data representation inside all the pieces of Postfix,
and conversion logic at all system boundaries. This is much bigger than
LDAP lookups.

> > - Logs don't support non-destructive recording of UTF-8
> >   envelopes.
>
> I expect that in the long term, UTF-8 will be the canonical
> representation of text in *NIX files, and that we should plan
> for that future.

Yes, of course. The LDAP IS_ASCII check will be easy to remove, and
LDAP supports Unicode, so that will be the easy part, but first we need
a "contract" that all inputs to the dictionary layer are UTF-8, and the
"dict_<your-type-here>" clients will need to ensure that this is so.

After that, we can just let the UTF-8 data flow into the database engine
if supported, or try to translate to the database charset if not.
Probably each table's charset is declared as part of the table
configuration, and the generic dictionary layer handles translation of
inputs and outputs...

Anyway, I am still reluctant to make use of UTF-8 without a larger
context in which this makes sense.

-- 
	Viktor.

P.S. Morgan Stanley is looking for a New York City based, Senior
Unix system/email administrator to architect and sustain our perimeter
email environment. If you are interested, please drop me a note.
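
The regexp questions raised above ("a.b" and "(a*.)" against a
multi-byte character) can be illustrated with a small sketch. It uses
Python's "re" module as a stand-in, not PCRE or any Postfix code: a
bytes pattern plays the role of an engine that was not told the input
is UTF-8, and a str pattern plays the role of one that was. The sample
character (U+00E9) and all variable names are assumptions chosen for
illustration.

```python
import re

# "a", then U+00E9 (two bytes in UTF-8: 0xC3 0xA9), then "b".
text = "a\u00e9b"
raw = text.encode("utf-8")

# Byte-oriented engine: "." consumes a single byte, so "a.b" cannot
# span the two-byte character and the match fails.
byte_match = re.fullmatch(rb"a.b", raw)

# Character-oriented engine: "." consumes one whole code point.
char_match = re.fullmatch(r"a.b", text)

# "(a*.)" in byte mode captures "a" plus only the FIRST byte of the
# multi-byte character, i.e. a piece of a character.
byte_group = re.match(rb"(a*.)", raw).group(1)
char_group = re.match(r"(a*.)", text).group(1)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print("byte-mode a.b matched:", byte_match is not None)
    print("char-mode a.b matched:", char_match is not None)
    print("byte-mode group 1:", byte_group, "valid UTF-8:", is_valid_utf8(byte_group))
    print("char-mode group 1:", repr(char_group))
```

In byte mode the captured sub-expression is not valid UTF-8 on its own,
so substituting it into a lookup result would yield a broken string.
With PCRE the analogous switch is the PCRE_UTF8 compile option; whether
and where a table should request it is exactly the design question
discussed in the message.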