On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version.  That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
>
> The question marks come from iconv.  Take a look at what this prints:
>
> iconv po/ja.po -f utf-8 -t us-ascii//translit
>
> If you use GNU libiconv, this will print a bunch of question marks.

Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion.  GNU
libc's iconv(), on the other hand, emits the question marks.

> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.

It depends on the quality of the //translit implementation.  GNU libiconv's
seems pretty good.  It gives up for Japanese or Russian characters, so you get
the English messages.  For Polish, GNU libiconv transliterates like this:

  msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
  msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with.  Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.

> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C.  Or just do the warning and keep
> logging.  Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages.  I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick.  This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset().  In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.  If
we can reasonably detect when gettext is supplying useless ????? messages,
that's good, too.
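
To make the last two ideas concrete, here is a rough sketch in C.  Only
bind_textdomain_codeset() is a real libintl call; the domain list, the
function names bind_dependency_codesets() and looks_like_failed_translit(),
and the "more than half question marks" threshold are assumptions made up for
illustration, not anything PostgreSQL or gettext provides.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/*
 * Hypothetical list of third-party message domains.  The real set depends
 * on which localized libraries the server actually links against.
 */
static const char *const dep_domains[] = {"libc"};

/*
 * Ask gettext to return messages of each listed domain in "codeset", so
 * that no conversion to the LC_CTYPE charset (and hence no //translit)
 * is needed.  Returns the number of domains for which the call
 * returned NULL.
 */
int
bind_dependency_codesets(const char *codeset)
{
    int     failures = 0;
    size_t  i;

    for (i = 0; i < sizeof(dep_domains) / sizeof(dep_domains[0]); i++)
    {
        if (bind_textdomain_codeset(dep_domains[i], codeset) == NULL)
            failures++;
    }
    return failures;
}

/*
 * Heuristic (an assumption, not a gettext feature): treat a translation
 * as useless when it has more '?' bytes than the original message and
 * '?' makes up more than half of its non-whitespace bytes.
 */
int
looks_like_failed_translit(const char *msgid, const char *msgstr)
{
    size_t      q_in = 0;
    size_t      q_out = 0;
    size_t      nonspace = 0;
    const char *p;

    assert(msgid != NULL && msgstr != NULL);

    for (p = msgid; *p; p++)
    {
        if (*p == '?')
            q_in++;
    }
    for (p = msgstr; *p; p++)
    {
        if (*p == '?')
            q_out++;
        if (*p != ' ' && *p != '\t' && *p != '\n')
            nonspace++;
    }
    return q_out > q_in && q_out * 2 > nonspace;
}
```

A caller might run bind_dependency_codesets("UTF-8") once the database
encoding is known, and fall back to the untranslated msgid whenever
looks_like_failed_translit(msgid, gettext(msgid)) returns nonzero.  Note that
the Polish output above would pass this check, which seems right, since it is
still readable.
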

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com