Hi,

Although this is a ten-year-old message, it was the one I found quickly when looking to see what the current state of play on this might be.
On 2013-09-20 14:22, Robert Haas wrote:
> Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited. If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two?
Because that old thread came to an end without mentioning how the standard approaches that, it seemed worth adding, just to complete the record.

In the draft of the standard I'm looking at (which is also around a decade old), n'foo' is nothing but a handy shorthand for _csname'foo' (a syntax we do not accept), for some particular csname that was chosen when setting up the db. So really, the standard contemplates letting you have columns of arbitrary different charsets (CHAR(x) CHARACTER SET csname) and literals of arbitrary charsets (_csname'foo'). Then, as a bit of sugar, you get to pick which two of those charsets you'd like to have easy shorter ways of writing: 'foo' or n'foo', CHAR or NCHAR.

The grammar for csname is kind of funky. It can be nothing but <SQL language identifier>, which has the nice restricted form /[A-Za-z][A-Za-z0-9_]*/. But it can also be schema-qualified, with the schema of course being a full-fledged <identifier>. So yeah, to fully meet this part of the standard, the parser'd have to know that

  _U&"I am a schema nameZ0021" UESCAPE 'Z'/*hi!*/.LATIN1'foo'

is a string literal, expressing foo, in a character set named LATIN1, in some cutely-named schema. Never a dull moment.

Regards,
-Chap
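
P.S. For concreteness, a rough sketch of what that looks like spelled out in the standard's syntax (the _csname introducer being the part we don't accept); the choice of LATIN1 and UTF8 as the two "blessed" charsets below is purely hypothetical:

  -- a column whose charset is chosen explicitly (standard syntax)
  CREATE TABLE t (c CHAR(10) CHARACTER SET LATIN1);

  -- a literal tagged with an explicit charset; this is the introducer
  -- form we do not accept
  SELECT _LATIN1'foo';

  -- under the draft's reading, the plain and national forms are just
  -- sugar for whichever two charsets were picked when setting up the
  -- db; LATIN1 and UTF8 here are hypothetical choices
  SELECT 'foo';   -- then shorthand for _LATIN1'foo'
  SELECT n'foo';  -- then shorthand for _UTF8'foo'

  -- and csname may be schema-qualified, the schema being a full
  -- <identifier>, hence the example above
  SELECT _U&"I am a schema nameZ0021" UESCAPE 'Z'/*hi!*/.LATIN1'foo';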